Core genome not found even with low soft core parameter in similar genera. Error in step06 & step08

Hello all,

I have successfully run panX on three genera individually, using `./panX.py -fn data/myGenus -sl myGenus -t 2`.

I then tried to run all three genera together, 71 genomes in total (most of them draft genomes). The run failed with this error:

    ====== starting step06: align genes in geneCluster by mafft and build gene trees
    Traceback (most recent call last):
      File "./panX.py", line 287, in <module>
        myPangenome.process_clusters()
      File "/disk3/pan-genome-analysis/scripts/pangenome_computation.py", line 180, in process_clusters
        myClusterCollector.estimate_raw_core_diversity()
      File "/disk3/pan-genome-analysis/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
        self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
      File "/disk3/pan-genome-analysis/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
        calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
      File "/disk3/pan-genome-analysis/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
        with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
    IOError: [Errno 2] No such file or directory: 'pan-genome-analysis/data/Chloro_Cocco_Prasino/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

Following the thread in #8, I reran these genomes with `-cg` set to 0.7, 0.5, 0.3, and 0.1. None of these runs succeeded.

The error messages were all similar to:

    ====== starting step08: run fasttree and raxml for tree construction
    fasttree time-cost: 0.26 minutes (15.88 seconds)
    RAxML tree optimization within the timelimit of 30 minutes
    RAxML branch length optimization and rooting
    Traceback (most recent call last):
      File "./panX.py", line 303, in <module>
        myPangenome.build_core_tree()
      File "/data/tools/pan-genome-analysis/scripts/pangenome_computation.py", line 200, in build_core_tree
        aln_to_Newick(self.path, self.folders_dict, self.raxml_max_time, self.raxml_path, self.threads)
      File "/data/tools/pan-genome-analysis/scripts/sf_core_tree_build.py", line 75, in aln_to_Newick
        shutil.copy('RAxML_result.branches', out_fname)
      File "/anaconda2/envs/panX/lib/python2.7/shutil.py", line 139, in copy
        copyfile(src, dst)
      File "/anaconda2/envs/panX/lib/python2.7/shutil.py", line 96, in copyfile
        with open(src, 'rb') as fsrc:
    IOError: [Errno 2] No such file or directory: 'RAxML_result.branches'

The raxml.log reads:

    Option -T does not have any effect with the sequential or parallel MPI version.
    It is used to specify the number of threads for the Pthreads-based parallelization

    RAxML can't, parse the alignment file as phylip file
    it will now try to parse it as FASTA file

    ERROR: Sequence EhV145 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV156 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV164 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV18 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV201 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV202 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV203 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV207 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV208 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV84 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV86 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV88 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV99B1 consists entirely of undetermined values which will be treated as missing data
    ERROR: Found 13 sequences that consist entirely of undetermined values, exiting...
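For anyone wanting to reproduce the RAxML complaint outside of panX, here is a minimal sketch that lists the sequences in a FASTA alignment made up only of gap or ambiguity characters. The helper names `read_fasta` and `all_undetermined` are hypothetical, and the `UNDETERMINED` character set is my assumption about what RAxML counts as undetermined, not something taken from panX or the RAxML source.

```python
# Assumption: these characters are what RAxML treats as "undetermined".
UNDETERMINED = set("-?NnXx")


def read_fasta(path):
    """Parse a FASTA file into a dict of {sequence name: sequence string}."""
    parts_by_name = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep only the first word of the header as the name.
                name = line[1:].split()[0]
                parts_by_name[name] = []
            elif name is not None:
                parts_by_name[name].append(line)
    return {n: "".join(parts) for n, parts in parts_by_name.items()}


def all_undetermined(seqs, undetermined=UNDETERMINED):
    """Return sorted names whose sequence has no determined character.

    An empty sequence is also reported, since it carries no data.
    """
    return sorted(
        name for name, seq in seqs.items()
        if all(char in undetermined for char in seq)
    )
```

Calling `all_undetermined(read_fasta(path))` on the core alignment that panX passes to RAxML (path not shown here, since it depends on the run folder) should return the same 13 EhV names if the alignment really contains no data for those strains.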

Unlike johannes in #8, I can't just delete these sequences: I am working with a fairly small number of genomes, and the sequences listed in raxml.log are all of the genomes from one genus.
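A possible reading of that log, which I have not confirmed: with a soft core threshold, a gene cluster that is missing from all 13 EhV genomes can still be present in up to 58 of 71 strains (about 0.82), so it could pass `-cg 0.7` and leave those strains as all-gap rows in the core alignment. The sketch below counts how many clusters would pass a given threshold. It assumes each cluster is a list of gene IDs of the form `strain|locus`, that a cluster counts as core when its strain fraction is greater than or equal to the threshold, and it ignores any single-copy requirement. The function names are hypothetical and none of this is taken from the panX code.

```python
def strains_in_cluster(gene_ids, sep="|"):
    """Return the set of strain names in a cluster.

    Assumes each gene ID looks like 'strain|locus' (an assumption).
    """
    return {gene_id.split(sep)[0] for gene_id in gene_ids}


def count_core_clusters(clusters, n_strains, thresholds):
    """Map each threshold to the number of clusters that would pass it.

    A cluster passes when (strains present / n_strains) >= threshold.
    Whether panX uses exactly this rule is an assumption.
    """
    fractions = [
        len(strains_in_cluster(cluster)) / float(n_strains)
        for cluster in clusters
    ]
    return {
        threshold: sum(1 for f in fractions if f >= threshold)
        for threshold in thresholds
    }


# Toy example with three hypothetical strains A1, B1, C1.
toy_clusters = [
    ["A1|g1", "B1|g7", "C1|g3"],  # present in all three strains
    ["A1|g2", "B1|g8"],           # present in two of three
    ["C1|g4"],                    # present in one of three
]
toy_counts = count_core_clusters(toy_clusters, 3, (1.0, 0.5, 0.3))
```

If the count at a threshold of 1.0 is zero for the combined run, there is no strict core across the three genera, and any soft core alignment would contain strains with no data.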

I then ran the three pairwise combinations of genera, each with `./panX.py -fn data/myGenus -sl myGenus -t 2`. Genera A and B together succeeded, A and C together succeeded, and B and C together succeeded, but A, B, and C together failed.

Since each of these pairwise analyses produced a core genome present in 100% of the strains, I would expect a core genome shared by all three genera. Any thoughts on what else I could try to fix this? Any help would be appreciated. Please let me know if I can provide more information. Thank you.
