neherlab / pan-genome-analysis

Processing pipeline for pan-genome visualization and exploration
http://pangenome.de
GNU General Public License v3.0

Core genome not found even with low soft core parameter in similar genera. Error in step06 & step08 #42

Closed: ashapter closed this issue 3 years ago

ashapter commented 3 years ago

Hello all,

I have successfully run panX analyses on three individual genera, each with `./panX.py -fn data/myGenus -sl myGenus -t 2`.

I then tried to run all three genera together, 71 genomes in total (the majority of which are draft genomes). It returned this error:

    ====== starting step06: align genes in geneCluster by mafft and build gene trees
    Traceback (most recent call last):
      File "./panX.py", line 287, in <module>
        myPangenome.process_clusters()
      File "/disk3/pan-genome-analysis/scripts/pangenome_computation.py", line 180, in process_clusters
        myClusterCollector.estimate_raw_core_diversity()
      File "/disk3/pan-genome-analysis/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
        self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
      File "/disk3/pan-genome-analysis/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
        calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
      File "/disk3/pan-genome-analysis/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
        with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
    IOError: [Errno 2] No such file or directory: 'pan-genome-analysis/data/Chloro_Cocco_Prasino/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

Following the thread in #8, I ran these genomes with `-cg` values of 0.7, 0.5, 0.3 and 0.1, without success.

Error messages were all similar to:

    ====== starting step08: run fasttree and raxml for tree construction
    fasttree time-cost: 0.26 minutes (15.88 seconds)
    RAxML tree optimization within the timelimit of 30 minutes
    RAxML branch length optimization and rooting
    Traceback (most recent call last):
      File "./panX.py", line 303, in <module>
        myPangenome.build_core_tree()
      File "/data/tools/pan-genome-analysis/scripts/pangenome_computation.py", line 200, in build_core_tree
        aln_to_Newick(self.path, self.folders_dict, self.raxml_max_time, self.raxml_path, self.threads)
      File "/data/tools/pan-genome-analysis/scripts/sf_core_tree_build.py", line 75, in aln_to_Newick
        shutil.copy('RAxML_result.branches', out_fname)
      File "/anaconda2/envs/panX/lib/python2.7/shutil.py", line 139, in copy
        copyfile(src, dst)
      File "/anaconda2/envs/panX/lib/python2.7/shutil.py", line 96, in copyfile
        with open(src, 'rb') as fsrc:
    IOError: [Errno 2] No such file or directory: 'RAxML_result.branches'

The raxml.log reads:

    Option -T does not have any effect with the sequential or parallel MPI version.
    It is used to specify the number of threads for the Pthreads-based parallelization

    RAxML can't, parse the alignment file as phylip file
    it will now try to parse it as FASTA file

    ERROR: Sequence EhV145 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV156 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV164 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV18 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV201 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV202 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV203 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV207 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV208 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV84 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV86 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV88 consists entirely of undetermined values which will be treated as missing data
    ERROR: Sequence EhV99B1 consists entirely of undetermined values which will be treated as missing data
    ERROR: Found 13 sequences that consist entirely of undetermined values, exiting...

Unlike johannes in #8, I can't just delete these sequences: I am working with a fairly small number of genomes, and the raxml.log is reporting every sequence of one genus.
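For anyone who wants to check an alignment before RAxML gets to it, here is a minimal sketch that lists the sequences made up entirely of undetermined characters. It is not part of panX. The function names `read_fasta` and `all_undetermined` are made up for illustration, and the set of characters counted as undetermined (`-`, `?`, `N`, `X`) is an assumption that fits a nucleotide alignment; in a protein alignment `N` is a real residue.

```python
# Characters counted as "undetermined" (an assumption, suited to nucleotide data).
UNDETERMINED = set("-?NnXx")


def read_fasta(lines):
    """Parse an iterable of FASTA text lines into {sequence name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = (line[1:].split() or [""])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return dict((n, "".join(parts)) for n, parts in records.items())


def all_undetermined(records):
    """Return sorted names of sequences with no determined character.

    Empty sequences are reported as well.
    """
    return sorted(
        name for name, seq in records.items()
        if all(char in UNDETERMINED for char in seq)
    )


# Toy alignment with made-up content, not taken from the real data set.
example = [">EhV86", "----NN--", ">strainA", "ACGT-ACG"]
print(all_undetermined(read_fasta(example)))  # prints ['EhV86']
```

To run it on a real file, pass an open file handle to `read_fasta`; the path of the alignment that panX hands to RAxML depends on the run folder.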

I then ran all three pairwise combinations of the genera using `./panX.py -fn data/myGenus -sl myGenus -t 2`. That is, the analysis of genera A and B together succeeded, A and C together succeeded, and B and C together succeeded, but A, B and C together failed.

Since each of these analyses generated a core genome present in 100% of the strains, there should be a core genome shared by all three genera. Any thoughts on what else I could try to fix this? Any help would be appreciated. Please let me know if there is more info I could provide. Thank you.
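One caveat is worth checking here: a core genome in every pairwise run does not by itself guarantee one across all three genera, because each pair may share a different set of gene clusters. The toy sketch below illustrates this with made-up strain and cluster names. The `core_clusters` function and its `threshold` argument are hypothetical and only loosely mirror the idea behind `-cg`; this is not how panX computes the core genome.

```python
def core_clusters(presence, strains, threshold=1.0):
    """Return clusters present in at least `threshold` (a fraction) of `strains`.

    `presence` maps a cluster name to the set of strains that carry it.
    """
    strains = set(strains)
    needed = threshold * len(strains)
    return sorted(
        cluster for cluster, members in presence.items()
        if len(members & strains) >= needed
    )


# Hypothetical strains for three genera and three toy gene clusters.
genus_a = ["a1", "a2"]
genus_b = ["b1", "b2"]
genus_c = ["c1", "c2"]
presence = {
    "cluster_ab": set(genus_a + genus_b),
    "cluster_ac": set(genus_a + genus_c),
    "cluster_bc": set(genus_b + genus_c),
}

print(core_clusters(presence, genus_a + genus_b))            # prints ['cluster_ab']
print(core_clusters(presence, genus_a + genus_b + genus_c))  # prints []
```

Each pair of genera has a strict core cluster in this toy data, yet no cluster is present in all six strains.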

ashapter commented 3 years ago

Update: Although every pairwise comparison found a core genome, the three-genus run failed because the diversity across the three genera was too high. On closer inspection of the clades, one genus turned out to be unusually diverse. Current taxonomy places all members of this clade together, but removing one of its subclades allowed the three-genus comparison to complete.
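For a rough way to spot an unusually diverse clade, the sketch below computes the mean pairwise p-distance (fraction of differing sites) within a group of aligned sequences. This is a simple stand-in and not the diversity estimate panX uses internally. The function names, the skipped characters and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations


def pairwise_distance(seq1, seq2, skip="-?NnXx"):
    """Fraction of differing sites among columns where both sequences are determined.

    Returns None when the two sequences share no determined column.
    """
    compared = 0
    differing = 0
    for char1, char2 in zip(seq1, seq2):
        if char1 in skip or char2 in skip:
            continue
        compared += 1
        if char1.upper() != char2.upper():
            differing += 1
    return float(differing) / compared if compared else None


def mean_diversity(aligned, names):
    """Mean pairwise distance over all pairs of `names` in the `aligned` dict."""
    distances = [
        pairwise_distance(aligned[a], aligned[b])
        for a, b in combinations(sorted(names), 2)
    ]
    distances = [d for d in distances if d is not None]
    return sum(distances) / len(distances) if distances else None


# Made-up aligned sequences; s3 stands in for a divergent subclade.
aligned = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTACGA"}
print(mean_diversity(aligned, ["s1", "s2"]))        # prints 0.125
print(mean_diversity(aligned, ["s1", "s2", "s3"]))  # prints 0.25
```

Comparing the value for a clade with and without a suspect subclade gives a quick indication of how much that subclade contributes to the overall diversity.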