steineggerlab / ufcg

UFCG: Universal Fungal Core Genes
https://ufcg.steineggerlab.com
GNU General Public License v3.0

ERROR! Tree file of gene COX2 not found : trees2/aligned_COX2_pro.zZ.fasta.treefile #21

Open salvatierra8 opened 11 months ago

salvatierra8 commented 11 months ago

Greetings,

I am getting this error when running the tree command, and the process does not seem to complete because of it. I have not been able to determine what is causing the error. I checked the COX2_pro.zZ.fasta file:

zZ7200270472082568702zZ MYFQDSATPNQEEDGQLRLLDTDTSIVAPVDTHIRFIVSAADVIHDFAIPSLGIKIDACPGRLNQVSALIEREGVFYGQCSELCGVAHSAMPIKLEVVSLPEFLE

zZ5232740519337231783zZ MGRESLVSPRRSRAASARRLLPGLSRRVLTSLLLFSRRRSYGDLSGVSPEICNNGLGCGSPLDTSVAPEGMLGVSRPPALVDSPTSSDDPPSVLPAQNISATHFYVGSNVVRNYGIFLQARNIPGQHFAVHTWSHQYTTSLTNEQVVAEIGWTMQILADNNFGLIPAHWRPRELVSSRFFTSHDSDSSPLPSPRPAYGDVDNRVRAIAREVFGLKTITWNPDEYEARVRGSKSPGLIPLEHVTSVEAVDGELIYVLTPPKYLLLTLYPFVNQSSSRPTLSSSPKAGTSPTSFVASPLAFLPFVLLFLADAKFFLSVFSFSLISGQKPGTRTPTPATSRRVSLQESPSSRLASPMAPLLPWLQQEHPLLDTQNGGAFCCGPEFSWELWERGEGGVGGVGCSWEHSRFGHWFRLGRNELDQIVLIFDTLPRHFDSKTPSELSRPLFLLQPPTLSFSSEDAALSIKQPLQADETDLLDNLVSRLPPLPSPPPSMSISLLPQEVLEPILHLAIQPSTTEGASILLVCTLWHNLGREKLYEHVTLSSQAAYDSYFLLGGSKASWRPLAQAQRTLDYQNLRSLHLRFGPLTKLPFSLSSSNPSPPYLPRFRNLKLIHLDLAKGSSYLSCRPKPMARRVAKLMGGFSPETMILARSSSAEISLSAVMPHHLRRTKYLLLASHHVSHLPSLTPQSCPAMIKNVGFQLLTGVLLPPSALAPLSETTKPSSTRSSPSSSAVGGLQPAVLRDSTFAVCHFTASSVCRMAAVVVSSRQDTSLASRFRSARPNVLFSPPRGLKPSSSQLHPHFPPACSPPVSHLSPRPSPPGYVSPPPSFKDTGSKSCRESRKAKPISSHLSHPFPFLFPSSLPQAFSSTPRSAPSAAIVGNSMLAASGAEGRADLSRWIETQPGSLPTTSTTTSTRRTLPSRSSLPPPRRMYLFPSISGRKGRGADLWSELTSSSSFFFLLLPITLRSSTSSNHPSSTPHSPLHPWILHHRIYHQRHHHQLHSLPLFKTPSTNAFSNLGSLTTSLLAHADEVGVSFDSYMVPDNEIADGQPRLLDVDARVVLPIETHTRFILSSTDVIHDWAVPSLGIKMDAMPGRLNQTSTLIERKGLFFGQCSELCGVYHGFMPIVVEAVELPEYLAWLLAQE

zZ7320208565470240394zZ MYFQDSATPNQEEDGQLRLLDTDTSIVAPVDTHIRFIVSAADVIHDFAIPSLGIKIDACPGRLNQVSALIEREGVFYGQCSELCGVAHSAMPIKLEVVSLPEFLE

The only odd thing I can discern is that one sequence is significantly longer than the others and shares less identity with them. What could be causing this error?
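A quick way to spot such an unusually long hit is to compare each sequence length with the median length in the file. The following is a minimal sketch, not part of UFCG; it assumes the file is standard FASTA with `>` header lines, and the function names and the `factor` threshold are hypothetical choices for illustration.

```python
from statistics import median


def read_fasta(text):
    """Parse FASTA text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            header = line[1:].split()[0]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def length_outliers(records, factor=3.0):
    """Return headers whose sequence is longer than factor x the median length."""
    if not records:
        return []
    med = median(len(seq) for seq in records.values())
    return [h for h, seq in records.items() if len(seq) > factor * med]
```

For example, `length_outliers(read_fasta(open("COX2_pro.zZ.fasta").read()))` would list the IDs of suspiciously long sequences, which can then be inspected by hand.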

endixk commented 11 months ago

Hello,

This looks like a false positive hit, which can be resolved by lowering the search sensitivity once I implement that feature (#19).

For now, could you please try to use different tree inference methods (FastTree or RAxML) and see if the issue persists? This will specify which step is failing, between alignment and tree inference.

salvatierra8 commented 11 months ago

Hello,

I forgot to update the thread. I did use FastTree and it worked. I will also try the other solution and will hopefully remember to comment on it. Thank you very much!

salvatierra8 commented 11 months ago

So, I just tested the new feature, but so far the default tree option does not work with any of the sensitivity options, at least for my data. I have rerun the tree using RAxML without any problem, with both the default and the lowest sensitivity options.

endixk commented 10 months ago

Could you check whether the same super long COX2 sequence was found in the profile generated with the lowest sensitivity option? If so, I will look into the reference gene database for these mitochondrial genes.

salvatierra8 commented 10 months ago

Yes, it did happen, though no longer with COX but with TUB1.

JWDebler commented 1 month ago

Hi, I'm getting the same error for another protein:

ERROR! Tree file of gene HEM12 not found : tree/aligned_HEM12_pro.zZ.fasta.treefile

I had a look at the fasta file and there is no false positive hit as in @salvatierra8's case. My sequences all line up nicely. I ran it again with RAxML and FastTree and it finished without problems.

endixk commented 1 month ago

I recently stumbled into this error using a smaller dataset and found the exact reason why IQ-TREE suffers.

IQ-TREE deduplicates the input MSA. Therefore, if the given MSA contains three or fewer unique alignment rows, no tree is produced, which subsequently results in this "tree file of gene not found" error.
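The condition described above can be checked before running the tree step. This is a minimal sketch under stated assumptions, not the fix from the commit: it takes a mapping of sequence ID to aligned row, and the function names and the `minimum=4` default (derived from "three or fewer" in this comment) are illustrative.

```python
def count_unique_rows(alignment):
    """Count distinct aligned rows, ignoring case, in {seq_id: aligned_row}."""
    return len({row.upper() for row in alignment.values()})


def has_enough_unique_rows(alignment, minimum=4):
    """Return True if the alignment has at least `minimum` distinct rows.

    minimum=4 reflects the "three or fewer unique rows" condition
    described in this thread; it is an assumption, not an IQ-TREE constant.
    """
    return count_unique_rows(alignment) >= minimum
```

Note that in the COX2 file quoted at the top of this thread, the first and third sequences are identical, so that file holds only two unique sequences and would fail this check.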

My recent commit rectifies this issue, and will be included in the next stable release. I suppose a binary compiled with the most recent version won't suffer from this issue anymore.

I would greatly appreciate it if anyone could test this on their dataset to see whether the issue is fixed.

JWDebler commented 1 week ago

I just ran tree with your recent commit version and got this error:

image

The command used:

ufcg tree -i output_lentis -l label -a nucleotide -t 16 -o output_lentis_tree_nucleotide

I'm not sure where the -T it is complaining about comes from. It still happens if I remove the -t 16.

endixk commented 5 days ago

The -T option is passed internally to enable multithreading in the iqtree binary. This error should not happen unless the dependent binary is either not properly installed or has been updated with this argument removed (which is not likely).

Please check your iqtree installation and try again. If the error persists, please provide the resulting messages with the -dev option given.

JWDebler commented 2 days ago

Looks like it was due to an old version of iqtree installed via apt.
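To catch an outdated binary like this early, the version can be read from the banner that iqtree prints at startup. The sketch below is illustrative only: it assumes the banner contains text of the form "version X.Y.Z", and it assumes that a major version of 2 or higher is what accepts -T. Neither assumption is confirmed in this thread, and the function names are hypothetical.

```python
import re


def parse_iqtree_version(banner):
    """Extract (major, minor, patch) from banner text, or None if not found."""
    match = re.search(r"version\s+(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def looks_new_enough(banner, min_major=2):
    """Heuristic: True if the major version is at least min_major (assumed cutoff)."""
    version = parse_iqtree_version(banner)
    return version is not None and version[0] >= min_major
```

The banner text would be copied from the first lines of the iqtree output or log; if `looks_new_enough` returns False, installing a newer iqtree than the packaged one is the likely remedy.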

JWDebler commented 2 days ago

OK, next problem :-) The tree building step finished correctly, but the final 'cleanup' didn't happen. All the files in the output directory have 'zZ' in their filenames, and the 'label' tag from the metadata file used during profile has not been applied. The files instead contain strings such as zZ2641650705628771812zZ. Previous successful runs cleaned up the directory and moved files into subfolders. Can I run the respective commands manually somehow? I just had a look at the prune module, but the run did not produce a .trm file, maybe that is the problem? Cheers
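As a stopgap, the placeholder IDs in a tree or alignment file can be swapped for labels by hand. This is a rough sketch and not UFCG's own cleanup step: the `zZ<digits>zZ` pattern is taken from the strings quoted in this thread, while the ID-to-label mapping is hypothetical and would have to be assembled from your own metadata.

```python
import re

# Placeholder IDs as they appear in this thread, e.g. zZ2641650705628771812zZ.
UID_PATTERN = re.compile(r"zZ\d+zZ")


def find_placeholders(text):
    """Return the sorted set of placeholder IDs present in the text."""
    return sorted(set(UID_PATTERN.findall(text)))


def apply_labels(text, uid_to_label):
    """Replace each known placeholder with its label; unknown IDs are left as-is."""
    return UID_PATTERN.sub(
        lambda m: uid_to_label.get(m.group(0), m.group(0)), text
    )
```

Running `find_placeholders` on a tree file first shows which IDs still need a label before anything is rewritten.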

JWDebler commented 1 day ago

This seems to be a problem with the current git version. The version installed via conda (without the iqtree fix) properly processes all the files after the tree building step.

Git version: image

Conda version: image

endixk commented 18 hours ago

@JWDebler I looked into this and found that the Maven-compiled binary doesn't properly include the GSI calculation package as a dependency. The precompiled JAR (including the conda release) doesn't suffer from this. The confusing part is that the process finishes without raising any error.

Since I do not have the source code for this package, I need to find a way to properly include it in the pom.xml configuration. Until I find a solution, please use the -G option added in the recent commit, which turns off the GSI analysis and avoids the problem. If you need a GSI-annotated tree output, please use the stable conda version.

JWDebler commented 8 hours ago

@endixk Thanks, yep, renaming works with this commit. The folder doesn't get cleaned up though: all the files are in the same folder, while the previous conda version (1.0.5) organises everything neatly like this: image

I just ran the current conda version (1.0.6) and it also didn't clean up the results folder.

endixk commented 1 hour ago

The cleaning script is included in the config payload, which is removed by a version update. It should work fine after downloading the payload again with ufcg download -t config.