merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Filter partial gene calls #2102

Closed by mschecht 1 year ago

mschecht commented 1 year ago

This PR allows the user to remove hmm-hits that come from partial ORFs. This is useful for users interested in finding trustworthy ORFs in metagenomic assemblies. This PR also adds this feature to EcoPhylo.

meren commented 1 year ago

Thank you for these changes, Matt!

An update to anvio/docs/workflows/ecophylo.md would have been great to highlight all the levels of stringency EcoPhylo uses, so that it is clear to users that many things are considered and that they are getting the best out of this tool, as it eliminates as many likely bioinformatics artifacts as possible. I feel like it could even be a section of its own, where bullet points would highlight the measures taken by EcoPhylo (since we will link that document from everywhere, it would be useful if the first thing people see is something they can relate to and quickly understand). I think you (and later others) would benefit a lot if ecophylo.md were to serve as a diary of your journey, where you revisit it once in a while, give it another read, and see whether it best reflects your most up-to-date understanding of this approach.

Regardless of this point, please feel free to merge this. Here is one tiny suggestion though: I'd change the help menu item from the following,

"Only keep hmm-hits from open reading frames with start and stop codons."

to the following:

"Partial genes can lead to spurious branches and/or inflate the number of observed populations or functions in a given set of genomes/metagenomes. Using this flag you can instruct anvi'o to only keep HMM hits from open reading frames that represent complete genes (i.e., genes that are not partial and that start with a start codon and end with a stop codon)."
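For readers following along, the definition of a complete gene in the help text above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in this PR or anvi'o's implementation: the function names, the dictionary layout, and the choice of start and stop codons (common bacterial and archaeal starts, standard stops) are all assumptions made for the example.

```python
# Illustrative sketch only: NOT anvi'o code. All names here are hypothetical.

# Assumption: start codons commonly used by bacteria and archaea.
START_CODONS = {"ATG", "GTG", "TTG"}
# Assumption: the three standard stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_orf(sequence):
    """Return True if the nucleotide sequence starts with a start codon,
    ends with a stop codon, and has a length that is a multiple of three."""
    seq = sequence.upper()
    return (
        len(seq) >= 6
        and len(seq) % 3 == 0
        and seq[:3] in START_CODONS
        and seq[-3:] in STOP_CODONS
    )


def keep_hits_from_complete_genes(hmm_hits, gene_sequences):
    """Keep only the HMM hits whose gene call is a complete ORF.

    hmm_hits: list of dicts, each with a 'gene_callers_id' key (hypothetical layout).
    gene_sequences: dict mapping gene caller id -> nucleotide sequence.
    Hits whose gene has no known sequence are dropped.
    """
    return [
        hit
        for hit in hmm_hits
        if is_complete_orf(gene_sequences.get(hit["gene_callers_id"], ""))
    ]


# Toy data, made up for this example.
gene_sequences = {
    1: "ATGAAACCCTAA",  # has a start codon and a stop codon
    2: "AAACCCGGGTAA",  # missing a start codon (partial at the 5' end)
    3: "ATGAAACCCGGG",  # missing a stop codon (partial at the 3' end)
}
hmm_hits = [
    {"gene_callers_id": 1, "gene_name": "hit_a"},
    {"gene_callers_id": 2, "gene_name": "hit_b"},
    {"gene_callers_id": 3, "gene_name": "hit_c"},
]

complete_hits = keep_hits_from_complete_genes(hmm_hits, gene_sequences)
kept_ids = [hit["gene_callers_id"] for hit in complete_hits]
print(kept_ids)
```

In this toy example only the first hit survives, since the other two genes lack either a start or a stop codon. In practice a gene caller usually reports whether a call is partial, so a real filter would more likely rely on that flag than re-inspect the codons.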

mschecht commented 1 year ago

@meren I integrated your suggestion and updated the ecophylo documentation. I am ready to merge, but if you have a second to comment on the documentation updates that would be great!

meren commented 1 year ago

Thank you, @mschecht! I just made some updates in fed0c30232d824cd16df84ff1bba96c8eb9b5504 for readability and cosmetics. Please feel free to merge it if you agree with those changes.