merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] anvi-get-sequences-for-hmm-hits in combination with --gene-names silently removes genomes #2170

Closed dspeth closed 8 months ago

dspeth commented 8 months ago

Short description of the problem

As stated in the title, anvi-get-sequences-for-hmm-hits, when used with --gene-names for phylogenomics (i.e., also with --concatenate), silently removes genomes that don't have any hits to the genes provided. In contrast, the --max-num-genes-missing-from-bin parameter informs the user which genomes did not pass the set threshold.

anvi'o version

anvi'o v8. development version

System info

Linux, installed through conda following the instructions on the anvi'o webpage

Detailed description of the issue

I'm not sure whether this is a bug or a feature request, but I would love to have a warning stating which genomes get removed after the HMM hits are filtered for the provided gene names. The use case where I encountered this was a tree built with the ribosomal proteins used by Laura Hug and colleagues for their tree of life. Since these were chosen because they are often colocated, it is quite possible that these genes are all absent from a fragmented genome.
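To make the request concrete, here is a minimal sketch of the kind of check being asked for. It is not anvi'o's actual code: the function name `find_dropped_genomes`, the dictionary layout, and the genome and gene names are all hypothetical and used only for illustration.

```python
from typing import Dict, Iterable, List, Set


def find_dropped_genomes(
    hits_per_genome: Dict[str, Set[str]],
    gene_names: Iterable[str],
) -> List[str]:
    """Return the genomes that have no hit to any of the requested genes.

    hits_per_genome maps a genome name to the set of gene names with an
    HMM hit in that genome (a hypothetical, simplified data layout).
    """
    wanted = set(gene_names)
    # A genome drops out of the analysis if none of its hits are in `wanted`.
    return sorted(
        genome for genome, genes in hits_per_genome.items() if not (genes & wanted)
    )


# Placeholder data: genome_B has hits, but none to the requested genes;
# genome_C has no hits at all.
hits = {
    "genome_A": {"Ribosomal_L2", "Ribosomal_S3"},
    "genome_B": {"Ribosomal_L14"},
    "genome_C": set(),
}

dropped = find_dropped_genomes(hits, ["Ribosomal_L2", "Ribosomal_S3"])
if dropped:
    print(
        f"WARNING: filtering by gene names removed {len(dropped)} genome(s) "
        f"from the analysis: {', '.join(dropped)}"
    )
```

The point is only that the list of removed genomes is cheap to compute once the hits have been filtered, so reporting it should not require any extra passes over the data.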

meren commented 8 months ago

I guess explicit and clear warnings at the end of the program, in addition to a flag like --yes-anvio-did-warn-me-that-some-genomes-will-be-removed-from-my-analysis-and-i-will-read-the-program-outputs-carefully would help.

On the first run, if genes are missing from some genomes, quit and tell the user about it. If they want anvi'o to run regardless, ask them to include the flag.
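The quit-unless-acknowledged behavior described here could look roughly like the sketch below. This is an assumption about the general pattern, not the change that was made in anvi'o: `MissingGenomesError`, `check_dropped_genomes`, and the `user_acknowledged` parameter are hypothetical names standing in for whatever flag the program would expose.

```python
from typing import List


class MissingGenomesError(Exception):
    """Raised when genomes would be removed and the user has not opted in."""


def check_dropped_genomes(dropped: List[str], user_acknowledged: bool = False) -> str:
    """Stop on the first run; continue with a warning once the user opts in.

    Returns the warning text to print (an empty string if nothing was dropped).
    """
    if not dropped:
        return ""

    listing = ", ".join(sorted(dropped))

    if not user_acknowledged:
        # First run: refuse to continue and say exactly which genomes are affected.
        raise MissingGenomesError(
            f"{len(dropped)} genome(s) have no hits to the requested genes and "
            f"would be removed: {listing}. Re-run with the acknowledgement flag "
            f"to continue anyway."
        )

    # The user opted in: continue, but still report what was removed.
    return f"WARNING: removed {len(dropped)} genome(s) from the analysis: {listing}"
```

Raising an error by default makes the removal impossible to miss, at the cost of a second run; the warning-only style proposed later in this thread trades that safety for convenience.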

dspeth commented 8 months ago

I'm personally completely ok with the same style of the --max-num-genes-missing-from-bin flag, without the program quitting. This is what it looks like when I run this:

WARNING
===============================================
The --max-num-genes-missing-from-bin flag caused the removal of 58 bins (or genomes, whatever) from your analysis. This is the list of bins that will live in our memories: GCA_028817545, GCA_003235785, IMG_3300027386_28, GCA_016178585, GCA_947485305, GCA_027018225, GCA_024281365, GCA_945860965, GCA_025360665, GCA_021604145, GCA_005116815, GCA_022572005, GCA_003228495, GCA_005877575, GCA_007570995, GCA_020853775, GCA_005799375, GCA_016788925, IMG_3300027950_3, GCA_902500715, GCA_023150745, GCA_027021065, GCA_005877565, GCA_027717545, GCA_900299245, GCA_016200355, GCA_016873525, GCA_024998705, GCA_005116965, GCA_012270585, NBOC_1, GCA_005116945, GCA_020028215, GCA_015904025, GCA_002451055, GCA_020028275, GCA_005239595, GCA_013140455, GCA_016195505, GCA_005877505, IMG_3300005466_14, GCA_005808785, GCA_016195485, GCA_013696305, GCA_009836985, GCA_029858925, IMG_3300027332_2, GCA_000196815, GCA_003696975, GCA_002869885, GCA_013388555, GCA_919902955, GCA_016222925, GCA_015664745, GCA_017986555, GCA_028817215, GCA_024331085, IMG_3300027277_3

I think part of the issue is a luxury problem: you all have made anvi'o so explicitly verbose in most cases that the absence of a warning makes me think all is well.

meren commented 8 months ago

OK. Let me look into this :)

meren commented 8 months ago

I think this is now addressed, Daan.

meren commented 8 months ago

Please let us know if you see a problem, and/or if it looks like the solution is appropriate.

dspeth commented 8 months ago

works like a charm on my end, and is exactly the type of message that enables someone to figure out what happened to their bins :) Thanks a lot for the quick fix Meren!

meren commented 8 months ago

Thank you very much for pushing for this and the feedback, Daan. I'm glad it is now resolved to your liking :)