dspeth closed this issue 8 months ago
I guess explicit and clear warnings at the end of the program, in addition to a flag like --yes-anvio-did-warn-me-that-some-genomes-will-be-removed-from-my-analysis-and-i-will-read-the-program-outputs-carefully, would help.
On the first run, if some genomes are missing the genes, quit and tell the user about it. If they want anvi'o to run regardless, ask them to include the flag.
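A minimal sketch of that "quit unless the user opts in" behavior, in plain Python. This is not anvi'o's actual code: `check_missing_genomes`, `MissingGenesError`, and the `proceed_anyway` argument are hypothetical names used only for illustration, and the input is assumed to be a simple mapping from genome name to the number of requested genes found in it.

```python
class MissingGenesError(Exception):
    """Raised when genomes would be dropped and the user has not opted in."""


def check_missing_genomes(hits_per_genome, proceed_anyway=False):
    """Split genomes into kept and removed based on hit counts.

    hits_per_genome: dict mapping genome name -> number of requested genes found
                     (hypothetical input format, for illustration only).
    proceed_anyway:  stands in for the opt-in flag discussed above.

    Returns (kept, removed). Raises MissingGenesError when at least one genome
    has zero hits and the user did not opt in.
    """
    removed = sorted(g for g, n in hits_per_genome.items() if n == 0)
    kept = sorted(g for g, n in hits_per_genome.items() if n > 0)

    if removed and not proceed_anyway:
        raise MissingGenesError(
            f"{len(removed)} genome(s) have no hits for the requested genes: "
            f"{', '.join(removed)}. Re-run with the opt-in flag to continue "
            f"without them."
        )

    return kept, removed
```

With the opt-in set, the same call returns the list of removed genomes instead of raising, so it can still be reported at the end of the run.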
I'm personally completely ok with the same style of the --max-num-genes-missing-from-bin flag, without the program quitting. This is what it looks like when I run this:
WARNING
===============================================
The --max-num-genes-missing-from-bin flag caused the removal of 58 bins (or
genomes, whatever) from your analysis. This is the list of bins that will live
in our memories: GCA_028817545, GCA_003235785, IMG_3300027386_28, GCA_016178585,
GCA_947485305, GCA_027018225, GCA_024281365, GCA_945860965, GCA_025360665,
GCA_021604145, GCA_005116815, GCA_022572005, GCA_003228495, GCA_005877575,
GCA_007570995, GCA_020853775, GCA_005799375, GCA_016788925, IMG_3300027950_3,
GCA_902500715, GCA_023150745, GCA_027021065, GCA_005877565, GCA_027717545,
GCA_900299245, GCA_016200355, GCA_016873525, GCA_024998705, GCA_005116965,
GCA_012270585, NBOC_1, GCA_005116945, GCA_020028215, GCA_015904025,
GCA_002451055, GCA_020028275, GCA_005239595, GCA_013140455, GCA_016195505,
GCA_005877505, IMG_3300005466_14, GCA_005808785, GCA_016195485, GCA_013696305,
GCA_009836985, GCA_029858925, IMG_3300027332_2, GCA_000196815, GCA_003696975,
GCA_002869885, GCA_013388555, GCA_919902955, GCA_016222925, GCA_015664745,
GCA_017986555, GCA_028817215, GCA_024331085, IMG_3300027277_3
I think part of the issue is the luxury that you all have made anvi'o so explicitly verbose in most cases that the absence of a warning makes me think all is well.
OK. Let me look into this :)
I think this is now addressed, Daan.
Please let us know if you see a problem, and/or if it looks like the solution is appropriate.
works like a charm on my end, and is exactly the type of message that enables someone to figure out what happened to their bins :) Thanks a lot for the quick fix Meren!
Thank you very much for pushing for this and the feedback, Daan. I'm glad it is now resolved to your liking :)
Short description of the problem
As stated in the title, anvi-get-sequences-for-hmm-hits, when used with --gene-names for phylogenomics (i.e., also with --concatenate), silently removes genomes that don't have any hits to the genes provided. In contrast, the --max-num-genes-missing-from-bin parameter informs the user which genomes did not pass the set threshold.
anvi'o version
anvi'o v8. development version
System info
Linux install through conda, following the instructions on the anvi'o webpage
Detailed description of the issue
Not sure this is a bug or a feature request, but I would love to have a warning stating which genomes get removed after the hmm hits are filtered for the provided gene names. The use case where I encountered this was a tree with the Ribosomal proteins used by Laura Hug and colleagues for their tree of life. Since these were chosen because they are often colocated, it is quite possible that these genes are all absent from a fragmented genome.
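To make the request concrete, here is a rough sketch of the kind of check that could produce such a warning. It is an assumption-laden illustration, not how anvi'o stores or filters HMM hits: `genomes_without_hits` and `format_removal_warning` are hypothetical helpers, hits are modeled as plain `(genome, gene)` pairs, and the genome and gene names in the example are made up.

```python
def genomes_without_hits(all_genomes, hits, gene_names):
    """Return genomes left with no hits once hits are filtered to gene_names.

    all_genomes: iterable of genome names in the analysis.
    hits:        iterable of (genome, gene) pairs (hypothetical format).
    gene_names:  the gene names the user asked for.
    """
    wanted = set(gene_names)
    genomes_with_hits = {genome for genome, gene in hits if gene in wanted}
    return sorted(set(all_genomes) - genomes_with_hits)


def format_removal_warning(removed):
    """Build a warning listing removed genomes, or '' if none were removed."""
    if not removed:
        return ""
    return (
        f"WARNING: {len(removed)} genome(s) had no hits for any of the "
        f"requested gene names and were removed from the analysis: "
        f"{', '.join(removed)}"
    )


# Illustrative data only: G2 has a hit, but not to a requested gene.
all_genomes = ["G1", "G2", "G3"]
hits = [("G1", "Ribosomal_L2"), ("G2", "RecA"), ("G1", "Ribosomal_L3")]
removed = genomes_without_hits(all_genomes, hits, ["Ribosomal_L2", "Ribosomal_L3"])
print(format_removal_warning(removed))
```

The point of separating detection from formatting is that the list of removed genomes can be shown on screen and also written to a file, so a fragmented genome that lost all of its colocated ribosomal proteins does not disappear unnoticed.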