Open: mattjmeier opened this issue 4 years ago
Hi Matt,
This is a good suggestion, and I'll tag it as an enhancement. It shouldn't be too difficult to add another parameter that captures the full name when extracting from the RefSeq database. Feel free to submit a PR if you want to tackle this; otherwise I'll work on it when I have a chance and update this ticket.
Best, Sam
Hello,
I've been using the SAMSA2 pipeline and it works great for my application.
One thing I've noticed is that the genus/species names reported in the Step 5 outputs are built from the final two space-separated words of the taxonomy string. Most of the time this works well enough (e.g., the output is something like Bacillus subtilis, a proper genus and species pair).
But I have quite a few cases where the output is something like "sp. Root239" or "sp. NRRL". The latter is particularly uninformative because NRRL is a culture collection, so the name could be pointing to almost anything.
Is there a way to modify the script so that the user can get the full taxonomy in the output? I see that the DIAMOND_general_RefSeq_analysis_counter.py Python script handles this parsing (around line 132, if I'm reading it correctly). An option to add a taxid column to the output would also be useful.
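To make the request concrete, here is a rough sketch of the kind of option I have in mind. It is not the code from DIAMOND_general_RefSeq_analysis_counter.py: the function names, the `full_name` flag, and the example header (including its placeholder accession) are all made up for illustration, and it assumes the organism name is the last square-bracketed field of the RefSeq title.

```python
def organism_from_header(header):
    """Return the text inside the last [...] pair of a RefSeq-style title.

    Assumption: the organism sits in the final bracketed field and the
    name itself contains no nested brackets. Returns None if not found.
    """
    start = header.rfind("[")
    end = header.rfind("]")
    if start == -1 or end < start:
        return None
    return header[start + 1:end].strip()


def short_name(organism):
    """Mimic the behaviour described above: keep only the last two words."""
    return " ".join(organism.split()[-2:])


def format_name(organism, full_name=False):
    """Hypothetical switch: full organism string or the two-word form."""
    return organism if full_name else short_name(organism)


# Illustrative header; the accession is a placeholder, not a real record.
header = "WP_000000000.1 hypothetical protein [Bacillus sp. Root239]"
organism = organism_from_header(header)

print(format_name(organism))                  # sp. Root239
print(format_name(organism, full_name=True))  # Bacillus sp. Root239
```

With the flag off, well-formed names such as Bacillus subtilis come out unchanged, so the default output would stay the same as it is now.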
Thanks for any input you have on this! Matt