Parsing of genus/species names

Hello,

I've been using the SAMSA2 pipeline and it works great for my application.

One thing I've noticed is that the genus/species names reported in the Step 5 outputs are parsed from the final two space-separated words of the taxonomy string. Most of the time this works well enough (e.g., the output is something like Bacillus subtilis, a proper genus and species pair).

But I have quite a few cases where the output is something like "sp. Root239" or "sp. NRRL". The latter is particularly uninformative because NRRL is a culture collection, so it could be pointing to almost anything.

I'm wondering whether the script's output could be modified so that the user gets the full taxonomy. I see that the DIAMOND_general_RefSeq_analysis_counter.py Python script handles this parsing (around line 132, if I'm reading it correctly). Even an option to add a taxid column to the output would be useful.
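To illustrate what I mean, here is a minimal sketch of the behaviour as I understand it. This is not the actual code from the script: the function names are made up, and "Examplegenus sp. Root239" is a placeholder string, since I'm only assuming the organism name arrives as a single space-separated string.

```python
def last_two_tokens(name):
    """Keep only the final two space-separated tokens (the behaviour described above)."""
    return " ".join(name.split()[-2:])


def full_name(name):
    """Hypothetical alternative: keep the whole name, only normalising whitespace."""
    return " ".join(name.split())


if __name__ == "__main__":
    # Placeholder inputs; the second one is invented purely for illustration.
    for organism in ["Bacillus subtilis", "Examplegenus sp. Root239"]:
        print(last_two_tokens(organism), "|", full_name(organism))
```

With the second input, the two-token version drops the genus and leaves only "sp. Root239", whereas keeping the full string would preserve it.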

Thanks for any input you have on this!

Matt
