transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.
GNU General Public License v3.0

Truncated names in RefSeq organism results #16

Closed: rachaellappan closed this issue 6 years ago

rachaellappan commented 6 years ago

Hi Sam,

Just a small bug - I've noticed that for some organisms, the beginning of the name is missing from the organism results.

You can see this in some of your sample files, for example: https://github.com/transcript/samsa2/blob/master/sample_files_paired-end/6_RefSeq_org_results/control_1_TINY.RefSeq_annot_organism.tsv

Line 13 should be 'Prevotella sp. HMSC073D09', line 17 should be 'Bacteroidales bacterium KA00344' and so on. I fortunately haven't run into too many of these in my own results so I can just grep the truncated part to get the full name manually from the database header.

This happens with all of the organism names containing "sp." but also with others, so I can't quite work out why it's grabbing from the middle of the [organism name] from the RefSeq header (is it expecting only two words?).

Cheers, Rachael

transcript commented 6 years ago

Hi Rachael,

Getting around to these smaller bugs - this one's now fixed. I corrected the parser in the Python scripts so that it selects the organism name properly. It looks like the program was selecting the last two words of the organism name instead of the first two.

Should be corrected everywhere now.

Sam
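
For anyone reading this later, here is a minimal sketch of the behaviour discussed above. It is not the actual SAMSA2 parser code. It assumes the organism name is the last bracketed group in a RefSeq-style header, and the accession numbers, protein descriptions, and function names (`organism_from_header`, `last_two_words`) are made up for illustration. Taking the last two words of the name reproduces the truncation reported here, while returning the whole bracketed group keeps names of any length intact.

```python
def organism_from_header(header: str) -> str:
    """Return the text inside the last balanced [...] group, or '' if none.

    Assumption: the organism name is the final bracketed group of the header.
    """
    end = header.rfind("]")
    if end == -1:
        return ""
    depth = 0
    # Walk backwards from the last ']' to find its matching '[',
    # so earlier brackets in the protein description are ignored.
    for i in range(end, -1, -1):
        if header[i] == "]":
            depth += 1
        elif header[i] == "[":
            depth -= 1
            if depth == 0:
                return header[i + 1:end].strip()
    return ""  # unbalanced brackets


def last_two_words(name: str) -> str:
    """Hypothetical version of the bug: keep only the last two words."""
    return " ".join(name.split()[-2:])


# Made-up headers; only the organism names come from this issue.
headers = [
    "WP_000000001.1 hypothetical protein [Prevotella sp. HMSC073D09]",
    "WP_000000002.1 [2Fe-2S] ferredoxin [Bacteroidales bacterium KA00344]",
]

for h in headers:
    full = organism_from_header(h)
    print(f"{last_two_words(full)!r} -> {full!r}")
```

Keeping the whole bracketed group avoids any assumption about how many words an organism name has, which matters for strain-level names like the two above.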