transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.
GNU General Public License v3.0
54 stars 36 forks source link

How DIAMOND_analysis_counter.py works? #46

Closed Shicheng-Guo closed 4 years ago

Shicheng-Guo commented 4 years ago

Dear Sam,

Can you provide more details about how DIAMOND_analysis_counter.py and RefSeq_reduce_to_genus.py work to aggregate taxa? I am not python expert. What I find are:

1, xxx.merged.RefSeq_annot_organism.tsv in step_5_output has both species and genus terms 2, Not sure whether terms suffix with sp. belongs to species, others belong to genus? 3, RefSeq_reduce_to_genus.py -I xxx.merged.RefSeq_annot_organism.tsv will generate reduced form, but some genus will be lost. for example, Bacteria is lost in the attached example.

Thanks. Shicheng

SP-Reduceto-Genus-SAMSA2-2020-Shicheng-Guo-Strain-Skin

Shicheng

transcript commented 4 years ago

Hi Shicheng,

The analysis scripts will preserve species (there's a reducer.py script that reduces this down to just genus, if that's what you're after), but these scripts can't fix issues in the RefSeq database.

Some RefSeq annotations are not specific to a species level; that's why there are some entries that are attributed just to "Bacteria", or others that go only to genus level, like "Acidovorax sp.". This sequence has been annotated to the Acidovorax genus, but not to a particular species. Since I don't work for NCBI, I can't fix this in the underlying RefSeq database.

Regarding point 3, I'm not sure why Bacteria is being dropped; I'll investigate this and respond to this issue when I've taken a look.

transcript commented 4 years ago

Hi Shicheng,

Bug fixed! Looks like there was an issue with the RefSeq_reduce_to_genus.py script that accidentally dropped the first couple lines (line 57). I've taken this out and it should now properly include Bacteria as a category. I'd suggest re-pulling and running this updated script.