Closed Shicheng-Guo closed 4 years ago
Hi Shicheng,
The analysis scripts will preserve species (there's a reducer.py script that reduces this down to just genus, if that's what you're after), but these scripts can't fix issues in the RefSeq database.
Some RefSeq annotations are not specific to a species level; that's why there are some entries that are attributed just to "Bacteria", or others that go only to genus level, like "Acidovorax sp.". This sequence has been annotated to the Acidovorax genus, but not to a particular species. Since I don't work for NCBI, I can't fix this in the underlying RefSeq database.
Regarding point 3, I'm not sure why Bacteria is being dropped; I'll investigate this and respond to this issue when I've taken a look.
Hi Shicheng,
Bug fixed! Looks like there was an issue with the RefSeq_reduce_to_genus.py script that accidentally dropped the first couple lines (line 57). I've taken this out and it should now properly include Bacteria as a category. I'd suggest re-pulling and running this updated script.
Dear Sam,
Can you provide more details about how
DIAMOND_analysis_counter.py
andRefSeq_reduce_to_genus.py
work to aggregate taxa? I am not python expert. What I find are:1,
xxx.merged.RefSeq_annot_organism.tsv
instep_5_output
has both species and genus terms 2, Not sure whether terms suffix with sp. belongs to species, others belong to genus? 3,RefSeq_reduce_to_genus.py -I xxx.merged.RefSeq_annot_organism.tsv
will generate reduced form, but some genus will be lost. for example, Bacteria is lost in the attached example.Thanks. Shicheng
Shicheng