Open luizirber opened 4 years ago
Also: backport the solution to https://github.com/dib-lab/sourmash_databases
`sourmash compute` is using `--name-from-first`, which might pick up the wrong name. Some bacterial genomes in GenBank have plasmids, and if they are the first sequence it's going to mess up classification.
So, that might not be so big of an issue: `accession2taxid`, the mapping we use, points to the taxid of the assembled organism (and not to a plasmid), even if the contig is from a plasmid inside the assembly.
Need to do more checking, but anecdotal example:
https://www.ncbi.nlm.nih.gov/nuccore/NZ_LNUS01000004.1 maps to https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1282, despite the name of the sequence being Staphylococcus epidermidis strain AU23 plasmid unnamed contig_5, whole genome shotgun sequence
(granted, I grepped for `plasmid` and then looked in `nucl_wgs.accession2taxid.gz` for the mapping. NEED MORE TESTING)
Similar issue in ncbi-genome-download: https://github.com/kblin/ncbi-genome-download/issues/110#issuecomment-591830174
@taylorreiter also hit similar issues before.
A simple check: for each sig, based on filename (which matches the genbank accession), see if the sig name contains the same accession or a different one.
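A rough sketch of that check, assuming the sig filename is just the accession plus a few extensions (the suffix list and the example accession `NZ_LNUS01000001.1` are made up for illustration; the sig name would have to be read from the signature file separately):

```python
from pathlib import Path

# Assumed set of extensions that may follow the accession in a filename.
SUFFIXES = (".gz", ".sig", ".fna", ".fa", ".fasta")


def accession_from_filename(path):
    """Strip the directory and known extensions, leaving the accession."""
    name = Path(path).name
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                stripped = True
    return name


def name_matches_accession(sig_path, sig_name):
    """True if the sig name mentions the accession taken from the filename."""
    return accession_from_filename(sig_path) in sig_name


# Hypothetical filename; the sig name is the plasmid contig from the example above.
suspect = not name_matches_accession(
    "sigs/NZ_LNUS01000001.1.sig",
    "NZ_LNUS01000004.1 Staphylococcus epidermidis strain AU23 plasmid unnamed contig_5",
)
```

Any sig where this returns `False` would be worth a manual look, since its name came from a different record than the filename suggests.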
A solution: change the `snakemake` rule to find the proper name in the file first, and then set it with `--name` in `sourmash compute`.
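One possible way to do the "find the proper name first" step, as a sketch only: scan the FASTA headers and take the first one that does not mention "plasmid", falling back to the first header. That heuristic and the helper names (`pick_name`, `fasta_headers`, `compute_command`) are assumptions for illustration, not what the pipeline currently does. The command is only built here, not run.

```python
import gzip


def fasta_headers(path):
    """Yield FASTA header lines (without the leading '>') from a plain or gzipped file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith(">"):
                yield line[1:].strip()


def pick_name(headers):
    """Return the first header that does not mention a plasmid.

    Falls back to the first header if every record is a plasmid,
    and returns None if there are no headers at all.
    """
    headers = list(headers)
    for header in headers:
        if "plasmid" not in header.lower():
            return header
    return headers[0] if headers else None


def compute_command(path):
    """Build the argument list for sourmash compute with an explicit --name."""
    name = pick_name(fasta_headers(path))
    if name is None:
        raise ValueError(f"no FASTA records found in {path}")
    return ["sourmash", "compute", "--name", name, str(path)]
```

In a snakemake rule, the list returned by `compute_command` could be passed to `subprocess.run`, or `pick_name` could be used to fill a param that feeds `--name` in the shell command.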