luizirber / 2020-cami

Preparing sourmash for CAMI 2 evaluations

Deal with plasmids in assemblies #1

Open luizirber opened 4 years ago

luizirber commented 4 years ago

sourmash compute is using --name-from-first, which might pick up the wrong name. Some bacterial genomes in GenBank include plasmids, and if a plasmid is the first sequence in the file, the signature gets named after it, which is going to mess up classification.

Similar issue in ncbi-genome-download: https://github.com/kblin/ncbi-genome-download/issues/110#issuecomment-591830174

@taylorreiter also hit similar issues before.

A simple check: for each sig, take the accession from the filename (which matches the GenBank accession) and see whether the sig name contains the same accession or a different one.
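A rough sketch of that check in Python, working on plain strings only (the sig name and the sig filename). The accession regex and the assumption that the filename starts with the accession are guesses about the layout, not something confirmed for this repo, and both function names are made up for illustration.

```python
import os
import re

# ASSUMPTION: accessions look like "NZ_LNUS01000004.1" or "GCA_000005845.2"
# and the sig filename starts with one. Adjust to the real naming scheme.
ACCESSION_RE = re.compile(r"[A-Z]{1,4}_?[A-Z0-9]+\.\d+")


def accession_from_filename(path):
    """Return the accession at the start of the file's basename, or None."""
    match = ACCESSION_RE.match(os.path.basename(path))
    return match.group(0) if match else None


def name_matches_filename(sig_name, path):
    """True if the accession from the filename appears as a token in the sig name."""
    accession = accession_from_filename(path)
    if accession is None:
        return False
    return accession in sig_name.split()
```

Any sig where `name_matches_filename` returns False is a candidate for having been named after the wrong sequence and is worth inspecting by hand.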

A solution: change the snakemake rule to find the proper name in the file first, and then set it with --name in sourmash compute.
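One possible way to find the proper name, as a sketch: scan the FASTA headers and take the first one that does not mention "plasmid", falling back to the very first header if every record is a plasmid. The "plasmid" substring test is a heuristic I am assuming here, and `pick_name` and `open_fasta` are hypothetical helpers, not part of sourmash.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def pick_name(path):
    """Return the first header not mentioning 'plasmid' (heuristic).

    Falls back to the first header in the file, or None if there is none.
    """
    first = None
    with open_fasta(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            header = line[1:].strip()
            if first is None:
                first = header
            if "plasmid" not in header.lower():
                return header
    return first
```

The returned string could then be handed to the snakemake rule and passed along with --name instead of relying on --name-from-first.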

luizirber commented 4 years ago

Also: backport the solution to https://github.com/dib-lab/sourmash_databases

luizirber commented 4 years ago

> sourmash compute is using --name-from-first, which might pick up the wrong name. Some bacterial genomes in genbank have plasmids, and if they are the first sequence it's going to mess up classification.

So, that might not be that big of an issue: accession2taxid, the mapping we use, points to the taxid of the assembled organism (and not to a plasmid), even if the contig comes from a plasmid inside the assembly.

Need to do more checking, but here is an anecdotal example: https://www.ncbi.nlm.nih.gov/nuccore/NZ_LNUS01000004.1 maps to https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1282, even though the sequence is named "Staphylococcus epidermidis strain AU23 plasmid unnamed contig_5, whole genome shotgun sequence". Granted, I only grepped for "plasmid" and then looked up the mapping in nucl_wgs.accession2taxid.gz. NEED MORE TESTING.
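To make the "more testing" less manual, something like the following could look up a batch of plasmid accessions in one pass. It assumes the usual NCBI accession2taxid layout (tab-separated, one header line, columns accession, accession.version, taxid, gi); `lookup_taxids` is a hypothetical helper name.

```python
import gzip


def lookup_taxids(path, accessions):
    """Map accession.version strings to taxids from a gzipped accession2taxid file.

    ASSUMPTION: tab-separated columns are accession, accession.version,
    taxid, gi, with a single header line.
    """
    wanted = set(accessions)
    found = {}
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip malformed lines
            if fields[1] in wanted:
                found[fields[1]] = int(fields[2])
                if len(found) == len(wanted):
                    break  # everything requested was found
    return found
```

Feeding it every accession whose header mentions "plasmid" and comparing the resulting taxids against the taxid of the parent assembly would show whether the single example above generalizes.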