nhoffman / ya16sdb

A curated subset of 16S rRNA sequences from NCBI

genome identifier #36

Closed nhoffman closed 1 year ago

nhoffman commented 4 years ago

We need to define an identifier that can be used to group records by genome sequencing project: for shotgun assemblies, accessions refer to a contig, and as a result multiple accessions can refer to the same assembly.

see also #32

marykstewart commented 4 years ago

You may already have tried this, but just in case not:
The whole genome project number (AGXQ01 or QSCG01 for the two test assemblies I was just using, see attachments) pulls up the assembly in the NCBI Assembly database and the contigs in the Nucleotide database (plus the whole genome project record, which doesn't have sequence directly associated with it). So it seems like that might serve as a bridge. The GCF accession (which only RefSeq assemblies have) will also do this, but the GCA accession (which both RefSeq and non-RefSeq assemblies have) will not. It could be argued that assemblies that don't qualify for RefSeq status should be excludable anyway (QSCG01 is one of those; the reasons for exclusion from RefSeq are in the record summary and don't inspire confidence).
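
If the project number is used as the bridge, it could be derived from each contig accession directly. The sketch below is only an illustration, not code from this repository: the function name `wgs_project_prefix` is hypothetical, and the regular expression assumes the usual WGS accession layout (four or six letters, a two-digit assembly version, then at least six digits, with an optional `NZ_` RefSeq prefix and an optional `.N` version suffix).

```python
import re

# Assumed WGS accession layout: optional "NZ_" (RefSeq), 4 or 6 letters,
# a 2-digit assembly version, at least 6 digits, and an optional ".N" suffix.
WGS_PATTERN = re.compile(r"(?:NZ_)?([A-Z]{4}|[A-Z]{6})(\d{2})\d{6,}(?:\.\d+)?")


def wgs_project_prefix(accession):
    """Return the project prefix (e.g. 'AGXQ01') or None if not WGS-like."""
    match = WGS_PATTERN.fullmatch(accession.strip())
    if match is None:
        return None
    letters, version = match.groups()
    return letters + version


# Contig-style accessions built from the two project numbers mentioned above.
print(wgs_project_prefix("AGXQ01000001.1"))
print(wgs_project_prefix("NZ_QSCG01000012"))
# An accession that does not follow the WGS layout yields None.
print(wgs_project_prefix("NR_074334.1"))
```

Records whose accession does not match (complete genomes, single-gene records) return `None` and would need a different grouping key.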

AGXQ01_report QSCG01_report

crosenth commented 2 years ago

The release 0.7.3 seq_info.csv includes two columns with WGS accessions: "master" and "assembly_genbank".
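
A minimal sketch of grouping records by genome using the "master" column. The column names "master" and "assembly_genbank" come from the comment above; the "seqname" column, the function name `group_by_genome`, and all of the example values are assumptions made up for illustration and are not taken from the actual release.

```python
import csv
import io
from collections import defaultdict


def group_by_genome(handle, id_column="seqname", group_column="master"):
    """Map each non-empty value of group_column to the list of record ids."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        key = (row.get(group_column) or "").strip()
        if key:  # records without a WGS master are left ungrouped
            groups[key].append(row[id_column])
    return dict(groups)


# Made-up rows standing in for seq_info.csv; real values will differ.
example = io.StringIO(
    "seqname,master,assembly_genbank\n"
    "contig_a,XXXX01000000,GCA_example_1\n"
    "contig_b,XXXX01000000,GCA_example_1\n"
    "contig_c,YYYY01000000,GCA_example_2\n"
    "single_gene,,\n"
)

groups = group_by_genome(example)
for master, seqnames in groups.items():
    print(master, seqnames)
```

With a real file, the same function could be called on `open("seq_info.csv", newline="")`, and passing `group_column="assembly_genbank"` would group by assembly accession instead.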

crosenth commented 1 year ago

Anything left on this Issue?