Current Behavior
Currently, the primary key that's associated with each sample during ingest includes the GenBank version ID, e.g. .1, resulting in metadata like:
accession
Z80832.1
Z80831.1
and sequences FASTA like:
>Z80832.1
ATG...
This necessitates annotations keyed on the versioned accession, like the ones I used for PR #23:
AF266288.2 strain Measles strain Edmonston WT
AF266288.2 date 1954
AF266288.2 region North America
AF266288.2 country USA
However, authors often update their records in GenBank. Each update increments the version of the ingested accession (e.g. AF266288.2 becomes AF266288.3), so the annotation no longer matches the record.
I also think that the .1, .2, etc. suffix adds unnecessary noise when looking at the FASTA and metadata. GenBank accessions for mpox and SARS-CoV-2 don't include the version ID.
Expected behavior
Drop the version ID from accessions during GenBank ingest.
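To make the request concrete, here is a minimal sketch of the stripping step in Python. It is not the ingest pipeline's actual code: the function names `strip_version` and `normalize_record` and the `accession_version` field are hypothetical, and it assumes the version is always a trailing `.` followed by digits.

```python
import re

# Assumption: a GenBank version ID is a trailing "." followed by digits.
VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(accession: str) -> str:
    """Return the accession without its trailing version ID, if any."""
    return VERSION_SUFFIX.sub("", accession.strip())


def normalize_record(record: dict) -> dict:
    """Key a metadata record on the unversioned accession.

    The versioned value is kept in a hypothetical "accession_version"
    field so that the information is not lost.
    """
    versioned = record["accession"]
    return {
        **record,
        "accession": strip_version(versioned),
        "accession_version": versioned,
    }


if __name__ == "__main__":
    for acc in ["Z80832.1", "AF266288.2", "AF266288"]:
        print(acc, "->", strip_version(acc))
```

With this in place, an annotation keyed on AF266288 would keep matching after the record is bumped from AF266288.2 to AF266288.3. The same function would need to be applied to the FASTA headers so that sequences and metadata stay in sync.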