nextstrain / measles

Nextstrain build for measles virus
https://nextstrain.org/measles

Primary key shouldn't include version number in GenBank accession #24

Closed trvrb closed 2 months ago

trvrb commented 2 months ago

Current Behavior

Currently, the primary key associated with each sample during ingest includes the GenBank version ID (e.g. .1), resulting in metadata like:

accession
Z80832.1
Z80831.1

and sequences FASTA like:

>Z80832.1
ATG...

This necessitates annotations keyed on the versioned accession, like the ones I used for PR #23:

AF266288.2  strain  Measles strain Edmonston WT
AF266288.2  date    1954
AF266288.2  region  North America
AF266288.2  country USA

However, authors may often update records in GenBank. Such an update can increment the version in the ingested accession, which would break the match with our annotations.

I also think that the .1, .2, etc. suffixes add unnecessary noise when looking at FASTA and metadata. GenBank accessions for mpox and SARS-CoV-2 don't include the version ID.

Expected Behavior

Drop the version ID from accessions during GenBank ingest.
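
As a rough illustration of the idea (not how the ingest pipeline actually does it), here is a minimal Python sketch. It assumes the version is always a trailing dot followed by digits; `strip_version` and the `annotations` dict are hypothetical names, and the `.3` accession is made up to stand in for a future GenBank update.

```python
import re

# Assumption: the GenBank version is a trailing "." followed by digits.
VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(accession: str) -> str:
    """Return the accession without its trailing version suffix, if any."""
    return VERSION_SUFFIX.sub("", accession.strip())


# Hypothetical annotations keyed on the versionless accession
# (values copied from the PR #23 example above).
annotations = {
    "AF266288": {"strain": "Measles strain Edmonston WT", "date": "1954"},
}

# "AF266288.3" is a made-up future version, used only to show that the
# annotation still matches after the record is updated.
for ingested in ["AF266288.2", "AF266288.3"]:
    key = strip_version(ingested)
    print(ingested, "->", key, "annotated:", key in annotations)
```

With the versionless key, both the current and the hypothetical updated record resolve to the same annotation entry. If the full versioned accession is still useful for provenance, it could be kept in a separate metadata column rather than in the primary key.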