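Description
We currently index the virus table in RethinkDB on the strain name of each isolate. However, this indexing causes at least two problems:
1. Sequence records can get linked to the wrong isolate id when the same strain name has isolates with different passage types.
2. Duplicate sequences can appear in the virus table when a strain is renamed in GISAID and later reingested with the same isolate id.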
As an example of the first issue, two different isolate ids exist for the strain name A/AbuDhabi/240/2018: one for a cell-passaged isolate and one for an egg-passaged isolate. Because we index virus records on strain name, we have only one record with the name A/AbuDhabi/240/2018 and one isolate id, EPI_ISL_312868, which is the cell-passaged isolate id. When we include the egg-passaged sequences for this strain in our builds, we report the cell-passaged isolate id for them, which is incorrect.
As an example of the second issue, at one point the isolate id EPI_ISL_18430014 had a strain name of A/Moscow/MH144681S/2023 which was later renamed to A/Moscow/RII-MH144681S/2023. The isolate id and gene sequence id remain the same, but because we index on strain name, these appeared to be distinct records.
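Both failure modes can be reproduced with plain dictionaries standing in for the table. This is only a minimal sketch, not the actual ingest code: the field names `strain`, `isolate_id`, and `passage` and the helper `index_by` are assumptions for illustration, and the egg-passaged isolate id is a placeholder because the real id is not listed here.

```python
def index_by(records, key):
    """Build a table keyed on `key`; a later record replaces an earlier one."""
    table = {}
    for record in records:
        table[record[key]] = record
    return table


records = [
    # Problem 1: one strain name, two isolates with different passage types.
    # The egg-passaged id is a placeholder, not a real GISAID id.
    {"strain": "A/AbuDhabi/240/2018", "isolate_id": "EPI_ISL_EGG_PLACEHOLDER", "passage": "egg"},
    {"strain": "A/AbuDhabi/240/2018", "isolate_id": "EPI_ISL_312868", "passage": "cell"},
    # Problem 2: one isolate, renamed in GISAID and ingested again.
    # Passage type is not stated for this isolate, so it is left as None.
    {"strain": "A/Moscow/MH144681S/2023", "isolate_id": "EPI_ISL_18430014", "passage": None},
    {"strain": "A/Moscow/RII-MH144681S/2023", "isolate_id": "EPI_ISL_18430014", "passage": None},
]

# Keyed on strain: the two AbuDhabi isolates collapse into one record,
# while the renamed Moscow isolate shows up under both names.
by_strain = index_by(records, "strain")

# Keyed on isolate id: both AbuDhabi isolates survive and the Moscow
# isolate appears once, under its most recently ingested name.
by_isolate = index_by(records, "isolate_id")
```

Which AbuDhabi record survives under the strain key depends on ingest order; the order above is chosen to match what we currently see in the table.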
Proposed solution
GISAID distinguishes viruses by their isolate ids and not by their strain names, allowing multiple versions of the same strain to be included in the database. I propose that we follow this data model in the RethinkDB table, too, by changing the viruses index key from strain to isolate_id.
I realize this is a potentially breaking change, but I think we could make it with the following general steps (specifics may vary and be much messier):
1. Export the entire flu_viruses and flu_sequences tables to disk
2. Copy the existing tables to backup copies in the database
3. Delete all records in the original tables
4. Change the index key in the original flu_viruses table
5. Import all records from disk into the updated table
6. Test resolution of duplicates with a download of sequences
7. Update duplicate resolution logic to select the latest isolate id per passage type?
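The re-keying and duplicate-resolution steps above could be prototyped on the exported records before touching the database. As far as I know, RethinkDB sets a table's primary key when the table is created, so changing the key probably means recreating the table rather than altering it. The sketch below is pure Python and makes several assumptions that would need checking against the real export: each record has `strain`, `isolate_id`, `passage`, and a sortable `timestamp` field, and a larger accession number means a later isolate id. All function names and example records are hypothetical.

```python
def rekey_by_isolate(exported):
    """Key exported virus records on isolate_id instead of strain.

    Returns (table, skipped): `table` maps isolate_id to one record and
    `skipped` lists records that have no isolate_id to key on.
    """
    table, skipped = {}, []
    for record in exported:
        isolate_id = record.get("isolate_id")
        if not isolate_id:
            skipped.append(record)
            continue
        current = table.get(isolate_id)
        # Assumption: `timestamp` is a sortable ingest date (ISO string),
        # so the most recently ingested version of an isolate wins.
        if current is None or record["timestamp"] >= current["timestamp"]:
            table[isolate_id] = record
    return table, skipped


def accession_number(isolate_id):
    """Numeric part of an id such as EPI_ISL_312868."""
    return int(isolate_id.rsplit("_", 1)[1])


def latest_per_passage(table):
    """Pick one record per (strain, passage) pair.

    Assumption: a larger accession number means a later isolate id.
    """
    best = {}
    for record in table.values():
        key = (record["strain"], record["passage"])
        if key not in best or (
            accession_number(record["isolate_id"])
            > accession_number(best[key]["isolate_id"])
        ):
            best[key] = record
    return best


# Hypothetical records; none of these ids, names, or dates are real.
exported = [
    {"strain": "A/Example/1/2020", "isolate_id": "EPI_ISL_100", "passage": "cell", "timestamp": "2020-01-01"},
    {"strain": "A/Example/1/2020", "isolate_id": "EPI_ISL_200", "passage": "cell", "timestamp": "2020-02-01"},
    {"strain": "A/Example/1/2020", "isolate_id": "EPI_ISL_300", "passage": "egg", "timestamp": "2020-03-01"},
    {"strain": "A/Example/2/2020", "isolate_id": "EPI_ISL_400", "passage": "cell", "timestamp": "2020-04-01"},
    {"strain": "A/Example-renamed/2/2020", "isolate_id": "EPI_ISL_400", "passage": "cell", "timestamp": "2020-05-01"},
    {"strain": "A/Example/3/2020", "isolate_id": None, "passage": "cell", "timestamp": "2020-06-01"},
]

table, skipped = rekey_by_isolate(exported)
resolved = latest_per_passage(table)
```

Records that land in `skipped` would need a decision before the import, since they cannot be keyed on isolate id at all.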