nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

seasonal flu "virus" records should be indexed by isolate id instead of strain name #165

Open huddlej opened 1 month ago

huddlej commented 1 month ago

Description

We currently index the virus table in RethinkDB on the strain name of each isolate. However, this indexing causes at least two problems:

  1. sequence records can get linked to the wrong isolate id when the passage type for the same strain name differs
  2. duplicate records can appear in the virus table when a strain is renamed in GISAID and later reingested under the same isolate id

As an example of the first issue, two different isolate ids exist for the strain name A/AbuDhabi/240/2018: one for a cell-passaged isolate and one for an egg-passaged isolate. Because we index virus records on strain name, we have only one record with the name A/AbuDhabi/240/2018 and one isolate id, EPI_ISL_312868, which is the cell-passaged isolate id. When we include the egg-passaged sequences for this strain in our builds, we report the incorrect isolate id.

As an example of the second issue, at one point the isolate id EPI_ISL_18430014 had a strain name of A/Moscow/MH144681S/2023, which was later renamed to A/Moscow/RII-MH144681S/2023. The isolate id and gene sequence id remain the same, but because we index on strain name, the old and new names appeared to be distinct records.
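Both failure modes can be reproduced with a toy in-memory table keyed on strain name. This is only a sketch: the egg-passaged isolate id is a placeholder (I haven't listed the real one here), the `passage` values for the Moscow records are left as `None`, and the "first record wins" rule is an assumption standing in for whatever order fauna actually ingests records.

```python
# Toy stand-in for a virus table whose key is the strain name.
records = [
    {"strain": "A/AbuDhabi/240/2018", "isolate_id": "EPI_ISL_312868", "passage": "cell"},
    # Placeholder id: the real egg-passaged isolate id is not given above.
    {"strain": "A/AbuDhabi/240/2018", "isolate_id": "EPI_ISL_EGG_PLACEHOLDER", "passage": "egg"},
    # Same isolate id under its old and new strain names.
    {"strain": "A/Moscow/MH144681S/2023", "isolate_id": "EPI_ISL_18430014", "passage": None},
    {"strain": "A/Moscow/RII-MH144681S/2023", "isolate_id": "EPI_ISL_18430014", "passage": None},
]


def index_by_strain(records):
    """Key records on strain name; the first record seen for a strain wins (assumed)."""
    table = {}
    for record in records:
        table.setdefault(record["strain"], record)
    return table


by_strain = index_by_strain(records)

# Problem 1: the egg-passaged isolate has no record of its own, so any
# sequence looked up by this strain name reports the cell-passaged id.
abu_dhabi_id = by_strain["A/AbuDhabi/240/2018"]["isolate_id"]

# Problem 2: one isolate id is stored under two different strain names.
moscow_strains = sorted(
    strain
    for strain, record in by_strain.items()
    if record["isolate_id"] == "EPI_ISL_18430014"
)
```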

Proposed solution

GISAID distinguishes viruses by their isolate ids and not by their strain names, allowing multiple versions of the same strain to be included in the database. I propose that we follow this data model in the RethinkDB table, too, by changing the viruses index key from strain to isolate_id.
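To illustrate the proposed data model, here is a minimal in-memory sketch that keys records on `isolate_id` and derives a strain-to-isolates lookup from that. The function name `build_tables`, the placeholder egg-passaged isolate id, and the "latest ingest wins" rule are all assumptions for illustration and not how fauna works today.

```python
from collections import defaultdict

records = [
    {"strain": "A/AbuDhabi/240/2018", "isolate_id": "EPI_ISL_312868", "passage": "cell"},
    # Placeholder id: the real egg-passaged isolate id is not given above.
    {"strain": "A/AbuDhabi/240/2018", "isolate_id": "EPI_ISL_EGG_PLACEHOLDER", "passage": "egg"},
    # Old strain name, ingested first, then the renamed record for the same isolate.
    {"strain": "A/Moscow/MH144681S/2023", "isolate_id": "EPI_ISL_18430014", "passage": None},
    {"strain": "A/Moscow/RII-MH144681S/2023", "isolate_id": "EPI_ISL_18430014", "passage": None},
]


def build_tables(records):
    """Key records on isolate id and build a strain -> isolate ids lookup."""
    by_isolate = {}
    for record in records:
        # A later ingest of the same isolate id replaces the earlier record,
        # so a rename in GISAID updates the record instead of duplicating it.
        by_isolate[record["isolate_id"]] = record

    by_strain = defaultdict(list)
    for isolate_id, record in by_isolate.items():
        by_strain[record["strain"]].append(isolate_id)
    return by_isolate, dict(by_strain)


by_isolate, by_strain = build_tables(records)
```

With this keying, A/AbuDhabi/240/2018 maps to two isolate ids (one per passage type), and EPI_ISL_18430014 appears once under its current strain name.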

I realize this is a potentially breaking change, but I think we could make it with the following general steps (specifics may vary and be much messier):

  1. Export the entire flu_viruses and flu_sequences tables to disk
  2. Copy the existing tables to backup copies in the database
  3. Delete all records in the original tables
  4. Change the index key in the original flu_viruses table
  5. Import all records from disk into the updated table
  6. Test resolution of duplicates with a download of sequences
  7. Update duplicate resolution logic to select the latest isolate id by passage type?
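For the last step, one possible shape for the duplicate resolution is sketched below. It assumes that "latest" means the largest numeric suffix of an `EPI_ISL_<number>` id, which may not be the right rule (a submission or update date might be better). The strain name and isolate ids in the example are made up, and `resolve_duplicates` is a hypothetical helper, not existing fauna code.

```python
import re


def isolate_number(isolate_id):
    """Return the numeric part of an id shaped like EPI_ISL_<number>."""
    match = re.fullmatch(r"EPI_ISL_(\d+)", isolate_id)
    if match is None:
        raise ValueError(f"unexpected isolate id format: {isolate_id!r}")
    return int(match.group(1))


def resolve_duplicates(records):
    """Keep one record per (strain, passage): the one with the highest isolate number."""
    latest = {}
    for record in records:
        key = (record["strain"], record["passage"])
        current = latest.get(key)
        if current is None or (
            isolate_number(record["isolate_id"]) > isolate_number(current["isolate_id"])
        ):
            latest[key] = record
    return latest


# Made-up records for illustration only.
example = [
    {"strain": "A/Example/1/2020", "passage": "cell", "isolate_id": "EPI_ISL_100"},
    {"strain": "A/Example/1/2020", "passage": "cell", "isolate_id": "EPI_ISL_200"},
    {"strain": "A/Example/1/2020", "passage": "egg", "isolate_id": "EPI_ISL_150"},
]
resolved = resolve_duplicates(example)
```

Separately, as far as I know RethinkDB sets a table's primary key when the table is created (the `primary_key` option of `table_create`), so step 4 may in practice mean recreating flu_viruses with the new key instead of altering it in place.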

Additional context