nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Revisit tdb/upload's `index_fields` #144

Open joverlee521 opened 8 months ago

joverlee521 commented 8 months ago

Context

Currently, all titer data uploaded to fauna uses tdb/upload's `index_fields` to create the record index.

If any of these field values change, the generated index no longer matches the one from previous uploads, and duplicate records get added to the database. This means that whenever we encounter changes in these values, we need to delete the old records and then upload the new ones. See the example discussed in https://github.com/nextstrain/fauna/issues/126#issuecomment-1241297382 and the latest example on Slack.
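To make the failure mode concrete, here is a minimal sketch, assuming the index is just a tuple of the chosen field values. The field names, the `make_index` and `upload` helpers, the sample values, and the dict standing in for the database are all hypothetical, not fauna's actual code.

```python
# Hypothetical field list; the real one lives in tdb/upload's index_fields.
INDEX_FIELDS = ["virus_strain", "serum_strain", "serum_id", "source"]

def make_index(record):
    """Build a key from the index fields (illustrative only)."""
    return tuple(str(record[field]) for field in INDEX_FIELDS)

def upload(db, record):
    """Insert or overwrite the record stored under its index."""
    db[make_index(record)] = record

db = {}
original = {
    "virus_strain": "A/X/1",  # made-up values
    "serum_strain": "A/Y/2",
    "serum_id": "F10",
    "source": "batch_a",
    "titer": 160,
}
upload(db, original)

# A change in a non-index field keeps the key, so the record is updated in place.
upload(db, dict(original, titer=320))
updated_in_place = len(db)

# A change in an index field produces a new key, so the old record lingers.
upload(db, dict(original, serum_id="F10-2020"))
after_index_change = len(db)
```

In this sketch `updated_in_place` is 1 and `after_index_change` is 2: the stale record under the old `serum_id` has to be deleted by hand.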

If the index fields are too specific, then more kinds of data changes can alter the index, and we have to be wary of all of them. We see this in titer uploads that use tdb/elife_upload under the hood. Within elife_upload, a row counter is appended to the source field, so a change in the order of records can create duplicate records.
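A rough sketch of the row-counter problem, assuming the counter is simply the row's position in the file. The `index_rows` helper, the `source_N` key format, and the sample rows are hypothetical and only illustrate the idea.

```python
def index_rows(rows, source):
    """Key each row by its strains plus a position-dependent source (illustrative)."""
    return {
        (row["virus_strain"], row["serum_strain"], f"{source}_{i}"): row
        for i, row in enumerate(rows)
    }

rows = [
    {"virus_strain": "A/X/1", "serum_strain": "A/Y/2", "titer": 160},  # made-up data
    {"virus_strain": "A/X/1", "serum_strain": "A/Z/3", "titer": 80},
]

db = {}
db.update(index_rows(rows, "batch_a"))
# Same file, same measurements, but the rows arrive in a different order.
db.update(index_rows(list(reversed(rows)), "batch_a"))
```

After the second upload `db` holds four entries for two measurements, because every row received a different counter.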

If the index fields are not specific enough, then distinct records can end up with the same index and overwrite each other. We see this in the CDC titer uploads, which do not append the row counter to the source. Records that have different passage details that get categorized as the same passage category (e.g. both "S1" and "S3" -> "cell") will create the same index and thus overwrite each other in the database.
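The opposite failure can be sketched the same way. This assumes the index uses the passage category rather than the raw passage value; the mapping below only encodes the "S1"/"S3" example above, and the field names and sample values are hypothetical.

```python
# Only the S1/S3 -> cell mapping comes from the example above.
PASSAGE_CATEGORY = {"S1": "cell", "S3": "cell"}

def make_index(record):
    """Key on strains plus the coarse passage category (illustrative)."""
    return (
        record["virus_strain"],
        record["serum_strain"],
        PASSAGE_CATEGORY[record["virus_passage"]],
    )

db = {}
for passage, titer in [("S1", 160), ("S3", 320)]:
    record = {
        "virus_strain": "A/X/1",  # made-up values
        "serum_strain": "A/Y/2",
        "virus_passage": passage,
        "titer": titer,
    }
    db[make_index(record)] = record  # the second record replaces the first

surviving = next(iter(db.values()))
```

Only one record remains, and it is the "S3" one; the "S1" measurement is silently lost.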

Possible solution

Solutions I can think of at the moment, but would love to hear other ideas:

  1. Specify different index fields per CC's titer upload to tailor them to the different data. However, this may be even more confusing in the long run because we'd have to be wary of different field changes per CC.
  2. Move away from fauna/rethinkdb for titer data. I think we'd create a standardized TSV per Excel/TSV file that we receive, and they can all be concatenated into one TSV as our central "database". Then we'd only have to be wary of changes per file instead of changes in specific fields. However, this would mean we'd have to reconstruct the central "database" every time we get new data.
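For option 2, a minimal sketch of what "concatenate standardized TSVs" could look like, using only the standard library. The column names, file names, and the `standardize` and `build_central_tsv` helpers are assumptions for illustration; in-memory strings stand in for the received files.

```python
import csv
import io

# Hypothetical standardized columns.
COLUMNS = ["virus_strain", "serum_strain", "titer", "source_file"]

def standardize(raw_tsv, source_file):
    """Parse one received file and tag each row with the file it came from."""
    reader = csv.DictReader(io.StringIO(raw_tsv), delimiter="\t")
    return [
        {
            "virus_strain": row["virus_strain"],
            "serum_strain": row["serum_strain"],
            "titer": row["titer"],
            "source_file": source_file,
        }
        for row in reader
    ]

def build_central_tsv(files):
    """Rebuild the central TSV from scratch out of every standardized file."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for name in sorted(files):
        writer.writerows(standardize(files[name], name))
    return out.getvalue()

header = "virus_strain\tserum_strain\ttiter\n"
files = {
    "batch_a.tsv": header + "A/X/1\tA/Y/2\t160\n",  # made-up data
    "batch_b.tsv": header + "A/X/1\tA/Z/3\t80\n",
}
central = build_central_tsv(files)

# A corrected file replaces the old one wholesale; no per-field index is involved.
files["batch_a.tsv"] = header + "A/X/1\tA/Y/2\t320\n"
central = build_central_tsv(files)
```

The trade-off named above is visible here: a correction never leaves a stale row behind, but every change triggers a full rebuild.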

huddlej commented 8 months ago

> Move away from fauna/rethinkdb for titer data

+many for this.