nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

script/delete_records: Add option to match fields with regex pattern #151

Open joverlee521 opened 10 months ago

joverlee521 commented 10 months ago

Uses RethinkDB's match command to select records whose field value matches the provided regex pattern. See the RethinkDB docs for more details: https://rethinkdb.com/api/python/match/
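To make the selection rule concrete, here is a minimal local sketch. It assumes a "field:pattern" argument form like the one used in the preview command in this thread, and it uses Python's re.search as a stand-in for RethinkDB's match (which, as far as I know, also matches anywhere in the string unless the pattern is anchored, but uses RE2 syntax). The helper names parse_match and matches_all and the sample records are hypothetical, not code from this PR.

```python
import re


def parse_match(arg):
    """Split a 'field:pattern' argument at the first colon (assumed format)."""
    field, sep, pattern = arg.partition(":")
    if not sep or not field or not pattern:
        raise ValueError(f"expected 'field:pattern', got {arg!r}")
    return field, pattern


def matches_all(record, matches):
    """Return True when every listed field is a string that matches its pattern."""
    for field, pattern in matches.items():
        value = record.get(field)
        # re.search stands in for RethinkDB's match here (an approximation).
        if not isinstance(value, str) or re.search(pattern, value) is None:
            return False
    return True


# Made-up records for illustration only.
records = [
    {"accession": "EPIEPI123456", "source": "gisaid"},
    {"accession": "EPI123456", "source": "gisaid"},
    {"accession": "XEPIEPI99", "source": "gisaid"},
]

matches = dict([parse_match("accession:^EPIEPI")])
selected = [rec["accession"] for rec in records if matches_all(rec, matches)]
print(selected)
```

The leading ^ matters: without it, an accession that merely contains "EPIEPI" somewhere in the middle would also be selected.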

This was prompted by our need to delete flu sequence records that have accessions with pattern "EPIEPI". We've fixed the accession with https://github.com/nextstrain/fauna/pull/148, but we need to manually remove the old duplicate sequence records because the flu sequence table uses the accession as the index.¹

¹ https://github.com/nextstrain/fauna/blob/ec1feb679715890ae6d14efe11c979f27d6f1d6f/vdb/upload.py#L82

joverlee521 commented 10 months ago

Testing locally with the --preview flag:

$ envdir ../env.d/seasonal-flu/ python scripts/delete_records.py -db vdb -v flu_sequences --match "accession:^EPIEPI" --preview
Connected to the "vdb" database
Delete filters: {}
Delete matches: {'accession': '^EPIEPI'}
Delete intervals: {}
Preview: selection would delete 15933 records
Sources of deleted records: {'gisaid'}

joverlee521 commented 10 months ago

One potential issue is that the sequence accessions are added to the virus records during upload: https://github.com/nextstrain/fauna/blob/dda8186ed47a254e91ade0e87240d5a53e2f046b/vdb/upload.py#L477-L491

So even if we delete the "bad" accession sequence records, they are still listed in the virus records' "sequences" field. The --overwrite option for flu_upload will only append new sequences with set_union.

https://github.com/nextstrain/fauna/blob/dda8186ed47a254e91ade0e87240d5a53e2f046b/vdb/upload.py#L606-L612
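A small sketch of why the stale accessions survive a re-upload. The set_union function below is a plain-Python approximation of RethinkDB's set_union (existing values are kept, unseen values are added, duplicates are dropped); the virus record, strain name, and accessions are made up for illustration, and this is not the code in vdb/upload.py.

```python
def set_union(existing, new):
    """Approximate set_union: keep existing items, add unseen ones, no duplicates."""
    result = []
    for item in list(existing) + list(new):
        if item not in result:
            result.append(item)
    return result


# Hypothetical virus record that still lists the "bad" accession.
virus = {"strain": "A/Example/1/2020", "sequences": ["EPIEPI123456"]}

# Re-uploading the corrected sequence only adds to the list.
virus["sequences"] = set_union(virus["sequences"], ["EPI123456"])

# Both the stale and the corrected accession are now present.
print(virus["sequences"])
```

Because a union never removes elements, clearing the stale entries would need a separate update on the virus records rather than another upload.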


Functionally, I don't think this is an issue because I cannot find any script that actually uses the "sequences"/"number_sequences" fields from the virus table. It's messy data that annoys me, but I can also ignore it if it's not important to others.