Open joverlee521 opened 10 months ago
Testing locally with the --preview flag:
$ envdir ../env.d/seasonal-flu/ python scripts/delete_records.py -db vdb -v flu_sequences --match "accession:^EPIEPI" --preview
Connected to the "vdb" database
Delete filters: {}
Delete matches: {'accession': '^EPIEPI'}
Delete intervals: {}
Preview: selection would delete 15933 records
Sources of deleted records: {'gisaid'}
One potential issue with this is the sequence accessions are added to the virus records during upload: https://github.com/nextstrain/fauna/blob/dda8186ed47a254e91ade0e87240d5a53e2f046b/vdb/upload.py#L477-L491
So even if we delete the "bad" accession sequence records, they are still listed in the virus records' "sequences" field.
The --overwrite option for flu_upload will only append new sequences with set_union.
Functionally, I don't think this is an issue because I cannot find any script that actually uses the "sequences"/"number_sequences" fields from the virus table. It's messy data that annoys me, but I can also ignore it if it's not important to others.
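If the stale accessions ever do need cleaning up, the idea could be sketched roughly as below. This is only an illustration on plain Python dicts, not code from fauna: the helper name prune_deleted_accessions is hypothetical, and it assumes the virus record's "sequences" field is a flat list of accession strings with "number_sequences" holding its length, which I have not verified against the actual table.

```python
def prune_deleted_accessions(virus, deleted_accessions):
    """Return a copy of a virus record without the deleted accessions.

    Assumption: virus["sequences"] is a list of accession strings and
    virus["number_sequences"] mirrors its length.
    """
    remaining = [
        accession
        for accession in virus.get("sequences", [])
        if accession not in deleted_accessions
    ]
    updated = dict(virus)  # shallow copy so the input record is left untouched
    updated["sequences"] = remaining
    updated["number_sequences"] = len(remaining)
    return updated


# Made-up record for illustration only
virus = {
    "strain": "A/Example/1/2020",
    "sequences": ["EPI123", "EPIEPI123"],
    "number_sequences": 2,
}
cleaned = prune_deleted_accessions(virus, {"EPIEPI123"})
```

Here cleaned keeps only "EPI123" and reports one sequence, while the original virus dict is unchanged. Writing the result back to the virus table would be a separate step that is not shown.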
Uses rethinkdb's match command to filter for records with a field value that matches the provided regex pattern. See the rethinkdb docs for more details: https://rethinkdb.com/api/python/match/
This was prompted by our need to delete flu sequence records that have accessions with the pattern "EPIEPI". We've fixed the accession with https://github.com/nextstrain/fauna/pull/148, but we need to manually remove the old duplicate sequence records because the flu sequence table uses the accession as the index.¹
¹ https://github.com/nextstrain/fauna/blob/ec1feb679715890ae6d14efe11c979f27d6f1d6f/vdb/upload.py#L82
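To illustrate which accessions a pattern like "^EPIEPI" selects, here is a small offline sketch that emulates the matching with Python's re module instead of querying the database. The helper select_matching and the sample records are made up for illustration; the real filter runs server-side via rethinkdb's match (in the rethinkdb docs this looks like a filter with doc['name'].match("^A")), which uses RE2 syntax, so behaviour may differ from Python's re for more complex patterns.

```python
import re


def select_matching(records, field, pattern):
    """Return records whose `field` is a string containing a match for `pattern`.

    Like rethinkdb's match, this searches anywhere in the string, so the
    pattern must carry its own anchors (e.g. a leading ^).
    """
    regex = re.compile(pattern)
    return [
        record
        for record in records
        if isinstance(record.get(field), str) and regex.search(record[field])
    ]


# Made-up sample records for illustration only
records = [
    {"accession": "EPIEPI1234", "source": "gisaid"},
    {"accession": "EPI1234", "source": "gisaid"},
    {"accession": "XEPIEPI99", "source": "gisaid"},
]
matched = select_matching(records, "accession", "^EPIEPI")
```

Because the pattern is anchored with ^, only the first record is selected; the corrected accession "EPI1234" and the record that merely contains "EPIEPI" are left alone.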
Checklist