nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Fix GISAID flu accessions #148

Closed joverlee521 closed 10 months ago

joverlee521 commented 10 months ago

Description of proposed changes

While digging through the git history, I found an example of a FASTA header that suggests GISAID did not previously include the "EPI" prefix in their "DNA Accession no." field.¹

This must have changed around October 2023, because sequences submitted after September 27th, 2023 have "EPIEPI" accessions.²

This commit changes the ingest to only prefix with "EPI" if the accession does not already have the prefix.
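The conditional prefixing described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the helper name `normalize_accession` and its `prefix` parameter are hypothetical and do not come from `vdb/gisaid_flu_upload.py`.

```python
def normalize_accession(accession: str, prefix: str = "EPI") -> str:
    """Return the accession with a leading prefix (hypothetical helper)."""
    accession = accession.strip()
    # Only add the prefix when it is missing, so an accession that
    # GISAID already exported as "EPI2759658" is left untouched.
    if accession.startswith(prefix):
        return accession
    return prefix + accession


# Old-style export (no prefix) and new-style export (prefix included).
examples = ["2759658", "EPI2759658"]
normalized = [normalize_accession(a) for a in examples]
print(normalized)  # ['EPI2759658', 'EPI2759658']
```

Note that a check like this only prevents new "EPIEPI" accessions; in this sketch, records that were already stored with the doubled prefix are returned unchanged and would need a separate cleanup.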

¹ https://github.com/nextstrain/fauna/blame/f485baa3621002b3ff6f833c743180239a92bf14/vdb/gisaid_flu_upload.py#L281-L282
² https://bedfordlab.slack.com/archives/C03KWDET9/p1700609695217959


joverlee521 commented 10 months ago

Without this change, the --preview output shows the "EPIEPI" pattern that we are seeing:

...
"sequences": [
  "EPIEPI2759658",
  "EPIEPI2759659",
  "EPIEPI2759660",
  "EPIEPI2759661",
  "EPIEPI2759662",
  "EPIEPI2759663",
  "EPIEPI2759664",
  "EPIEPI2759665"
],
...

With these changes, the accessions are fixed in the --preview output:

...
"sequences": [
  "EPI2759658",
  "EPI2759659",
  "EPI2759660",
  "EPI2759661",
  "EPI2759662",
  "EPI2759663",
  "EPI2759664",
  "EPI2759665"
],
...