Closed joverlee521 closed 1 year ago
Without this change, the --preview output shows the "EPIEPI" pattern that we are seeing:
...
"sequences": [
"EPIEPI2759658",
"EPIEPI2759659",
"EPIEPI2759660",
"EPIEPI2759661",
"EPIEPI2759662",
"EPIEPI2759663",
"EPIEPI2759664",
"EPIEPI2759665"
],
...
With these changes, the accessions are fixed in the --preview output:
...
"sequences": [
"EPI2759658",
"EPI2759659",
"EPI2759660",
"EPI2759661",
"EPI2759662",
"EPI2759663",
"EPI2759664",
"EPI2759665"
],
...
Description of proposed changes
Digging through git history, I found an example of a FASTA header that suggests GISAID did not previously include the "EPI" prefix in their "DNA Accession no." field.¹
The format must have changed around October 2023, because sequences submitted after September 27th, 2023 have "EPIEPI" accessions.²
This commit changes the ingest to only prefix with "EPI" if the accession does not already have the prefix.
¹ https://github.com/nextstrain/fauna/blame/f485baa3621002b3ff6f833c743180239a92bf14/vdb/gisaid_flu_upload.py#L281-L282
² https://bedfordlab.slack.com/archives/C03KWDET9/p1700609695217959
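For illustration, the conditional prefixing described above could look roughly like the sketch below. This is not the code from the commit: the function name `ensure_epi_prefix` and its `prefix` parameter are hypothetical, and passing empty values through unchanged is an assumption.

```python
def ensure_epi_prefix(accession: str, prefix: str = "EPI") -> str:
    """Return the accession with a single leading prefix.

    Hypothetical helper for illustration only; the actual ingest code may differ.
    """
    accession = accession.strip()
    # Assumption: leave empty values alone rather than emitting a bare "EPI".
    if not accession:
        return accession
    # Newer GISAID exports already include the prefix, so only add it when missing.
    if accession.startswith(prefix):
        return accession
    return prefix + accession


if __name__ == "__main__":
    # Older-style accession without the prefix, and newer-style with it.
    for raw in ["2759658", "EPI2759658"]:
        print(ensure_epi_prefix(raw))
```

Checking for the prefix before adding it makes the step idempotent, so records without the prefix and records that already carry it both end up with a single "EPI".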
Checklist