nextstrain / zika

Nextstrain build for Zika virus
https://nextstrain.org/zika

ingest: fix csvtk quotes #58

Closed joverlee521 closed 2 months ago

joverlee521 commented 2 months ago

The automated ingest workflow failed with a csvtk quoting error.¹ Following https://github.com/nextstrain/docker-base/pull/209, we can now use csvtk fix-quotes and csvtk del-quotes to work around the quoting issue.

¹ https://github.com/nextstrain/zika/actions/runs/8926866948/job/24518932039#step:8:139

joverlee521 commented 2 months ago

The error was caused by a new Zika record with internal quotes in its submitter.affiliation field:

{"accession": "OR701943.1", "completeness": "PARTIAL", "host": {"lineage": [{"name": "cellular organisms", "taxId": 131567}, {"name": "Eukaryota", "taxId": 2759}, {"name": "Opisthokonta", "taxId": 33154}, {"name": "Metazoa", "taxId": 33208}, {"name": "Eumetazoa", "taxId": 6072}, {"name": "Bilateria", "taxId": 33213}, {"name": "Protostomia", "taxId": 33317}, {"name": "Ecdysozoa", "taxId": 1206794}, {"name": "Panarthropoda", "taxId": 88770}, {"name": "Arthropoda", "taxId": 6656}, {"name": "Mandibulata", "taxId": 197563}, {"name": "Pancrustacea", "taxId": 197562}, {"name": "Hexapoda", "taxId": 6960}, {"name": "Insecta", "taxId": 50557}, {"name": "Dicondylia", "taxId": 85512}, {"name": "Pterygota", "taxId": 7496}, {"name": "Neoptera", "taxId": 33340}, {"name": "Endopterygota", "taxId": 33392}, {"name": "Diptera", "taxId": 7147}, {"name": "Nematocera", "taxId": 7148}, {"name": "Culicomorpha", "taxId": 43786}, {"name": "Culicoidea", "taxId": 41827}, {"name": "Culicidae", "taxId": 7157}, {"name": "Culicinae", "taxId": 43817}, {"name": "Aedini", "taxId": 1056966}, {"name": "Aedes", "taxId": 7158}, {"name": "Stegomyia", "taxId": 53541}, {"name": "Aedes aegypti", "taxId": 7159}], "organismName": "Aedes aegypti", "taxId": 7159}, "isAnnotated": true, "isolate": {"collectionDate": "2021-11-11", "name": "6PYUC2022"}, "length": 217, "location": {"geographicLocation": "Mexico: Yucatan, Merida", "geographicRegion": "North America"}, "nucleotide": {"sequenceHash": "6FD6033C"}, "proteinCount": 1, "releaseDate": "2024-05-01T00:00:00Z", "sourceDatabase": "GenBank", "submitter.affiliation": "Centro de Investigaciones Regionales \"Dr. Hideyo Noguchi\", Laboratorio de Arbovirologia", "submitter.country": "Mexico", "submitter.names": ["Argaez-Sierra,D.G.", "Baak-Baak,C.M.", "Cigarroa-Toledo,N.", "Garcia-Rejon,J.E.", "Tzuc-Dzul,J.C.", "Acosta-Viana,K.Y.", "Nunez-Corea,D.A."], "updateDate": "2024-05-01T00:00:00Z", "virus": {"lineage": [{"name": "Viruses", "taxId": 10239}, {"name": "Riboviria", "taxId": 2559587}, {"name": "Orthornavirae", "taxId": 2732396}, {"name": "Kitrinoviricota", "taxId": 2732406}, {"name": "Flasuviricetes", "taxId": 2732462}, {"name": "Amarillovirales", "taxId": 2732545}, {"name": "Flaviviridae", "taxId": 11050}, {"name": "Orthoflavivirus", "taxId": 3044782}, {"name": "Orthoflavivirus zikaense", "taxId": 3048459}, {"name": "Zika virus", "taxId": 64320}], "organismName": "Zika virus", "taxId": 64320}}
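
To see where the internal quotes sit, here is a minimal Python sketch that parses a trimmed copy of the record above. Only the accession and submitter.affiliation fields are kept; the trimming and the variable names are illustrative and not part of the ingest workflow.

```python
import json

# Trimmed copy of the record above: every other field is omitted for brevity.
raw = r'{"accession": "OR701943.1", "submitter.affiliation": "Centro de Investigaciones Regionales \"Dr. Hideyo Noguchi\", Laboratorio de Arbovirologia"}'

record = json.loads(raw)
affiliation = record["submitter.affiliation"]

# The JSON escapes (\") decode to literal double quotes inside the value.
print(affiliation)
print("internal double quotes:", affiliation.count('"'))
```

Once decoded, the value carries two bare double quotes. Any downstream step that writes it into a delimited text file has to decide how to quote or escape them.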

I confirmed locally that the output for format_ncbi_dataset_report has the correct quoting in submitter-affiliation.

accession   accession-rev   sourcedb    sra-accs    isolate-lineage geo-region  geo-location    isolate-collection-date release-date    update-date length  host-name   isolate-lineage-source  biosample-acc   submitter-names submitter-affiliation   submitter-country
OR701943    OR701943.1  GenBank     6PYUC2022   North America   Mexico: Yucatan, Merida 2021-11-11  2024-05-01T00:00:00Z    2024-05-01T00:00:00Z    217 Aedes aegypti           Argaez-Sierra,D.G.,Baak-Baak,C.M.,Cigarroa-Toledo,N.,Garcia-Rejon,J.E.,Tzuc-Dzul,J.C.,Acosta-Viana,K.Y.,Nunez-Corea,D.A.    Centro de Investigaciones Regionales "Dr. Hideyo Noguchi", Laboratorio de Arbovirologia Mexico

The final metadata.tsv has doubled quotes in the institution column, but that is due to an augur curate passthru bug.

genbank_accession   genbank_accession_rev   strain  date    region  country division    location    length  host    release_date    update_date sra_accessions  authors institution
OR701943    OR701943.1  6PYUC2022   2021-11-11  North America   Mexico  Yucatan Merida  217 Aedes aegypti   2024-05-01  2024-05-01      Argaez-Sierra et al "Centro de Investigaciones Regionales ""Dr. Hideyo Noguchi"", Laboratorio de Arbovirologia"
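
For illustration only, the sketch below shows one common way doubling of this shape can arise: a CSV-style writer with minimal quoting wraps any field that contains the quote character and doubles each internal quote. It uses Python's standard csv module, not augur's code, and the helper write_tsv_row is hypothetical; whether augur curate passthru follows this path is an assumption.

```python
import csv
import io

institution = 'Centro de Investigaciones Regionales "Dr. Hideyo Noguchi", Laboratorio de Arbovirologia'


def write_tsv_row(fields, **writer_options):
    """Hypothetical helper: serialize one row as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n", **writer_options)
    writer.writerow(fields)
    return buffer.getvalue().rstrip("\n")


# Default QUOTE_MINIMAL: the field contains the quote character, so it is
# wrapped in quotes and each internal quote is doubled.
minimal = write_tsv_row(["OR701943", institution])

# QUOTE_NONE with no quote character: the field is written as-is.
unquoted = write_tsv_row(["OR701943", institution], quoting=csv.QUOTE_NONE, quotechar=None)

# A quote-aware reader recovers the original value from the doubled form.
round_trip = next(csv.reader(io.StringIO(minimal + "\n"), delimiter="\t"))

print(minimal)
print(unquoted)
print(round_trip[1] == institution)
```

The doubled form is valid CSV-style escaping and round-trips through a quote-aware reader. It only looks wrong to tools that treat the TSV as raw, unquoted text.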

joverlee521 commented 2 months ago

Merging to get our ingest going again, but I'll loop back to the augur curate issue.

joverlee521 commented 2 months ago

Manually triggered ingest-to-phylogenetic