nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Ingest VIDRL human sera references #160

Closed joverlee521 closed 3 months ago

joverlee521 commented 3 months ago

Description of proposed changes

First pass for ingesting VIDRL human sera references, focused on 2024 data. I'm hoping that ingesting older data should be as simple as updating the VACCINE_MAPPING, but we'll see...

Related issue(s)

Related to #158

Checklist

joverlee521 commented 3 months ago

Used this Snakefile to upload 2024 human sera data to test_tdb with changes up to https://github.com/nextstrain/fauna/pull/160/commits/f34c82686f0c5111738cf1813fd4cf9315fea81a

```
pyenv activate fauna
nextstrain build --cpus 2 --ambient --envdir ../env.d/seasonal-flu/ . --snakefile data/Snakefile --config year='2024' preview=False
```

This ran through 73 Excel workbooks without raising any errors 🎉
I will dig more into the uploaded data tomorrow to make sure nothing looks too out of place...

joverlee521 commented 3 months ago

Of the 73 processed workbooks, 1 did not have any human sera references. From the other 72 workbooks, the upload workflow added 5722 measurements to test_tdb/flu.

I tested download of the titer measurements using the seasonal-flu workflow with a small patch to use the `test_tdb` database.

```diff
diff --git a/workflow/snakemake_rules/download_from_fauna.smk b/workflow/snakemake_rules/download_from_fauna.smk
index 64e8350..04544df 100644
--- a/workflow/snakemake_rules/download_from_fauna.smk
+++ b/workflow/snakemake_rules/download_from_fauna.smk
@@ -68,7 +68,7 @@ rule download_titers:
     output:
         titers = "data/{lineage}/{center}_{passage}_{assay}_titers.tsv"
     params:
-        dbs = _get_tdb_databases,
+        dbs = 'test_tdb',
         assays = _get_tdb_assays,
         virus_passage_category=_get_virus_passage_category,
     conda: "../envs/nextstrain.yaml"
```

```
nextstrain build --envdir ../env.d/seasonal-flu/ . \
    data/h3n2/who_human_cell_hi_titers.tsv \
    data/h3n2/who_human_egg_hi_titers.tsv \
    data/h3n2/who_human_cell_fra_titers.tsv \
    data/h3n2/who_human_egg_fra_titers.tsv \
    data/h1n1pdm/who_human_cell_hi_titers.tsv \
    data/h1n1pdm/who_human_egg_hi_titers.tsv \
    data/vic/who_human_cell_hi_titers.tsv \
    data/vic/who_human_egg_hi_titers.tsv \
    --configfile profiles/upload.yaml
```

This downloaded 5694 of the 5722 uploaded measurements, all appropriately selected for the human host files. The remaining 28 measurements were excluded because their virus_passage_category was egg while their serum_passage_category was cell: the seasonal-flu workflow explicitly excludes egg-passaged test viruses from cell-passaged titer data.
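For illustration only, the exclusion rule described above can be sketched roughly as below. This is not the seasonal-flu workflow's actual code: the function name `split_by_passage` and the example records are hypothetical, and only the two field names (`virus_passage_category`, `serum_passage_category`) come from this thread.

```python
# Hypothetical sketch of the exclusion rule; not the workflow's real implementation.
def split_by_passage(measurements):
    """Split measurements into (kept, excluded) lists.

    A measurement is excluded when an egg-passaged test virus is paired
    with cell-passaged serum.
    """
    kept, excluded = [], []
    for m in measurements:
        if (m["virus_passage_category"] == "egg"
                and m["serum_passage_category"] == "cell"):
            excluded.append(m)
        else:
            kept.append(m)
    return kept, excluded


# Made-up example records, purely to exercise the function.
example = [
    {"virus_passage_category": "cell", "serum_passage_category": "cell"},
    {"virus_passage_category": "egg", "serum_passage_category": "egg"},
    {"virus_passage_category": "egg", "serum_passage_category": "cell"},
]
kept, excluded = split_by_passage(example)
print(len(kept), len(excluded))
```

Applied to the counts reported here, the same split would account for 5722 uploaded = 5694 downloaded + 28 excluded.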

I manually spot-checked 3 workbooks per subtype to verify that all of the human sera reference measurements were included. At this point, I'm pretty confident that this at least works for the 2024 files.

joverlee521 commented 3 months ago

I'll plan to merge and upload the 2024 data as part of tomorrow's ingest.