nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Parse VIDRL human sera pool measurements #158

Open huddlej opened 3 months ago

huddlej commented 3 months ago

Description

Summary of the plan for parsing human sera pool measurements from VIDRL spreadsheets:

  1. Parse past VIDRL spreadsheets to find all distinct serum id values for human sera pools, so we know what values need to be mapped to vaccine strain names
  2. Create a TSV file per subtype in fauna that maps human sera pool ids from VIDRL to vaccine strains (e.g., “SH 2024 EGG” to “A/Thailand/8/2022-egg” for H3N2) using seasonal-flu vaccine.json files (e.g., H3N2) as a source of truth
  3. Add logic to tdb/vidrl_upload.py to convert serum_id (e.g., “SH 2024”) and serum_passage (e.g., “EGG”) values from the parsed titer blocks to a key that appears in the TSV file mapping above and use that key to set the serum strain to the vaccine strain name
  4. Run the upload script on past spreadsheets (as a dry run?) and confirm that human sera pool measurements get extracted for H1, H3, and Vic
  5. Upload just the new human sera pool measurements to fauna
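
Steps 1 to 3 above could be sketched roughly as follows. This is a minimal illustration, not the actual logic in tdb/vidrl_upload.py: the TSV column names (`serum_key`, `vaccine_strain`), the helper function names, and the whitespace/uppercase normalization are all assumptions made for the example. Only the "SH 2024" / "EGG" / "A/Thailand/8/2022-egg" values come from the plan itself.

```python
import csv
import io


def build_serum_key(serum_id, serum_passage):
    """Join serum id and passage into a normalized key such as 'SH 2024 EGG'.

    Hypothetical helper: collapses repeated whitespace and uppercases the
    result so that small formatting differences between spreadsheets still
    map to the same key. Empty or missing parts are skipped.
    """
    parts = [serum_id, serum_passage]
    return " ".join(" ".join(p.split()) for p in parts if p and p.strip()).upper()


def load_serum_mapping(tsv_file):
    """Read a TSV with assumed columns 'serum_key' and 'vaccine_strain'."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {
        row["serum_key"].strip().upper(): row["vaccine_strain"].strip()
        for row in reader
    }


def distinct_serum_keys(measurements):
    """Step 1: collect every distinct key seen in parsed measurements."""
    return sorted(
        {build_serum_key(m["serum_id"], m["serum_passage"]) for m in measurements}
    )


def resolve_serum_strain(serum_id, serum_passage, mapping):
    """Step 3: return the vaccine strain for a serum, or None if unmapped."""
    return mapping.get(build_serum_key(serum_id, serum_passage))


# Example mapping content for H3N2 (step 2), inlined here instead of a file.
EXAMPLE_TSV = "serum_key\tvaccine_strain\nSH 2024 EGG\tA/Thailand/8/2022-egg\n"
mapping = load_serum_mapping(io.StringIO(EXAMPLE_TSV))

print(resolve_serum_strain("SH 2024", "EGG", mapping))
```

Returning `None` for an unmapped key, rather than raising, would let a dry run report every missing serum id in one pass before anything is uploaded.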

joverlee521 commented 3 months ago

I'll focus on getting the 2024 human sera pool measurements into fauna before the VCM in September. We will revisit generalizing patterns for ingesting earlier human sera data at a later date, when there's less of a time crunch.

joverlee521 commented 3 months ago

Used this Snakefile to backfill all of the human sera data from 2024 with changes from #160.

I have not ingested any of the earlier data from 2023. I expect that ingesting earlier data will require updates to the regexes and definitely updates to the VACCINE_MAPPING.