reichlab / variant-nowcast-model-dev-retro

Retrospective model dev for UMass models for variant-nowcast-hub

add target data timeseries files #1

Open · nickreich opened 1 week ago

nickreich commented 1 week ago

We want to use a recent "full open" metadata file from this page to create weekly snapshots (versions) of clade counts, starting in late 2022 and running through a few months before the metadata file was downloaded (let's call this time period the "training phase").

One file should be created for each Monday in the "training phase", using the following approach (a code sketch follows the list):

  1. Filter the metadata file to include sequences per the standard variant hub filters (e.g., only Homo sapiens, identifiable locations, ...). This could be done once, before looping through the weeks.
  2. For a given as-of Monday (call it YYYY-MM-DD), filter the metadata file so that date_submitted < "YYYY-MM-DD".
  3. Use the hub's algorithm to determine which clades will be predicted as of this date. We will call these the "modeled clades".
  4. Group the sequence data by the Nextstrain_clade, location, and date columns, and aggregate into clade-location-date counts stored in a column named observation.
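A minimal sketch of that loop in R with dplyr and arrow (the scripts discussed below are R). The metadata object (already filtered per step 1), the target-data output directory, and get_modeled_clades() (a hypothetical stand-in for the hub's clade-selection algorithm) are illustrative assumptions, not the actual implementation:

```r
library(dplyr)
library(arrow)

# Assumes `metadata` is already loaded and filtered per step 1,
# with a Date-typed date_submitted column.
# Monday range taken from the scripts discussed below.
mondays <- seq(as.Date("2022-08-01"), as.Date("2024-08-05"), by = "1 week")

for (i in seq_along(mondays)) {
  as_of <- mondays[i]

  # Step 2: keep only sequences submitted before the as-of Monday.
  snapshot <- metadata |>
    filter(date_submitted < as_of)

  # Step 3: hypothetical stand-in for the hub's clade-selection algorithm.
  modeled_clades <- get_modeled_clades(snapshot, as_of)

  # Step 4: clade-location-date counts in a column named `observation`.
  # (Non-modeled clades are dropped here; whether they should instead be
  # recoded to an "other" bucket is left open.)
  counts <- snapshot |>
    filter(Nextstrain_clade %in% modeled_clades) |>
    count(Nextstrain_clade, location, date, name = "observation")

  write_parquet(counts, file.path("target-data", paste0(as_of, ".parquet")))
}
```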
IsaacMacarthur commented 5 days ago

So I took a swing at this, and I ended up with two versions of a script to create these data files.

The first version, target_data_maker.R, uses write_parquet to write one dataset for each Monday between 2022-08-01 and 2024-08-05, containing all the sequences that were submitted by that Monday. The second version, target_data_maker_v2.R, instead partitions the dataset so that each file contains only the sequences submitted between two consecutive target dates; for example, the 2022-08-08 file would contain the sequences submitted between 2022-08-01 and 2022-08-08, with the first file containing all sequences submitted before 2022-08-01.

The first approach stores more data and stores duplicates, but it makes it easy to load the exact dataset we want for a given time point. The second approach doesn't store any duplicate data points, but it makes it harder to get the dataset for a given date, because you may need to combine many files to reconstruct the full dataset as of that date (see the sketch below).

Is there an approach you prefer of these two? Let me know and I can create a PR with the one you prefer, or both if you want to see both.
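For reference, a hedged sketch of the read path under the second layout: arrow can treat the directory of weekly files as a single dataset and push the date filter down, so "combining many files" becomes one filtered scan. This assumes a target-data-partitioned directory whose files carry sequence-level rows plus a submission_week date column (the Monday closing each window); both names are illustrative, not taken from target_data_maker_v2.R.

```r
library(dplyr)
library(arrow)

as_of <- as.Date("2023-03-06")

# One logical dataset over all weekly partition files; arrow uses the
# filter to skip row groups whose submission_week falls after as_of.
snapshot <- open_dataset("target-data-partitioned") |>
  filter(submission_week <= as_of) |>
  collect()

# From here, the clade-location-date counts follow as in the issue:
counts <- snapshot |>
  count(Nextstrain_clade, location, date, name = "observation")
```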