nickreich opened this issue 1 week ago
So I took a swing at this, and I ended up with two versions of a script to create these data files. The first version, target_data_maker.R, just uses write_parquet to write a dataset for each Monday between 2022-08-01 and 2024-08-05 that contains all the sequences submitted by that Monday. The second version, target_data_maker_v2.R, instead partitions the dataset so that each file contains only the sequences submitted between two consecutive target dates; e.g., the 2022-08-08 file would contain the sequences submitted between 2022-08-01 and 2022-08-08, and the first file would contain all sequences submitted before 2022-08-01.

The first approach stores more data, including many duplicate rows, but it makes it easy to load the exact dataset we want for a given time point. The second approach stores no duplicates, but it makes it harder to get the dataset for a given date, because you may need to combine many files to reconstruct the full dataset for that date.

Is there one of these two approaches you prefer? Let me know and I can create a PR with the one you prefer, or both if you want to see both.
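For concreteness, here is a minimal sketch of the two strategies (the `metadata` object, output directories, and date grid are assumptions based on the description above, not the scripts' actual code):

```r
library(arrow)
library(dplyr)

# Assumed input: `metadata` is a data frame with a Date column `date_submitted`.
mondays <- seq(as.Date("2022-08-01"), as.Date("2024-08-05"), by = "1 week")
dir.create("snapshots", showWarnings = FALSE)
dir.create("partitions", showWarnings = FALSE)

# Version 1 (target_data_maker.R): one cumulative snapshot per Monday.
# Later files duplicate every row already present in earlier files.
for (i in seq_along(mondays)) {
  metadata |>
    filter(date_submitted < mondays[i]) |>
    write_parquet(file.path("snapshots", paste0(mondays[i], ".parquet")))
}

# Version 2 (target_data_maker_v2.R): non-overlapping weekly slices.
# The first file holds everything submitted before the first Monday.
for (i in seq_along(mondays)) {
  slice <- if (i == 1) {
    filter(metadata, date_submitted < mondays[1])
  } else {
    filter(metadata, date_submitted >= mondays[i - 1], date_submitted < mondays[i])
  }
  write_parquet(slice, file.path("partitions", paste0(mondays[i], ".parquet")))
}
```

One mitigating note on version 2: because the slices share a schema, `arrow::open_dataset("partitions")` can treat the directory as a single dataset, so reconstructing the view as of a given Monday becomes a filter on `date_submitted` rather than a manual row-bind of many files.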
We want to use a recent "full open" metadata file from this page to create snapshots/versions of clade counts for each week, starting in late 2022 and running through a few months before the metadata file was downloaded (let's call this time period the "training phase").
A file should be created for each Monday in the "training phase". A file is created using the following approach:
1. For a given Monday, `YYYY-MM-DD`, filter the metadata file so that `date_submitted < "YYYY-MM-DD"`.
2. Select the `Nextstrain_clade`, `location`, and `date` columns in the metadata file, and aggregate into clade-location-date counts with a column name of `observation`.
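A minimal dplyr sketch of those two steps, assuming the metadata file has already been read into a data frame named `metadata` (the function name, output directory, and parquet output format are illustrative assumptions):

```r
library(arrow)
library(dplyr)

# Build the clade-count file for one Monday (`snapshot_date`, a Date scalar).
make_snapshot <- function(metadata, snapshot_date, out_dir = "target-data") {
  metadata |>
    # Step 1: keep only sequences submitted strictly before the Monday.
    filter(date_submitted < snapshot_date) |>
    # Step 2: aggregate the Nextstrain_clade/location/date columns into
    # clade-location-date counts, with the count column named `observation`.
    count(Nextstrain_clade, location, date, name = "observation") |>
    write_parquet(file.path(out_dir, paste0(snapshot_date, ".parquet")))
}

# Example: make_snapshot(metadata, as.Date("2022-08-08"))
```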