Closed — elray1 closed this 4 months ago
The `--updated-after` filter was just a way to ensure a small subset of the data for faster iteration when testing everything.

In a live version of this, will we need to use any dates at all? I.e., would it hurt to download the entire set of COVID-19 (Homo sapiens) genome data each time the pipeline runs? I think I saw in the `datasets` CLI docs that they maintain a cache of that, and I'd like to explore it.
Exploring the cached data you mentioned makes sense! My current understanding is that most of the time we're running this, we will want to get counts based on sequences that were collected in roughly a ~2 month time span within the last ~6 months of the available data. I could imagine that using the cached data could be faster because it's cached, or that pulling a subset could be faster because it's a substantially smaller amount of data than the full data set. Or it might not matter, because we could just set this thing up to run at 2am. In any case, the most important thing is for it to be correct :)
This change was merged into `variant-data-pipeline` as part of https://github.com/reichlab/variant-nowcast-hub/pull/15
Looking at this line: https://github.com/reichlab/variant-nowcast-hub/blob/57e41557dc41b92ca1b937be8e5cf6325bc819a2/data-pipeline/src/covid_variant_pipeline/assign_clades.py#L39
I am not sure exactly what the `--updated-after` filter does. It's probably worth double-checking that it reliably gets us everything we need. It may be that using `--released-after` will get us a safer/more inclusive data set, one that definitely includes all sequences with a collection date on or after the specified date.
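For reference, a sketch of what the two filters might look like on the NCBI `datasets` command line. This assumes the `datasets download virus genome taxon` subcommand and the `MM/DD/YYYY` date format; the exact subcommand and date format should be verified against the current `datasets` CLI docs before relying on it:

```shell
# Current approach: filter on when records were last *updated*.
# Risk: a sequence updated before the cutoff but collected recently
# could be excluded, depending on how NCBI defines "updated".
datasets download virus genome taxon SARS-CoV-2 \
  --host human \
  --updated-after 06/01/2024

# Proposed alternative: filter on when records were *released*.
# This should be more inclusive, since every sequence collected
# recently must have been released after its collection date.
datasets download virus genome taxon SARS-CoV-2 \
  --host human \
  --released-after 06/01/2024
```

A simple sanity check would be to run both on the same date and diff the accession lists; if `--released-after` returns a superset, that supports switching to it.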