Closed — elray1 closed this 4 months ago
The `--updated-after` filter was just a way to ensure a small subset of the data for faster iteration when testing everything.

In a live version of this, will we need to use any dates at all? I.e., would it hurt to download the entire set of COVID-19 (Homo sapiens) genome data each time the pipeline runs? I think I saw in the `datasets` CLI docs that they maintain a cache of that, and I'd like to explore it.
Exploring the cached data you mentioned makes sense! My current understanding is that most of the time we're running this, we will want to get counts based on sequences that were collected in roughly a ~2 month time span within the last ~6 months of the available data. I could imagine that using the cached data could be faster because it's cached, or that pulling a subset could be faster because it's a substantially smaller amount of data than the full data set. Or it might not matter, because we could just set this thing up to run at 2am. In any case, the most important thing is for it to be correct :)
This change was merged into `variant-data-pipeline` as part of https://github.com/reichlab/variant-nowcast-hub/pull/15
Looking at this line: https://github.com/reichlab/variant-nowcast-hub/blob/57e41557dc41b92ca1b937be8e5cf6325bc819a2/data-pipeline/src/covid_variant_pipeline/assign_clades.py#L39
I am not sure exactly what the `--updated-after` filter does. It's probably worth double-checking that it reliably gets us everything we need. It may be that using `--released-after` will get us a safer/more inclusive data set, one that definitely includes all sequences with a collection date on or after the specified date.
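For reference, a sketch of what the two filters might look like on the NCBI `datasets` command line. This assumes the `datasets download virus genome taxon` subcommand and the `MM/DD/YYYY` date format; the exact subcommand and date format should be verified against the current `datasets` CLI docs before relying on it:

```shell
# Current approach: filter on when records were last *updated*.
# Risk: a sequence updated before the cutoff but collected recently
# could be excluded, depending on how NCBI defines "updated".
datasets download virus genome taxon SARS-CoV-2 \
  --host human \
  --updated-after 06/01/2024

# Proposed alternative: filter on when records were *released*.
# This should be more inclusive, since every sequence collected
# recently must have been released after its collection date.
datasets download virus genome taxon SARS-CoV-2 \
  --host human \
  --released-after 06/01/2024
```

A simple sanity check would be to run both on the same date and diff the accession lists; if `--released-after` returns a superset, that supports switching to it.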