reichlab / variant-nowcast-hub

A repository to store COVID-19 variant nowcasts collected as a modeling hub.
MIT License

Add support to specify the start and end of a date range for sequence collection dates #8

Open elray1 opened 2 months ago

elray1 commented 2 months ago

We will typically only need clade assignments (summarized to counts per clade) for samples that were collected within a particular date range. We should be able to specify those dates as part of a call to assign_clades. This is related to the discussion in #3: when we pull the sequence data, we need to be sure we get everything that has a collection date within the specified range.

For more specificity, here's a suggestion that I'm not at all committed to: we could introduce command line arguments seq_start_date and seq_end_date and keep anything with seq_start_date <= collection_date <= seq_end_date.
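A minimal sketch of that suggestion, in Python. The flag names follow the seq_start_date and seq_end_date proposal above; the helper names, the `collection_date` key, and the ISO `YYYY-MM-DD` date format are assumptions for illustration, not the existing assign_clades interface.

```python
import argparse
from datetime import date


def parse_args(argv=None):
    """Parse the proposed (hypothetical) date-range arguments."""
    parser = argparse.ArgumentParser(
        description="Filter sequences by collection date (illustrative sketch)."
    )
    # date.fromisoformat rejects malformed dates, so argparse reports them.
    parser.add_argument("--seq-start-date", type=date.fromisoformat, required=True)
    parser.add_argument("--seq-end-date", type=date.fromisoformat, required=True)
    args = parser.parse_args(argv)
    if args.seq_start_date > args.seq_end_date:
        parser.error("--seq-start-date must not be after --seq-end-date")
    return args


def filter_by_collection_date(records, start, end):
    """Keep records with start <= collection_date <= end (both ends inclusive).

    Assumes each record is a dict with a 'collection_date' string; records
    with a missing or unparseable date are dropped rather than guessed at.
    """
    kept = []
    for record in records:
        try:
            collected = date.fromisoformat(record["collection_date"])
        except (KeyError, TypeError, ValueError):
            continue
        if start <= collected <= end:
            kept.append(record)
    return kept
```

For example, with `--seq-start-date 2024-05-01 --seq-end-date 2024-05-31`, a sample collected on 2024-05-01 or 2024-05-31 is kept, while one collected on 2024-04-30 or 2024-06-01 is dropped. Whether samples with incomplete collection dates should be dropped or handled some other way is an open design choice.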

bsweger commented 3 weeks ago

@elray1 quick clarification: how would these start and end dates interplay with the --released-after parameter we're already using when getting sequence data via the NCBI API?

Is the starting range of the sequence collection date the same date we'd use as --released-after on the API call, or a different parameter altogether?

elray1 commented 3 weeks ago

I believe it should at least be closely related (maybe released-after = seq_start_date - 1?), but I'm not sure. Do you know of a place where these dates are documented?

bsweger commented 3 weeks ago

I couldn't find anything definitive on the relationship between the API's released-after parameter and the collection-date in the metadata.

collection-date definition from the metadata schema:

The collection date for the sample from which the viral nucleotide sequence was derived

reference to the "released after" property of NCBI's virus dataset downloads:

genomes released after

Let's chat about how to get a definitive answer. In the meantime, I'll use your suggestion of released-after = seq_start_date - 1 to get started.
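
As a placeholder until that is settled, the interim rule can be sketched in Python as below. The function name is hypothetical, and the ISO `YYYY-MM-DD` output format is an assumption about what the download step accepts. The rule only captures every sample in the collection range if a sequence is never released before its sample's collection date, which is exactly the relationship still to be confirmed.

```python
from datetime import date, timedelta


def released_after_for(seq_start_date: date) -> str:
    """Return the interim released-after value: one day before seq_start_date.

    Assumption: the download step accepts an ISO-formatted date string.
    """
    return (seq_start_date - timedelta(days=1)).isoformat()
```

Using `timedelta` rather than manual day arithmetic keeps month and leap-year boundaries correct, for example a seq_start_date of 2024-03-01 gives a released-after value of 2024-02-29.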