Open elray1 opened 4 months ago
@elray1 quick clarification: how would these start and end dates interact with the --released-after parameter we're already using when getting sequence data via the NCBI API? Is the start of the sequence collection date range the same date we'd pass as --released-after on the API call, or a different parameter altogether?
I believe they should at least be closely related (maybe released-after = seq_start_date - 1?), but I'm not sure. Do you know of a place where these dates are documented?
I couldn't find anything definitive on the relationship between the API's released-after parameter and the collection-date field in the metadata.
collection-date definition from the metadata schema: "The collection date for the sample from which the viral nucleotide sequence was derived"
Reference to the "released after" property of NCBI's virus dataset downloads: "genomes released after"
Let's chat about how to get a definitive answer. In the meantime, I'll use your released-after = seq_start_date - 1 suggestion to get started.
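For reference, the placeholder rule could be sketched like this in Python. The helper name released_after_for is hypothetical, and the ISO YYYY-MM-DD output format is an assumption about what the API call accepts. The rule itself is only the working guess from this thread, not documented NCBI behavior. It should be a safe lower bound as long as a sequence is never released before its sample was collected.

```python
from datetime import date, timedelta


def released_after_for(seq_start_date: date) -> str:
    """Return a released-after value one day before seq_start_date.

    Hypothetical helper: released-after = seq_start_date - 1 is the
    placeholder rule discussed in this thread, not a documented relationship.
    """
    return (seq_start_date - timedelta(days=1)).isoformat()


print(released_after_for(date(2024, 3, 1)))  # 2024-02-29
```

Records pulled this way would still need to be filtered on collection date afterwards, since release date and collection date are different fields.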
We will typically just need to get clade assignments (and summarize them into counts of clade assignments) for samples that were collected within a particular date range. We should be able to specify those dates as part of a call to assign_clades. This is related to the discussion in reichlab/variant-nowcast-hub#3, in that we'll need to be sure that when we pull the sequence data, we get everything that has a collection date within the specified range.
For more specificity, here's a suggestion that I'm not at all committed to: we could introduce command line arguments seq_start_date and seq_end_date and keep anything with seq_start_date <= collection_date <= seq_end_date.
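A rough sketch of what that filter could look like in Python, just to make the suggestion concrete. Everything here is an assumption for illustration: the flag spellings --seq-start-date and --seq-end-date, the filter_by_collection_date helper, and the collection_date key on each record. It is not the existing assign_clades interface.

```python
import argparse
from datetime import date, datetime


def parse_args(argv=None):
    """Parse the two hypothetical date-range arguments (ISO YYYY-MM-DD)."""
    parser = argparse.ArgumentParser(description="Filter sequences by collection date (sketch).")
    parser.add_argument("--seq-start-date", type=date.fromisoformat, required=True)
    parser.add_argument("--seq-end-date", type=date.fromisoformat, required=True)
    return parser.parse_args(argv)


def filter_by_collection_date(records, seq_start_date, seq_end_date):
    """Keep records with seq_start_date <= collection_date <= seq_end_date.

    Both bounds are inclusive. Records whose collection date is missing or
    not a full YYYY-MM-DD date are dropped in this sketch.
    """
    kept = []
    for record in records:
        raw = record.get("collection_date")
        try:
            collected = datetime.strptime(raw, "%Y-%m-%d").date()
        except (TypeError, ValueError):
            continue  # missing or partial date; skipping is a design choice
        if seq_start_date <= collected <= seq_end_date:
            kept.append(record)
    return kept


if __name__ == "__main__":
    # Made-up records, only to exercise the filter.
    demo = [
        {"id": "a", "collection_date": "2024-02-29"},
        {"id": "b", "collection_date": "2024-03-01"},
        {"id": "c", "collection_date": "2024-03-31"},
    ]
    args = parse_args(["--seq-start-date", "2024-03-01", "--seq-end-date", "2024-03-31"])
    kept = filter_by_collection_date(demo, args.seq_start_date, args.seq_end_date)
    print([r["id"] for r in kept])
```

Whether to drop or keep samples with incomplete collection dates is an open question that this sketch does not try to settle.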