nextstrain / forecasts-ncov

SARS-CoV-2 variant growth rates and frequency forecasts
https://nextstrain.org/sars-cov-2/forecasts/

Forecast Automation: Provision counts files #1

Closed joverlee521 closed 2 years ago

joverlee521 commented 2 years ago

(Original issue copied over for public view)

Here's an outline of my plan:

A couple questions I have:

  1. Should these full count files continue to exclude rows with 0 cases/sequences? In past experience, it's better to be explicit about 0 counts to differentiate 0 vs NA.
  2. Do we want Slack notifications for the automated count updates? (I think yes since we don't have a good monitoring system set up yet.)
  3. If yes to Slack notifications, would #forecasting-automation be the appropriate channel for update messages?

joverlee521 commented 2 years ago

Comment from @trvrb:

> Should these full count files continue to exclude rows with 0 cases/sequences? In past experience, it's better to be explicit about 0 counts to differentiate 0 vs NA.

I like dropping 0s in this case as otherwise the file sizes are much larger than they need to be. There are a bunch of short lived variants.
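
If the files keep dropping 0 rows, a consumer that needs explicit zeros could rebuild them after download. The following is a minimal sketch of that idea, not the project's implementation; the column names (`location`, `clade`, `date`, `sequences`) and the sample rows are assumptions for illustration.

```python
from itertools import product


def fill_missing_zeros(rows, key_fields=("location", "clade", "date"),
                       count_field="sequences"):
    """Return one row per combination of observed key values.

    Combinations absent from `rows` get a count of 0.
    Column names are assumed, not taken from the real files.
    """
    observed = {
        tuple(row[field] for field in key_fields): int(row[count_field])
        for row in rows
    }
    # Distinct values seen for each key column, in sorted order.
    values = [sorted({row[field] for row in rows}) for field in key_fields]

    filled = []
    for key in product(*values):
        row = dict(zip(key_fields, key))
        row[count_field] = observed.get(key, 0)
        filled.append(row)
    return filled


# Hypothetical sparse input: zero-count rows were dropped upstream.
sparse = [
    {"location": "USA", "clade": "21K", "date": "2022-01-01", "sequences": "5"},
    {"location": "USA", "clade": "21L", "date": "2022-01-02", "sequences": "3"},
]
dense = fill_missing_zeros(sparse)
```

Note that this treats every absent combination as 0, so it cannot recover the 0 vs NA distinction raised in the question; that is the trade-off for the smaller files.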

> Do we want Slack notifications for the automated count updates? (I think yes since we don't have a good monitoring system set up yet.)

I'd make a new channel for this.

Note that I added an additional criterion in count provisioning: https://github.com/blab/rt-from-frequency-dynamics/tree/master/data/variants-us

I drop samples that have QC_overall_status listed as bad.
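
As a rough illustration of that QC criterion (a sketch only, not the code in the linked repository), the filter could look like the following. `QC_overall_status` and the value `bad` come from the comment above; the other columns, the other status values, and the sample rows are hypothetical.

```python
import csv
import io


def drop_bad_qc(rows, status_field="QC_overall_status", bad_value="bad"):
    """Yield only the rows whose QC status is not `bad_value`."""
    for row in rows:
        if row.get(status_field) != bad_value:
            yield row


# Hypothetical metadata excerpt; only QC_overall_status is named in the thread.
metadata_tsv = (
    "strain\tclade\tQC_overall_status\n"
    "A\t21K\tgood\n"
    "B\t21K\tbad\n"
    "C\t21L\tmediocre\n"
)
reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
kept = list(drop_bad_qc(reader))
```

Rows with a missing status are kept in this sketch; whether they should be counted is a separate decision.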


We should also be clear with data.nextstrain.org/files locations for open data and s3://nextstrain-ncov-private/ locations for GISAID data so that we have a hopefully stable pseudo-API (like we've been trying to do with files/).

We need a stable system for global targets that define location as country-level vs country targets that define location as division-level. Initial datasets would be:

joverlee521 commented 2 years ago

> We should also be clear with data.nextstrain.org/files locations for open data and s3://nextstrain-ncov-private/ locations for GISAID data so that we have a hopefully stable pseudo-API (like we've been trying to do with files/).

Currently files in data.nextstrain.org/files/ncov/open/ match files in s3://nextstrain-ncov-private/. To keep this consistent, I propose the following:

# Public
data.nextstrain.org/files/ncov/open/counts/global/case-counts.tsv.gz
data.nextstrain.org/files/ncov/open/counts/global/clade-counts.tsv.gz
data.nextstrain.org/files/ncov/open/counts/usa/case-counts.tsv.gz
data.nextstrain.org/files/ncov/open/counts/usa/clade-counts.tsv.gz

# Private
s3://nextstrain-ncov-private/counts/global/clade-counts.tsv.gz
s3://nextstrain-ncov-private/counts/usa/clade-counts.tsv.gz
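
To show what a stable pseudo-API would mean for consumers, here is a minimal sketch of reading one of the public files. The URL is only the location proposed above and may not be where the files end up; the column names in the sample are assumptions.

```python
import csv
import gzip
import io
import urllib.request

# Proposed location from this comment; an assumption, not a confirmed endpoint.
PROPOSED_URL = (
    "https://data.nextstrain.org/files/ncov/open/counts/global/clade-counts.tsv.gz"
)


def parse_counts(gz_bytes):
    """Decompress a gzipped TSV and return its rows as dictionaries."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_counts(url=PROPOSED_URL):
    """Download a counts file and parse it (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_counts(response.read())


# Offline check with a hypothetical one-row file.
sample = gzip.compress(b"location\tclade\tdate\tsequences\nUSA\t21K\t2022-01-01\t5\n")
rows = parse_counts(sample)
```

Keeping the download and the parsing separate means the parsing step works the same whether the bytes come from the public URL or from the private S3 bucket.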

tsibley commented 2 years ago

I'd suggest not putting this data under https://data.nextstrain.org/files/… as the current usage and intent of that prefix is for pathogen-build related files that directly correspond to https://nextstrain.org/… paths. These counts are a separate thing, right? (IIUC, some of the counts are downstream of the same ncov data, but not part of the actual input/build?) I'm missing a lot of context here but what about https://data.nextstrain.org/counts/…?

curl 'https://data.cdc.gov/resource/9mfq-cb36.csv?…

I'd recognize a Socrata URL anywhere! If you haven't seen them yet, there are lots of dev/API docs at https://dev.socrata.com/. Socrata (which looks like it has now been acquired by "Tyler Technologies"??) for years led big pushes for public orgs at all levels to use the Socrata data portal and run it at data.X domains.

tsibley commented 2 years ago

@joverlee521 and I chatted about this a bit in our 1:1 today. The takeaway is that this question relates back to the larger questions around the structure/organization of data.nextstrain.org/files/… and how it relates (or doesn't) to nextstrain.org/… URLs, which in turn intersect with larger questions about nextstrain remote download/upload behaviour. We'll fold discussion of this counts data specifically into a (pending) larger discussion of those issues, as they also arose recently with https://github.com/nextstrain/ncov/pull/910.

huddlej commented 2 years ago

@joverlee521 Can we close this given that #2 has been merged?

joverlee521 commented 2 years ago

Closed by #2