Forecast Automation: Provision counts files

joverlee521 commented 2 years ago

(Original issue copied over for public view)

Here's an outline of my plan:

Create a new nextstrain/counts repo for this work
Port over the Python script from John's PR in blab/rt-from-frequency-dynamics
- Remove date, country, clade cutoffs to get full counts (cutoffs would be added to the modeling scripts)
- Separate case counts and clade counts scripts. (Side note: there is an API for the CDC state case counts data that allows us to select and filter data before download, e.g. we can select specific columns and filter out 0 counts)
```
curl 'https://data.cdc.gov/resource/9mfq-cb36.csv?$select=submission_date,state,new_case&$where=new_case>0'
```
Set up GitHub actions for
- scheduled daily case counts
- ncov-ingest triggered GISAID clade counts
- ncov-ingest triggered Open clade counts
Update ncov-ingest to trigger clade counts actions once updated metadata.tsv.gz has been uploaded to S3
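
One way the ncov-ingest trigger in the last step could work is GitHub's `repository_dispatch` event, sent after the upload to S3 finishes. This is a minimal sketch that assumes that mechanism, which the plan above does not specify. The event type `clade-counts`, the target repo name, and the `source` payload field are hypothetical.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(repo, event_type, token, payload=None):
    """Build (but do not send) a repository_dispatch request for `repo`."""
    body = {"event_type": event_type, "client_payload": payload or {}}
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{repo}/dispatches",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical repo, event type, and payload, for illustration only.
request = build_dispatch_request(
    repo="nextstrain/counts",
    event_type="clade-counts",
    token="<token>",
    payload={"source": "gisaid"},
)
# urllib.request.urlopen(request) would actually send it.
```

The receiving workflow would then listen with `on: repository_dispatch` and filter on the event type, so the GISAID and Open clade counts could each get their own trigger.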

A couple questions I have:

Should these full count files continue to exclude rows with 0 cases/sequences? In past experience, it's better to be explicit about 0 counts to differentiate 0 vs NA.
Do we want Slack notifications for the automated count updates? (I think yes since we don't have a good monitoring system set up yet.)
If yes to Slack notifications, would #forecasting-automation be the appropriate channel for update messages?

joverlee521 commented 2 years ago

Comment from @trvrb:

> Should these full count files continue to exclude rows with 0 cases/sequences? In past experience, it's better to be explicit about 0 counts to differentiate 0 vs NA.

I like dropping 0s in this case as otherwise the file sizes are much larger than they need to be. There are a bunch of short lived variants.
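
If the 0 rows are dropped, a consumer that needs explicit zeros can rebuild them by treating every missing date/location/clade combination as 0. A minimal sketch of that expansion follows. The column names (`date`, `location`, `clade`, `sequences`) are assumptions, not the actual file schema.

```python
from itertools import product


def fill_zero_counts(rows):
    """Expand sparse count rows so that missing combinations become 0.

    Only dates, locations, and clades that appear somewhere in `rows`
    are considered; anything never observed stays absent.
    """
    counts = {(r["date"], r["location"], r["clade"]): r["sequences"] for r in rows}
    dates = sorted({key[0] for key in counts})
    locations = sorted({key[1] for key in counts})
    clades = sorted({key[2] for key in counts})
    return [
        {"date": d, "location": loc, "clade": c, "sequences": counts.get((d, loc, c), 0)}
        for d, loc, c in product(dates, locations, clades)
    ]


# Made-up rows, for illustration only.
sparse = [
    {"date": "2022-01-01", "location": "USA", "clade": "21K", "sequences": 5},
    {"date": "2022-01-02", "location": "USA", "clade": "21L", "sequences": 2},
]
dense = fill_zero_counts(sparse)
```

The dense form has one row per date × location × clade, which is why short-lived variants inflate the file when zeros are kept.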

> Do we want Slack notifications for the automated count updates? (I think yes since we don't have a good monitoring system set up yet.)

I'd make a new channel for this.

Note that I added an additional criterion in count provisioning: https://github.com/blab/rt-from-frequency-dynamics/tree/master/data/variants-us

I drop samples that have QC_overall_status listed as bad.
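
A minimal sketch of that filter combined with the clade counting, using only the standard library. `QC_overall_status` is the column named above. The other column names (`date`, `country`, `Nextstrain_clade`) are assumptions about the metadata file, and this is not the script from the linked repo.

```python
import csv
import io
from collections import Counter


def count_clades(metadata_tsv, location_column="country"):
    """Count sequences per (date, location, clade), skipping bad-QC samples."""
    counts = Counter()
    for row in csv.DictReader(metadata_tsv, delimiter="\t"):
        if row["QC_overall_status"] == "bad":
            continue  # drop samples that failed QC
        counts[(row["date"], row[location_column], row["Nextstrain_clade"])] += 1
    return counts


# Made-up metadata rows, for illustration only.
example = io.StringIO(
    "strain\tdate\tcountry\tNextstrain_clade\tQC_overall_status\n"
    "A\t2022-01-01\tUSA\t21K\tgood\n"
    "B\t2022-01-01\tUSA\t21K\tbad\n"
    "C\t2022-01-01\tUSA\t21L\tmediocre\n"
)
counts = count_clades(example)
```

Because a `Counter` only holds combinations that were seen at least once, the output naturally contains no 0 rows.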

We should also be clear with data.nextstrain.org/files locations for open data and s3://nextstrain-ncov-private/ locations for GISAID data so that we have a hopefully stable pseudo-API (like we've been trying to do with files/).

We need a stable system for global targets that define location as country-level vs country targets that define location as division-level. Initial datasets would be:

GISAID global
GISAID US
open global
open US

joverlee521 commented 2 years ago

> We should also be clear with data.nextstrain.org/files locations for open data and s3://nextstrain-ncov-private/ locations for GISAID data so that we have a hopefully stable pseudo-API (like we've been trying to do with files/).

Currently files in data.nextstrain.org/files/ncov/open/ match files in s3://nextstrain-ncov-private/. To keep this consistent, I propose the following:

# Public
data.nextstrain.org/files/ncov/open/counts/global/case-counts.tsv.gz
data.nextstrain.org/files/ncov/open/counts/global/clade-counts.tsv.gz
data.nextstrain.org/files/ncov/open/counts/usa/case-counts.tsv.gz
data.nextstrain.org/files/ncov/open/counts/usa/clade-counts.tsv.gz

# Private
s3://nextstrain-ncov-private/counts/global/clade-counts.tsv.gz
s3://nextstrain-ncov-private/counts/usa/clade-counts.tsv.gz
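
For scripts that read or write these files, the proposed layout could be generated rather than hard-coded. A minimal sketch of the proposal above, not a settled location scheme; the function name and the `open`/`gisaid` source labels are hypothetical, and rejecting private case counts is my reading of the list, which only shows clade counts under the private bucket.

```python
PUBLIC_PREFIX = "https://data.nextstrain.org/files/ncov/open/counts"
PRIVATE_PREFIX = "s3://nextstrain-ncov-private/counts"


def counts_location(source, target, kind):
    """Return the proposed location for a counts file.

    source: "open" or "gisaid"; target: "global" or "usa";
    kind: "case-counts" or "clade-counts".
    """
    if target not in ("global", "usa"):
        raise ValueError(f"unknown target: {target!r}")
    if kind not in ("case-counts", "clade-counts"):
        raise ValueError(f"unknown kind: {kind!r}")
    if source == "open":
        prefix = PUBLIC_PREFIX
    elif source == "gisaid":
        if kind != "clade-counts":
            raise ValueError("only clade counts are listed for the private bucket")
        prefix = PRIVATE_PREFIX
    else:
        raise ValueError(f"unknown source: {source!r}")
    return f"{prefix}/{target}/{kind}.tsv.gz"


public_path = counts_location("open", "usa", "case-counts")
private_path = counts_location("gisaid", "global", "clade-counts")
```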

tsibley commented 2 years ago

I'd suggest not putting this data under https://data.nextstrain.org/files/… as the current usage and intent of that prefix is for pathogen-build related files that directly correspond to https://nextstrain.org/… paths. These counts are a separate thing, right? (IIUC, some of the counts are downstream of the same ncov data, but not part of the actual input/build?) I'm missing a lot of context here but what about https://data.nextstrain.org/counts/…?

> curl 'https://data.cdc.gov/resource/9mfq-cb36.csv?…

I'd recognize a Socrata URL anywhere! If you haven't seen them yet, there are lots of dev/API docs at https://dev.socrata.com/. Socrata (looks like it's now been acquired by "Tyler Technologies"??) for years led big pushes for public orgs at all levels to use the Socrata data portal and run it at data.X domains.
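
For reference, the `$select` and `$where` query from the curl command earlier in the thread can be built in Python with just the standard library. A minimal sketch: the dataset ID and column names come from that command, while the function name is hypothetical and nothing here is fetched.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CDC_CASES_CSV = "https://data.cdc.gov/resource/9mfq-cb36.csv"


def build_case_counts_url(columns, where=None):
    """Build a Socrata query URL selecting `columns`, optionally filtered."""
    params = {"$select": ",".join(columns)}
    if where:
        params["$where"] = where
    # Keep "$" and "," readable; everything else unsafe is percent-encoded.
    return f"{CDC_CASES_CSV}?{urlencode(params, safe='$,')}"


url = build_case_counts_url(
    columns=["submission_date", "state", "new_case"],
    where="new_case>0",
)
query = parse_qs(urlsplit(url).query)
```

Percent-encoding the `>` also avoids having to worry about shell quoting if the URL is later passed to curl.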

tsibley commented 2 years ago

@joverlee521 and I chatted about this a bit in our 1:1 today. The takeaway is that this question relates back to the larger questions around the structure/organization of data.nextstrain.org/files/… and how it relates (or doesn't) to nextstrain.org/… URLs, which in turn intersects with larger questions about nextstrain remote download/upload behaviour. We'll fold discussion of this counts data specifically into a (pending) larger discussion of those issues, as they also arose recently with https://github.com/nextstrain/ncov/pull/910.

huddlej commented 2 years ago

@joverlee521 Can we close this given that #2 has been merged?

joverlee521 commented 2 years ago

Closed by #2

nextstrain / forecasts-ncov

Forecast Automation: Provision counts files #1