Closed joverlee521 closed 2 years ago
Comment from @trvrb:
Should these full count files continue to exclude rows with 0 cases/sequences? In past experience, it's better to be explicit about 0 counts to differentiate 0 vs NA.
I like dropping 0s in this case, as otherwise the file sizes are much larger than they need to be. There are a bunch of short-lived variants.
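A minimal sketch of how a consumer could restore explicit zeros from a file that drops them, so the files stay small. The column names (location, date, clade, sequences) and the sample rows are assumptions for illustration, not the confirmed schema of the count files.

```python
from itertools import product


def fill_zero_counts(rows, key_fields=("location", "date", "clade"), count_field="sequences"):
    """Expand a sparse count table so absent key combinations become explicit 0s.

    Column names are hypothetical; adjust them to the real file header.
    """
    # Map each observed key combination to its integer count.
    observed = {tuple(r[k] for k in key_fields): int(r[count_field]) for r in rows}
    # Distinct observed values per key column, sorted for stable output.
    levels = [sorted({r[k] for r in rows}) for k in key_fields]

    dense = []
    for combo in product(*levels):
        record = dict(zip(key_fields, combo))
        record[count_field] = observed.get(combo, 0)  # absent row -> 0
        dense.append(record)
    return dense


# Made-up rows in the assumed layout.
sparse = [
    {"location": "USA", "date": "2022-01-01", "clade": "21K", "sequences": "5"},
    {"location": "USA", "date": "2022-01-02", "clade": "21L", "sequences": "2"},
]
dense = fill_zero_counts(sparse)
```

This only works if a missing row always means 0. A date on which a location reported nothing at all would also be filled with 0s here, so this sketch on its own does not resolve the 0 vs NA ambiguity.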
Do we want Slack notifications for the automated count updates? (I think yes since we don't have a good monitoring system set up yet.)
I'd make a new channel for this.
Note that I added an additional criterion in count provisioning: https://github.com/blab/rt-from-frequency-dynamics/tree/master/data/variants-us
I drop samples that have QC_overall_status listed as bad.
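A rough sketch of that QC filter, assuming the metadata is a TSV with a QC_overall_status column. The strain column, the sample rows, and the choice to keep rows with a missing status are assumptions here, not the actual provisioning code.

```python
import csv
import io


def drop_bad_qc(rows, status_field="QC_overall_status", bad_value="bad"):
    """Yield only rows whose QC status is not the rejected value.

    Rows with a missing or empty status are kept; that is an assumption
    of this sketch, not a documented rule.
    """
    for row in rows:
        status = (row.get(status_field) or "").strip().lower()
        if status != bad_value:
            yield row


# Made-up metadata in the assumed layout.
sample_tsv = "strain\tQC_overall_status\nA\tgood\nB\tbad\nC\tmediocre\n"
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
kept = [row["strain"] for row in drop_bad_qc(reader)]
```

For the real metadata.tsv.gz, the same generator could be fed from a csv.DictReader over gzip.open(path, "rt").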
We should also be clear with data.nextstrain.org/files locations for open data and s3://nextstrain-ncov-private/ locations for GISAID data so that we have a hopefully stable pseudo-API (like we've been trying to do with files/).
We need a stable system for global targets that define location as country-level vs country targets that define location as division-level. Initial datasets would be:
We should also be clear with data.nextstrain.org/files locations for open data and s3://nextstrain-ncov-private/ locations for GISAID data so that we have a hopefully stable pseudo-API (like we've been trying to do with files/).
Currently, files in data.nextstrain.org/files/ncov/open/ match files in s3://nextstrain-ncov-private/.
To keep this consistent, I propose the following:
# Public
data.nextstrain.org/files/ncov/open/counts/global/case-counts.tsv.gz
data.nextstrain.org/files/ncov/open/counts/global/clade-counts.tsv.gz
data.nextstrain.org/files/ncov/open/counts/usa/case-counts.tsv.gz
data.nextstrain.org/files/ncov/open/counts/usa/clade-counts.tsv.gz
# Private
s3://nextstrain-ncov-private/counts/global/clade-counts.tsv.gz
s3://nextstrain-ncov-private/counts/usa/clade-counts.tsv.gz
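To show how the proposed layout could work as a pseudo-API, here is a small sketch that builds a location from its parts. The function name and the "open"/"gisaid" source labels are hypothetical. It mirrors only the paths listed above, where case counts appear under the public prefix only.

```python
# Prefixes and file names taken from the proposal above.
PREFIXES = {
    "open": "data.nextstrain.org/files/ncov/open/counts",
    "gisaid": "s3://nextstrain-ncov-private/counts",
}
ALLOWED_KINDS = {
    "open": {"case-counts", "clade-counts"},
    "gisaid": {"clade-counts"},  # the proposal lists no private case counts
}
REGIONS = {"global", "usa"}


def counts_location(source, region, kind):
    """Return the proposed location for one counts file (hypothetical helper)."""
    if source not in PREFIXES:
        raise ValueError(f"unknown source: {source!r}")
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    if kind not in ALLOWED_KINDS[source]:
        raise ValueError(f"{kind!r} is not part of the proposal for {source!r}")
    return f"{PREFIXES[source]}/{region}/{kind}.tsv.gz"


public_usa_cases = counts_location("open", "usa", "case-counts")
private_global_clades = counts_location("gisaid", "global", "clade-counts")
```

Keeping the prefixes in one table means a later change to the prefix would only touch PREFIXES.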
I'd suggest not putting this data under https://data.nextstrain.org/files/… as the current usage and intent of that prefix is for pathogen-build related files that directly correspond to https://nextstrain.org/… paths. These counts are a separate thing, right? (IIUC, some of the counts are downstream of the same ncov data, but not part of the actual input/build?) I'm missing a lot of context here but what about https://data.nextstrain.org/counts/…?
curl 'https://data.cdc.gov/resource/9mfq-cb36.csv?…
I'd recognize a Socrata URL anywhere! If you haven't seen them yet, there are lots of dev/API docs at https://dev.socrata.com/. Socrata (looks like it's now been acquired by "Tyler Technologies"??) for years led big pushes for public orgs at all levels to use the Socrata data portal and run it at data.X domains.
@joverlee521 and I chatted about this a bit in our 1:1 today. The takeaway is that this question relates back to the larger questions around the structure/organization of data.nextstrain.org/files/… and how it relates (or doesn't) to nextstrain.org/… URLs, which also intersects with larger questions of nextstrain remote download/upload behaviour. We'll fold discussion of this counts data specifically into a (pending) larger discussion of those issues, as they also arose recently with https://github.com/nextstrain/ncov/pull/910.
@joverlee521 Can we close this given that #2 has been merged?
Closed by #2
(Original issue copied over for public view)
Here's an outline of my plan:
nextstrain/counts repo for this work
ncov-ingest triggered GISAID clade counts
ncov-ingest triggered Open clade counts
ncov-ingest to trigger clade counts actions once updated metadata.tsv.gz has been uploaded to S3
A couple questions I have: