This is a pretty major PR, and most of the detail is in the commit messages and code comments. A high-level overview of the functionality introduced:
For our core & staging sources (s3://nextstrain-data and s3://nextstrain-staging, respectively) we create daily inventories on S3, which are stored in the private bucket s3://nextstrain-inventories. These are essentially free to generate (<$1/year/bucket).
Our server pulls these inventories down daily and maintains a set of datasets and files, including past versions of each where available.
A new API is introduced to access this data in a cross-source fashion, including filtering.
The /pathogens page shows the core datasets using the UI I recently prototyped. A new /pathogens/inputs page shows the core files. The /staging page (previously not working) also now uses this new UI.
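As a mental model for the "datasets and files, including past versions" part, here is a rough sketch of how inventory rows could be collapsed into per-object version lists. It is written in Python purely for illustration and is not the code in this PR; the row fields (`key`, `version_id`, `last_modified`, `is_delete_marker`), the function name, and the sample rows are all assumptions standing in for whichever columns the inventory is actually configured to emit.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_versions(rows: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group inventory rows by object key, newest version first.

    Each row is assumed (hypothetically) to carry `key`, `version_id`,
    `last_modified` (ISO 8601 string, UTC) and `is_delete_marker`.
    """
    by_key: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        # Delete markers record a removal rather than file contents, so skip them.
        if row.get("is_delete_marker"):
            continue
        by_key[row["key"]].append(row)
    for versions in by_key.values():
        # Same-format ISO 8601 UTC timestamps sort chronologically as strings.
        versions.sort(key=lambda r: r["last_modified"], reverse=True)
    return dict(by_key)


# Made-up sample rows, for illustration only.
rows = [
    {"key": "zika.json", "version_id": "a1",
     "last_modified": "2023-01-01T00:00:00Z", "is_delete_marker": False},
    {"key": "zika.json", "version_id": "b2",
     "last_modified": "2023-02-01T00:00:00Z", "is_delete_marker": False},
    {"key": "old.json", "version_id": "c3",
     "last_modified": "2023-01-15T00:00:00Z", "is_delete_marker": True},
]
grouped = group_versions(rows)
```

In this sketch the first entry of each list is treated as the current version and the rest as past versions, which is roughly what the new UI needs in order to show history.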
As is the case with feature pushes of this scope, there is no clear end point. There are a huge number of potential improvements we can make, so for this PR please indicate whether you consider something blocking or an improvement we can tackle in subsequent PRs 🙏
From my point of view, the following need to be done before merge (in the same vein as how I approached the prototype, I'm trying to share work at an early stage):
Client-side error handling if the API calls 404 / time out / contain no data.
Amend the production IAM policy to grant access to the new bucket (review apps use the testing one, which I’ve already modified).
[Could be done after merge] Implement an S3 lifecycle rule to expire inventory objects after 30 days (?)
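For the lifecycle item, a minimal sketch of what such a rule might look like, assuming we want it to cover every object in the inventories bucket. The rule ID is made up and the 30-day figure is the tentative value from the list above. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` takes as its `LifecycleConfiguration` argument, but nothing here talks to AWS.

```python
import json

# Hypothetical rule ID; 30 days is the (still tentative) value proposed above.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-old-inventories",
            "Filter": {"Prefix": ""},  # empty prefix: apply to every object
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }
    ]
}

# Print the configuration so it can be reviewed before anyone applies it.
print(json.dumps(lifecycle_configuration, indent=2))
```

Applying it would be a separate, deliberate step, and we could equally set this up through the console or our infrastructure tooling.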
I'll make some notes via in-line comments about feature pushes I don't think are blocking here, but which come up as a result of this work, so that discussions can be threaded.
Testing
It should 🤞 all work via review apps. To test locally, you can avoid the S3 API calls by creating a (git-ignored) ./devData folder and adding a manifest+inventory for each bucket with the following filenames:
(You can pick any day's inventory, that's not so important for dev purposes.) Then run the server with a LOCAL_INVENTORY=true environment variable.
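The switch between local and S3 inventories amounts to something like the following sketch. It is illustrative Python rather than the server code; the function names are hypothetical, and treating the value case-insensitively is an assumption (the only value known to work is exactly `LOCAL_INVENTORY=true`).

```python
import os
from typing import Mapping


def use_local_inventory(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when LOCAL_INVENTORY is set to "true".

    Accepting other capitalisations is an assumption of this sketch.
    """
    return env.get("LOCAL_INVENTORY", "").strip().lower() == "true"


def inventory_location(env: Mapping[str, str] = os.environ) -> str:
    # Local mode reads from the git-ignored ./devData folder; otherwise
    # the inventories come from the private S3 bucket.
    if use_local_inventory(env):
        return "./devData"
    return "s3://nextstrain-inventories"
```

With the variable unset, the sketch falls back to the S3 bucket, so local mode is always opt-in.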
P.S. manifest JSON here refers to the S3 inventory manifest and is completely unrelated to our existing usage of manifest JSONs. However this work should eventually allow us to remove those manifests. So that's nice.