List datasets + files from core + staging sources

jameshadfield commented 12 months ago

This is a pretty major PR and most detail is in commit messages + comments in the code. A high level overview of the functionality introduced:

For our core & staging sources (s3://nextstrain-data and s3://nextstrain-staging, respectively) we create daily inventories on S3, which are stored in the private bucket s3://nextstrain-inventories. These are essentially free to generate (<$1/year/bucket).
Our server pulls these inventories down daily and maintains a set of datasets and files, including past versions of each where available.
A new API is introduced to access this data in a cross-source fashion, including filtering.
The /pathogens page shows the core datasets using the UI I recently prototyped. A new /pathogens/inputs page shows the core files. The /staging page (previously not working) also now uses this new UI.
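
To illustrate the second point above, here is a minimal sketch of grouping inventory rows into per-key version histories. It is Python for illustration only, not the server code in this PR. It assumes the rows have already been parsed into dicts carrying the S3 Inventory fields `Key`, `VersionId`, `IsDeleteMarker` and `LastModifiedDate`, with booleans as "true"/"false" strings and ISO 8601 timestamps. The name `group_versions` and the sample keys are hypothetical.

```python
from collections import defaultdict


def group_versions(rows):
    """Map each S3 key to its object versions, newest first, skipping delete markers."""
    by_key = defaultdict(list)
    for row in rows:
        # Assumption: booleans arrive as the strings "true" / "false".
        if row.get("IsDeleteMarker", "false").lower() == "true":
            continue
        by_key[row["Key"]].append(row)
    for versions in by_key.values():
        # Assumption: timestamps share one ISO 8601 format, so string order is time order.
        versions.sort(key=lambda r: r["LastModifiedDate"], reverse=True)
    return dict(by_key)


# Hypothetical sample rows, not taken from a real inventory.
rows = [
    {"Key": "zika.json", "VersionId": "a1", "IsDeleteMarker": "false",
     "LastModifiedDate": "2023-01-01T00:00:00.000Z"},
    {"Key": "zika.json", "VersionId": "b2", "IsDeleteMarker": "false",
     "LastModifiedDate": "2023-02-01T00:00:00.000Z"},
    {"Key": "old.json", "VersionId": "c3", "IsDeleteMarker": "true",
     "LastModifiedDate": "2023-03-01T00:00:00.000Z"},
]
history = group_versions(rows)
```

In this sketch, a key whose only inventory entry is a delete marker is dropped entirely. Whether deleted objects should still be listed is a design choice the sketch does not try to settle.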

As is the case with feature pushes of this scope, there is no clear end point. There are a huge number of potential improvements we can make, so for this PR please indicate whether you consider something blocking or an improvement we can tackle in subsequent PRs 🙏

From my point of view, the following need to be done before merge (in the same vein as how I approached the prototype, I'm trying to share work at an early stage):

Client-side error handling for when the API calls return a 404, time out, or contain no data.
Amend the production IAM policy to grant access to the new bucket (review apps use the testing one, which I’ve already modified).
[Could be done after merge] Implement an S3 lifecycle rule to expire inventory objects after 30 days (?)
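
For the lifecycle item, a 30-day expiry could be expressed roughly as below. This is a sketch rather than a tested config: the rule ID and the helper name `expiry_lifecycle` are hypothetical, and the 30-day figure is the tentative value from the list above. The dict follows the shape of an S3 lifecycle configuration, as accepted by boto3's `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...)`.

```python
import json


def expiry_lifecycle(days=30, prefix=""):
    """Build an S3 lifecycle configuration that expires objects `days` after creation."""
    return {
        "Rules": [
            {
                "ID": f"expire-inventories-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # an empty prefix matches every object
                "Expiration": {"Days": days},
            }
        ]
    }


# Print the configuration as JSON, e.g. for review before applying it.
print(json.dumps(expiry_lifecycle(), indent=2))
```

If the inventories bucket has versioning enabled, `Expiration` only affects current versions, so a `NoncurrentVersionExpiration` entry would be needed as well.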

I'll make some notes via in-line comments about feature pushes I don't think are blocking here, but which come up as a result of this work, so that discussions can be threaded.

Testing

It should 🤞 all work via review apps. To test locally, you can avoid the S3 API calls by creating a (git-ignored) ./devData folder and adding a manifest+inventory for each bucket with the following filenames:

./devData/core.manifest.json          ./devData/core.inventory.csv.gz
./devData/staging.manifest.json       ./devData/staging.inventory.csv.gz

(You can pick any day's inventory; which one is not important for dev purposes.) Then run the server with the environment variable LOCAL_INVENTORY=true set.
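
To sanity-check the local files before starting the server, something like the following Python sketch may help. It is not the server's loader. It assumes the inventory is in CSV format, where S3 writes no header row and the column names come from the manifest's `fileSchema` field. The function names are hypothetical.

```python
import csv
import gzip
import json
import os
from pathlib import Path


def load_local_inventory(name, dev_dir="./devData"):
    """Read <name>.manifest.json and <name>.inventory.csv.gz into a list of dicts."""
    dev = Path(dev_dir)
    with open(dev / f"{name}.manifest.json", encoding="utf-8") as fh:
        manifest = json.load(fh)
    # fileSchema is a comma-separated list of column names, e.g. "Bucket, Key, ...".
    columns = [col.strip() for col in manifest["fileSchema"].split(",")]
    inventory_path = dev / f"{name}.inventory.csv.gz"
    with gzip.open(inventory_path, "rt", encoding="utf-8", newline="") as fh:
        # No header row in the CSV, so pair each row with the manifest's columns.
        return [dict(zip(columns, row)) for row in csv.reader(fh)]


def use_local_inventory():
    """Mirror the LOCAL_INVENTORY=true switch described above."""
    return os.environ.get("LOCAL_INVENTORY") == "true"
```

Calling `load_local_inventory("core")` or `load_local_inventory("staging")` should then return one dict per inventoried object version, with every value left as a string.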

P.S. "Manifest JSON" here refers to the S3 inventory manifest and is completely unrelated to our existing usage of manifest JSONs. However, this work should eventually allow us to remove those manifests. So that's nice.

tsibley commented 12 months ago

Mostly a note to self. Things to make sure to review here based on the Blab Nextstrain meeting just now:

[ ] Lifecycle policies for s3://nextstrain-inventories
[ ] Terraform config for s3://nextstrain-inventories
[ ] Terraform config for IAM policies
[ ] How does/can this relate to resource-listing endpoints for use by RESTful API clients?
[ ] Lifecycle policies for other S3 buckets with trial/test/dev stuff… what are we missing?

jameshadfield commented 3 months ago

Replaced by #803 and #719

nextstrain / nextstrain.org

List datasets + files from core + staging sources #700