materialsproject / MPContribs

Platform for materials scientists to contribute and disseminate their materials data through Materials Project
https://mpcontribs.org

portal: S3 versioned downloads of projects/datasets #81

Open · tschaume opened this issue 4 years ago

tschaume commented 4 years ago

Analogous to FigShare's ndownloader URLs, the MPContribsML deployment already supports downloading full matbench_* datasets from AWS S3, compressed as .json.gz files and served through a top-level URL, e.g. https://ml.materialsproject.cloud/matbench_expt_gap.json.gz. This functionality is being used in hackingmaterials/matminer#446 to load datasets for mining.
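As a point of reference, a minimal client-side sketch of consuming such a snapshot (only the URL above is taken from this issue; everything else is generic Python):

```python
# Download a full dataset snapshot and load it into memory.
import gzip
import json

import requests

url = "https://ml.materialsproject.cloud/matbench_expt_gap.json.gz"
resp = requests.get(url, timeout=60)
resp.raise_for_status()

# Decompress the gzipped payload and parse the JSON document.
data = json.loads(gzip.decompress(resp.content))
print(type(data), len(data))
```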

Powered by the /contributions/download/ endpoint in the MPContribs API, downloads triggered through the corresponding button on the MPContribs landing pages are currently generated dynamically on the server side and respect the filters and sorting (query parameters) applied in the UI. Both csv and json formats are supported, each with or without full structure objects (full vs minimal). Right now, the only supported MIME type is gz, but other types could be valuable to implement in the future (e.g. bzip, jpeg, png, vnd.plotly.v1+json, x-ipynb+json, ods, pdf, tar, zip, excel, xml).
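For illustration, a request against this endpoint could look roughly like the sketch below; the deployment URL and parameter names (project, format, _sort) are assumptions for the example, not the documented API contract:

```python
# Hedged sketch: download a filtered/sorted export through the API endpoint.
import requests

API = "https://contribs-api.materialsproject.org"  # assumed deployment URL
params = {
    "project": "matbench_expt_gap",  # filter applied in the UI (assumed name)
    "format": "json",                # or "csv"
    "_sort": "-data.gap.value",      # sorting carried over from the UI (assumed)
}
resp = requests.get(f"{API}/contributions/download/", params=params, timeout=120)
resp.raise_for_status()

# Per the comment above, the response is gzip-compressed.
with open("matbench_expt_gap.json.gz", "wb") as f:
    f.write(resp.content)
```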

To reduce duplicate and potentially heavy DB queries, the download endpoint could encode the query in the filename and save the result to S3 upon first request or after an update of the underlying data. This would implicitly maintain versioned snapshots of the datasets: API POST/PUT requests would add a timestamp to the old file, and the next GET request would generate a new one. The S3 bucket storing the exported project data would have a sub-folder for each MPContribs deployment (Main, ML, LightSources, ...).
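A minimal sketch of this naming and snapshotting scheme (bucket name, key layout, and hash length are all hypothetical):

```python
# Encode the query in the S3 key; timestamp the old export on data updates.
import hashlib
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "mpcontribs-downloads"  # hypothetical bucket

def snapshot_key(deployment: str, project: str, query: dict) -> str:
    # Canonicalize the query so logically identical requests map to one object.
    digest = hashlib.sha1(json.dumps(query, sort_keys=True).encode()).hexdigest()[:12]
    return f"{deployment}/{project}_{digest}.json.gz"

def archive_on_update(key: str) -> None:
    # On POST/PUT to the underlying data: rename the current export by
    # appending a timestamp so the next GET regenerates a fresh snapshot.
    ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    stem, ext = key.split(".", 1)
    s3.copy_object(
        Bucket=BUCKET,
        CopySource={"Bucket": BUCKET, "Key": key},
        Key=f"{stem}_{ts}.{ext}",
    )
    s3.delete_object(Bucket=BUCKET, Key=key)
```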

A progress bar is needed while the first export file for a project and query is generated on S3. It would use server-sent events and a Redis cache, as already implemented for the dynamic (re-)generation of the Jupyter notebooks that power the MPContribs Contribution Details Pages.
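The plumbing could look like the following Flask sketch, assuming the export worker writes its percentage to a Redis key; the route and key names are made up for illustration:

```python
# SSE endpoint streaming export progress from Redis to the browser.
import json
import time

import redis
from flask import Flask, Response

app = Flask(__name__)
cache = redis.Redis()

@app.route("/projects/<project>/download/progress")
def progress(project):
    def stream():
        key = f"download:{project}:progress"  # written by the export worker
        while True:
            pct = float(cache.get(key) or 0)
            yield f"data: {json.dumps({'progress': pct})}\n\n"
            if pct >= 100:
                break
            time.sleep(1)
    return Response(stream(), mimetype="text/event-stream")
```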

If a file for the specific project and query exists without a timestamp, ~~the API's /contributions/download/ endpoint would simply return a 302 Redirect to S3, thus offloading download traffic to S3. Alternatively,~~ the API could use the boto3 client to retrieve the file from S3, load it into memory, and return it in the response to the request. ~~However, this would cause unnecessary implementation, maintenance, and monitoring effort as well as strain on the API Fargate tasks.~~
EDIT 06/19/2020: I chose to always go through the API Fargate task and keep the S3 bucket private (the next paragraph is outdated).
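The chosen route through the API could look roughly like this sketch (streaming the object rather than fully buffering it, a small variation on the in-memory wording above; route and bucket names are illustrative):

```python
# API-side download: fetch the export from the private bucket via boto3
# and return it, so auth can be enforced on every request.
import boto3
from flask import Flask, Response

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "mpcontribs-downloads"  # hypothetical private bucket

@app.route("/contributions/download/<key>")
def download(key):
    obj = s3.get_object(Bucket=BUCKET, Key=f"main/{key}.json.gz")
    return Response(
        obj["Body"].iter_chunks(),  # stream S3 chunks through the API task
        mimetype="application/gzip",
        headers={"Content-Disposition": f"attachment; filename={key}.json.gz"},
    )
```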

~~The consequence of a simple redirect is that authentication/authorization can be enforced when generating the file export (saving to S3) on the first request, but not on subsequent download requests from the public S3 bucket. The MPContribs URLs for the portal and the API could technically still use authentication/authorization for retrieval of the data exports, but the URL to the S3 object would need to be public anyway. S3 storage of export files would thus only be enabled for public projects, which could be an additional inducement for contributors to make their data available to the public.~~

Saving files from the API Fargate task to S3 does not incur extra data-transfer or processing costs, since the S3 Gateway Endpoint is free (as opposed to a NAT Gateway) and the S3 bucket is in the same AWS region. However, there will be costs for the traffic caused by downloads of the (compressed and predominantly small) S3 objects and for their storage. The latter can be optimized by setting up lifecycle policies that automatically move objects into other storage classes depending on how often they are accessed. For instance, old timestamped snapshots would likely move into cheaper Glacier storage since they'll only be needed/downloaded occasionally.
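Such a policy could be configured along these lines (bucket name, prefix, and the 90-day threshold are assumptions for illustration):

```python
# Lifecycle rule moving old snapshots in a deployment's sub-folder to Glacier.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="mpcontribs-downloads",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-snapshots",
                "Filter": {"Prefix": "main/"},  # one sub-folder per deployment
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```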

tschaume commented 4 years ago