pangeo-data / pangeo-datastore

Pangeo Cloud Datastore
https://catalog.pangeo.io

Rethinking automated checks for pull requests #102

Closed charlesbluca closed 3 years ago

charlesbluca commented 4 years ago

After putting together a workflow to replicate the checks that Travis runs on pull requests, it is clear that we won't be able to run any tests that call to_dask() on requester pays data unless we use some form of service account authentication (hence the use of secrets.SA_EMAIL, PROJECT_ID, etc.).

Repository secrets are not exposed to pull requests from forks, which means the authentication required to open the datasets cannot be acquired if a pull request is made from a fork of this repo. With that in mind, should we:

  1. Refactor our testing for pull requests to check the catalogs in a way that avoids to_dask() (linting the catalogs, checking that each entry's urlpath exists, etc.); a rough sketch of this approach follows the list
  2. Create a low-authority service account with a publicly accessible service key that allows it to be used by forked repositories
  3. Find a way to direct any forked branches to the main repository so pull requests can be properly authenticated
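To make option 1 concrete, here is a minimal sketch of what such a check could look like, assuming the catalogs are intake YAML files kept under intake-catalogs/ with a sources mapping whose entries carry an args.urlpath; the paths and field names are assumptions about the repo layout, not a tested check:

```python
# Hypothetical lint-style check for option 1: validate catalog structure without
# ever calling to_dask(). Paths and expected fields are assumptions.
import glob
import yaml

def lint_catalogs(pattern="intake-catalogs/**/*.yaml"):
    problems = []
    for path in glob.glob(pattern, recursive=True):
        with open(path) as f:
            spec = yaml.safe_load(f)
        for name, source in (spec.get("sources") or {}).items():
            if "driver" not in source:
                problems.append(f"{path}: {name} has no driver")
            urlpath = (source.get("args") or {}).get("urlpath")
            if urlpath is None:
                # Nested catalogs may legitimately lack an urlpath; flag for review.
                problems.append(f"{path}: {name} has no args.urlpath")
    return problems

if __name__ == "__main__":
    for problem in lint_catalogs():
        print(problem)
```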

Could we get a discussion going on this?

rabernat commented 4 years ago

Thanks for thinking this through Charles. I think you have explained the problem well.

Here's an alternative idea. We could create a very simple API service that runs inside Google Cloud with the ability to open datasets, check their size, get the xarray repr, etc.

For reference, here is such a service I wrote for the pangeo gallery: https://github.com/pangeo-gallery/pangeo-gallery-bot/blob/master/main.py

Then we could call this API from both our automated checks in the CI and from the gallery website, for rendering previews asynchronously.
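As a rough illustration (not the code linked above), such a service could look something like this in FastAPI; the endpoint path and response fields are just placeholders:

```python
# Minimal sketch of the kind of service described; endpoint path and response
# fields are placeholders, not the pangeo-gallery-bot implementation.
import fastapi
import fsspec
import xarray as xr

app = fastapi.FastAPI()

@app.get("/dataset/zarr/info")
def zarr_info(url: str):
    """Open a zarr store lazily and report basic metadata about it."""
    try:
        ds = xr.open_zarr(fsspec.get_mapper(url), consolidated=True)
    except Exception as exc:
        raise fastapi.HTTPException(status_code=400, detail=f"could not open {url}: {exc}")
    return {"url": url, "nbytes": int(ds.nbytes), "repr": repr(ds)}
```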

rabernat commented 4 years ago

We could also use a cloud function for this.

charlesbluca commented 4 years ago

Thanks for sharing this reference! I'll look into FastAPI and post updates. As for the cloud function, is the idea that a pull request would trigger one to run the tests and report the results as a check?

rabernat commented 4 years ago

Either the API or the cloud function would work in basically the same way. We would issue an HTTP GET request that looks something like

GET http://pangeo-data-bot.gcp.pangeo.io/dataset/zarr/info?url=gs://pangeo-data/path/to/zarr

And this would return a JSON object with metadata, e.g. size, possibly an HTML repr, etc. Or it would fail if the data is unreadable.

A cloud function is probably the best way to go, since there is no state needed.
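On the CI side, the check could then be as simple as the sketch below, using the hypothetical endpoint above; the real URL and response shape would depend on how we deploy it:

```python
# Sketch of a PR check calling the hypothetical info endpoint above for each
# dataset URL; the endpoint and its response shape are assumptions.
import requests

INFO_ENDPOINT = "http://pangeo-data-bot.gcp.pangeo.io/dataset/zarr/info"

def check_dataset(url: str) -> dict:
    resp = requests.get(INFO_ENDPOINT, params={"url": url}, timeout=300)
    resp.raise_for_status()  # a 4xx/5xx here should fail the PR check
    return resp.json()
```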

rabernat commented 4 years ago

I created an example cloud function. https://us-central1-pangeo-181919.cloudfunctions.net/get-zarr-info?url=gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/

This returns the zarr info for the group:

{"Arrays":"lat, lat_bnds, lon, lon_bnds, od550aer, time, time_bnds, wavelength","Chunk store type":"fsspec.mapping.FSMap","Name":"/","No. arrays":8,"No. groups":0,"No. members":8,"Read-only":"False","Store type":"zarr.storage.ConsolidatedMetadataStore","Type":"zarr.hierarchy.Group"}

You can see and edit it here: https://console.cloud.google.com/functions/details/us-central1/get-zarr-info?project=pangeo-181919

If we work this way, I would try to do as much as possible using zarr rather than xarray. Zarr is much more lightweight and uses less memory.
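Roughly, the zarr-only version of the function could look like the sketch below; the actual code lives in the console link above, so the entry-point name and error handling here are guesses:

```python
# Rough reconstruction of a zarr-only info function; not the deployed code.
import flask
import fsspec
import zarr

def get_zarr_info(request):
    """HTTP Cloud Function: return zarr group info for a gs:// URL."""
    url = request.args.get("url")
    if not url:
        return ("missing 'url' query parameter", 400)
    try:
        store = fsspec.get_mapper(url)
        # Only consolidated metadata is read, so this stays fast and lightweight.
        group = zarr.open_consolidated(store)
    except Exception as exc:
        return (f"could not open {url}: {exc}", 400)
    # info_items() yields the same key/value pairs shown in the JSON above.
    return flask.jsonify(dict(group.info_items()))
```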

charlesbluca commented 4 years ago

This is great! Thanks for putting this together Ryan.

I'll tweak the cloud function to see if we can get disk usage and/or an HTML representation. I think it might be useful to keep a fast-loading metadata overview like this on a separate endpoint from more complex operations like checking disk usage or loading an xarray preview.

rabernat commented 4 years ago

I think the term for this sort of thing is “microservices”. It’s very appealing because of the simplicity. I coded that up right in the browser editor.

But as with all code, we should always find some way to test it if we are going to rely on it. This is a useful overview: https://cloud.google.com/functions/docs/testing/test-overview. We would also eventually want to get our cloud functions hooked up through GitHub, as you did for the Flask app.

If we can move the whole backend to these microservices, we can move to a serverless frontend like Vue.js.


charlesbluca commented 4 years ago

Looking into your cloud function in more depth, it looks like we still run into a similar issue: the service account whose identity the function assumes does not have credentials to access data in requester pays buckets.

For now I'll focus on adding some functionality to this microservice (such as HTML representations of datasets and their respective usage info), but at the moment this API will probably be limited to the publicly available data in gs://cmip6/.

Ideally, a long-term solution would be to update intake/intake-xarray to the point where we can pass a project ID into to_dask() and have it recognized and used by the GCSFS backend when opening requester pays datasets.

charlesbluca commented 4 years ago

Looks like my problem was a simple fix - just needed to pass requester_pays=True.

For now I'm going to expand the cloud function so that it is able to open any of the datasets in the catalog or return a 400 on failure -- that should be all we need to fix the pull request checks, but I'm sure additional features would be a huge help for the website.
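For anyone following along, the fix amounts to something like this inside the cloud function, assuming gcsfs is handling the GCS access; the urlpath below is a placeholder:

```python
# The missing piece was requester_pays=True; billing goes to the project of the
# credentials the function runs under. The urlpath below is a placeholder.
import gcsfs
import zarr

fs = gcsfs.GCSFileSystem(requester_pays=True)
store = fs.get_mapper("gs://pangeo-data/path/to/zarr")
group = zarr.open_consolidated(store)
print(dict(group.info_items()))
```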

charlesbluca commented 4 years ago

Got blocked trying to access rasterio-backed datasets -- rasterio's handling of GCP credentials seems to be limited to JSON tokens, so having authorization handled internally doesn't seem to be an option.

@rabernat, are you able to access any of the HydroSHEDS/SoilGrids data using strictly rasterio?
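One possible workaround, assuming GDAL's /vsigs/ handler is doing the GCS I/O underneath rasterio: point GOOGLE_APPLICATION_CREDENTIALS at a service account JSON key and set GS_USER_PROJECT for requester pays billing. Both should be accepted as GDAL config options via rasterio.Env, though the exact option names are worth double-checking against the GDAL docs; the key file and bucket path below are placeholders:

```python
# Hedged sketch: GDAL config options for /vsigs/ passed through rasterio.Env.
# GOOGLE_APPLICATION_CREDENTIALS and GS_USER_PROJECT are believed to be the
# relevant options; the bucket path and key file are placeholders.
import rasterio

with rasterio.Env(
    GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json",
    GS_USER_PROJECT="pangeo-181919",  # billing project for requester pays
):
    with rasterio.open("/vsigs/hydrosheds/path/to/file.tif") as src:
        print(src.profile)
```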

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.