pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

HSDS: another way to access HDF5/NETCDF4 "files" from S3 #75

rsignell-usgs closed this issue 6 years ago

rsignell-usgs commented 6 years ago

I was able to successfully run a demonstration notebook accessing data from HSDS, which, like zarr, stores HDF5 or NETCDF4 datasets as chunks, with each chunk in an S3 object.
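To make the chunk-per-object idea concrete, here is a minimal sketch of how a hyperslab selection maps onto a set of chunk objects. The key layout in `object_key` and the dataset id `d-example` are hypothetical, chosen only for illustration; they are not the naming scheme HSDS actually uses.

```python
from itertools import product


def chunks_for_selection(selection, chunk_shape):
    """Return the chunk grid indices touched by a hyperslab selection.

    selection: one (start, stop) pair per dimension, stop exclusive.
    chunk_shape: chunk size along each dimension.
    """
    ranges = []
    for (start, stop), size in zip(selection, chunk_shape):
        if stop <= start:
            return []  # an empty selection touches no chunks
        first = start // size
        last = (stop - 1) // size
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


def object_key(dataset_id, chunk_index):
    # Hypothetical key layout, for illustration only.
    return f"{dataset_id}/" + "_".join(str(i) for i in chunk_index)


# Rows 50..249 and columns 0..99 of a dataset stored in 100 x 100 chunks.
touched = chunks_for_selection([(50, 250), (0, 100)], (100, 100))
keys = [object_key("d-example", idx) for idx in touched]
print(keys)
```

Only the objects behind the touched chunks need to be fetched, which is why a small read from a large dataset can stay cheap even over a wide-area link.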

In the sample notebook here, which I ran on pangeo (Google Cloud), I'm accessing data from an HSDS instance on XSEDE, yet the access times are comparable to running the same notebook on XSEDE itself. I assume Google Cloud and XSEDE are connected via Internet2.

[Two screenshots: 2018-01-15_11-21-53 and 2018-01-15_11-21-19]

To run this notebook on pangeo as I did, you would need to:

Here's the procedure I used for creating the h5pyd environment:

conda env create -f h5pyd_env.yml -y
source activate h5pyd
conda install xarray -y
conda remove h5netcdf
pip install --no-deps --upgrade git+https://github.com/ajelenak-thg/h5netcdf.git@h5pyd
conda install --no-deps xarray -y
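After running the steps above, a quick sanity check is to confirm that the three packages are importable from the activated environment. This is a minimal sketch; the helper name `missing_packages` is made up for illustration, and it only checks that the modules can be found, not which fork or version is installed.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["h5pyd", "h5netcdf", "xarray"])
if missing:
    print("Not importable in this environment:", ", ".join(missing))
else:
    print("h5pyd, h5netcdf and xarray are all importable")
```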

For more info on HSDS, check out John Readey's SciPy 2017 talk on HSDS.

mrocklin commented 6 years ago

I'm very glad to see this.

Some things that would be interesting to try if anyone has time:

  1. Try XArray + Dask locally on the HSDS data to verify that it can be accessed concurrently from multiple threads
  2. Try XArray + Dask.distributed locally on the HSDS data to verify that the h5pyd objects can survive being serialized
  3. Try everything on a distributed cluster using KubeCluster and then look at the performance of scalable computing
  4. Try this all again on a cluster running on AWS next to the S3 data, where presumably we would expect 100-200 MB/s network access from each node.
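
The first two checks in the list can be sketched generically before pointing them at real data. The code below is only a harness under stated assumptions: `read` and the plain Python list are stand-ins for a remote dataset, not the h5pyd API, and the function names are made up for illustration. With a real h5pyd-backed object, the pickle round trip is the interesting part, since such an object may hold connection state that does not serialize.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def read(dataset, start, stop):
    # Stand-in for a remote chunked read; here the "dataset" is a plain list.
    return dataset[start:stop]


def concurrent_reads_match(dataset, slices, n_threads=4):
    """Check that threaded reads return the same data as serial reads."""
    serial = [read(dataset, a, b) for a, b in slices]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        threaded = list(pool.map(lambda s: read(dataset, *s), slices))
    return serial == threaded


def survives_pickle(dataset, probe):
    """Check that the object still reads correctly after a pickle round trip,
    which is what Dask.distributed needs when shipping it to workers."""
    clone = pickle.loads(pickle.dumps(dataset))
    return read(clone, *probe) == read(dataset, *probe)


ds = list(range(1000))
slices = [(i, i + 100) for i in range(0, 1000, 100)]
threads_ok = concurrent_reads_match(ds, slices)
pickle_ok = survives_pickle(ds, (10, 20))
print("threads:", threads_ok, "pickle:", pickle_ok)
```

Swapping the list for an opened dataset and `read` for the real slicing call would turn this into the actual experiment; a failure in `survives_pickle` would show up as an exception from `pickle.dumps` or from the first read on the clone.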

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.