pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

S3-netcdf-python #203

Closed jhamman closed 6 years ago

jhamman commented 6 years ago

This issue is meant to introduce the S3-netcdf-python developers to Pangeo and vice versa. I saw @nmassey001's talk at EGU and suggested that it would be good to connect here. It may even make sense to add S3-netcdf-python to our benchmark suite.

@nmassey001, if you're comfortable with it, maybe you could share your slides with folks here?

dopplershift commented 6 years ago

cc @WardF @DennisHeimbigner

nmassey001 commented 6 years ago

I've committed my EGU presentation to the presentations folder in the repo. https://github.com/cedadev/S3-netcdf-python/blob/master/presentations/EGU_2018_nrmassey.pdf

I'd suggest reading the README.md file if you haven't already: https://github.com/cedadev/S3-netcdf-python/blob/master/README.md

@jhamman pangeo looks like a really good initiative, so I'd be happy to be a part of it.

rabernat commented 6 years ago

This looks fantastic. Similar to what we are doing currently with xarray + zarr.

Has anyone tried opening one of these stores with xarray? Presumably, if it uses the netcdf4-python API, it should "just work". However, there could be some overlap in functionality once dask gets involved (threads, caching, etc.).
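For illustration, a minimal sketch of that "just work" path, assuming S3netCDF4 exposes a netCDF4-compatible Dataset class as its README describes; the import path, the s3:// URL, and the direct wrapping in xarray's NetCDF4DataStore are unverified assumptions:

```python
# A minimal sketch, not verified against the package: the import path,
# the s3:// URL, and the wrapping step are illustrative assumptions.
import xarray as xr
from xarray.backends import NetCDF4DataStore
from S3netCDF4 import s3Dataset  # assumed import; may differ by version

# Open the master-array file; fragment reads happen behind the API.
nc = s3Dataset("s3://my-bucket/cmip/tas_day.nc", mode="r")

# If s3Dataset quacks like netCDF4.Dataset, xarray can wrap it directly.
ds = xr.open_dataset(NetCDF4DataStore(nc))
print(ds)
```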

@nmassey001: with Pangeo, our goal is to provide a scalable, flexible computational platform that can interface directly with data stored on HPC and cloud systems. It seems like there could be a lot of synergy between our efforts. There has been a lot of discussion of the storage-format question within our group; this blog post by @mrocklin indicates that our teams have converged on essentially the same viewpoint.

Very excited to see where this leads.

niallrobinson commented 6 years ago

In addition, are we going to test some kind of OPeNDAP approach? I think this is what Copernicus is using.

nmassey001 commented 6 years ago

In theory you should be able to write and read the fragments from an OPeNDAP server, but this is untested.
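For concreteness, a hedged sketch of what reading one fragment over OPeNDAP could look like, assuming the fragments are exposed by a DAP server (the URL and fragment name are illustrative) and that netCDF4-python was built with OPeNDAP support:

```python
# Illustrative only: the server URL and fragment name are assumptions.
import netCDF4

frag = netCDF4.Dataset("http://dap.example.org/thredds/dodsC/fragments/tas_000.nc")
print(frag.variables.keys())  # fragment metadata served over DAP
frag.close()
```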

bnlawrence commented 6 years ago

Untested but on my mind to test as soon as feasible ...

jreadey commented 6 years ago

Hey @nmassey001 - the message passing paths are not quite right in the HSDS diagram. It's: client -> LoadBalancer -> Service Node -> Data Node -> S3. The Async and head node singletons aren't typically involved in client requests.

For S3-netcdf - are multi-writer scenarios supported? It would seem there'd be a danger of one client overwriting another client's update.

Finally, @kaipak and I are developing a benchmark suite to compare different storage frameworks. It would be great if you'd like to add s3-netcdf to the mix.

nmassey001 commented 6 years ago

@jreadey Hi Jon, thanks for the clarification. I'll update the slides. I didn't have time to dwell on HSDS during my presentation anyway; we just wanted to show that we know other methods exist and that ours is not better, just different.

At the moment, multi-writer scenarios are not supported: one client could indeed overwrite another client's update. This is something we have to think about, but we could support it via something as simple as access control lists (ACLs), possibly updating the ACL dynamically.
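As one possible direction (not something the package implements), here is a hedged sketch of an advisory lock object next to the master file, using boto3; the bucket, keys, and lock convention are illustrative, and the check-then-write is not atomic, so it only narrows the window for conflicting updates:

```python
# Illustrative sketch only: boto3, the bucket/key names, and the lock
# convention are assumptions; head-then-put is not atomic on S3, so this
# narrows, but does not eliminate, the multi-writer race.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, LOCK_KEY = "cfa-store", "dataset/master.nc.lock"

def acquire_lock(owner: str) -> bool:
    """Best-effort advisory lock: refuse if another writer holds it."""
    try:
        s3.head_object(Bucket=BUCKET, Key=LOCK_KEY)
        return False  # lock object already exists
    except ClientError as err:
        if err.response["Error"]["Code"] not in ("404", "NoSuchKey"):
            raise
    s3.put_object(Bucket=BUCKET, Key=LOCK_KEY, Body=owner.encode())
    return True

def release_lock() -> None:
    s3.delete_object(Bucket=BUCKET, Key=LOCK_KEY)
```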

I'm happy to include s3-netcdf in any benchmarking or evaluation initiative, as we're very interested (from a data centre point of view) in the strengths and weaknesses of the different approaches. We can see a scenario where we deploy two or more, e.g. HSDS for serving archive data and s3-netcdf for user files.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.