pangeo-data / storage-benchmarks

testing performance of different storage layers
Apache License 2.0

Xarray backends or storage APIs #6

Open jhamman opened 6 years ago

jhamman commented 6 years ago

I think we should discuss whether or not to use xarray as the common interface for all of the benchmarks we evaluate as part of this project. There are pros and cons to using/not-using xarray. I bring this up because I noticed the direct use of h5py in #4.

Pros include:

  1. Xarray provides a common interface wherein we can build real-world science problems without writing custom interfaces to each storage API (that's what xarray does)
  2. Within pangeo, we are promoting the use of high-level data structures (typically xarray, but Iris as well)

Cons include:

  1. There are some known performance problems with xarray backends, some of which are not particularly storage format specific and can potentially be side-stepped by using the lower level storage api.
  2. Using xarray assumes that we have implemented each api in a fair / equivalent way. We may introduce bias into one backend because of an incomplete/ill-performing implementation.

My vote would be to use xarray until we see it necessary to have more fine-grained tests. I think this will make implementation of real-world workflows easier and will be useful to us xarray developers in understanding chokepoints in the backends that we currently support.

mrocklin commented 6 years ago

My view is that XArray will be important to get information on real-world use cases, but that numpy or dask.array will be important as a first step. For example I would not expect TileDB folks to jump all the way to XArray without first stopping off at Numpy and then Dask.array.

My personal recommendation would be to avoid policies and restrictions until people actually start doing actual benchmarks here.

jreadey commented 6 years ago

I was thinking of something lower level than XArray/Dask (at least for a first order benchmark) so as to be a bit closer to the actual I/O operations.

An h5py-like API seems to be the common interface to the various backends (HSDS, zarr, HDF5Lib).

One level up using h5netcdf would let us use the same test driver and test HSDS and HDF5Lib. We have a PR to switch between h5py and h5pyd based on the filepath (https://github.com/shoyer/h5netcdf/pull/41). I imagine it wouldn't be too hard to add something like that for zarr.
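Something along these lines, for example (a rough sketch; the scheme check here is just my assumption, not the actual logic in that PR):

    import h5py

    def open_h5file(path, mode="r"):
        # Assumption: HSDS domains are addressed with an http(s) endpoint or
        # the hdf5:// scheme; anything else is treated as a local HDF5 file.
        if path.startswith(("http://", "https://", "hdf5://")):
            import h5pyd  # h5py-compatible client for HSDS
            return h5pyd.File(path, mode)
        return h5py.File(path, mode)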

rabernat commented 6 years ago

Joe...I hear your point loud and clear. It feels painful and wasteful to duplicate things I know already exist in xarray.

The problem is that TileDB does not yet have an xarray backend. It was a ton of work to implement the zarr backend. 95% of this work was about encoding and required a pretty deep knowledge of xarray internals. The desire to compare TileDB to existing backends is really what motivated us to start this benchmarking in the first place.

Here we are interested primarily in the raw throughput between the array storage library and dask. It is remarkably easy to create a dask array out of any array-like object. So my thinking was that we could bypass xarray, and all the encoding / attribute complications that it introduces, and focus on the lower layers of the stack.

However, maybe we can use xarray for all of the existing backends, and for the new ones just do something like

    import dask.array as da
    import xarray as xr

    array_store = tiledb.dataset('some_file')  # hypothetical TileDB open call returning an array-like
    dask_array = da.from_array(array_store, chunks=array_store._chunks)
    ds = xr.DataArray(dask_array)

rabernat commented 6 years ago

It would actually be very interesting to quantify how much overhead xarray introduces, so I think maybe the answer is just "all of the above"?
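For concreteness, a pair of ASV-style timings like the following could expose that difference (the file path and variable name are placeholders):

    import netCDF4
    import xarray as xr

    PATH, VAR = "benchmark_data.nc", "tasmax"   # placeholder file and variable

    def time_read_netcdf4_raw():
        # go straight through the netCDF4-python API
        with netCDF4.Dataset(PATH) as nc:
            nc.variables[VAR][:]

    def time_read_netcdf4_xarray():
        # same file, routed through the xarray backend machinery
        with xr.open_dataset(PATH) as ds:
            ds[VAR].load()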

rabernat commented 6 years ago

My personal recommendation would be to avoid policies and restrictions until people actually start doing actual benchmarks here.

I'm afraid I'm guilty of doing an "actual benchmark," perhaps prematurely: https://github.com/pangeo-data/storage-benchmarks/blob/master/benchmarks/hdf5.py
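Roughly, it follows the standard ASV pattern: setup writes a temporary HDF5 file, the timing method reads it back, and teardown removes it. A simplified sketch of that pattern (sizes and names are illustrative, not copied from the actual file):

    import os
    import shutil
    import tempfile

    import h5py
    import numpy as np

    class TimeHDF5Read:
        def setup(self):
            # write a small temporary HDF5 file to read back in the benchmark
            self.tmpdir = tempfile.mkdtemp()
            self.path = os.path.join(self.tmpdir, "test.h5")
            with h5py.File(self.path, "w") as f:
                f.create_dataset("var", data=np.random.rand(1024, 1024))

        def time_read_full(self):
            with h5py.File(self.path, "r") as f:
                f["var"][:]

        def teardown(self):
            shutil.rmtree(self.tmpdir)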

jreadey commented 6 years ago

@rabernat - premature or not, your benchmark brings up some interesting points that wouldn't have occurred to me otherwise. One is that, since it uses tempfile and shutil, there's an assumption that these are files on a POSIX filesystem. E.g. this code wouldn't work with h5pyd instead of h5py.

If we want to support content in cloud storage and non-POSIX filesystems, we might want to create a simple cross-platform package for things like removing "files" (which may not really be files) and listing directory contents.
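A minimal sketch of what that shim could look like (the class and method names here are invented for illustration):

    import os
    import shutil

    class LocalStorage:
        """POSIX implementation of the handful of operations the benchmarks
        need; an S3/HSDS counterpart would expose the same methods."""

        def remove(self, path):
            # a "file" may actually be a directory tree (e.g. a zarr store)
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)

        def listdir(self, path):
            return os.listdir(path)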

The other point is that the setup initializes a file and teardown removes it. This is convenient since there is no file state that has to be managed outside the context of the test run, but it wouldn't be feasible if we are running against really large data collections such as LOCA. Also, since the data has just been written, it is likely to be "hot" in the sense that much of the content is probably sitting in a disk cache or similar. E.g. with HSDS it's probable that the entire dataset would still be in RAM on the server and the read would not even touch S3.

Can we base these tests on the assumption that there is an existing data collection that has already been loaded? We could have separate install/uninstall scripts to do the initial load, but not have to run these with each test run.
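With ASV, setup() could simply check for the collection and skip rather than create it, something like this (the path is a placeholder; the install script would populate it once):

    import os

    DATA_ROOT = "/data/loca"   # placeholder; filled once by an install script

    class TimeLOCARead:
        def setup(self):
            # skip instead of regenerating a huge collection on every run
            if not os.path.isdir(DATA_ROOT):
                raise NotImplementedError("LOCA collection not installed")

        def time_open_collection(self):
            import xarray as xr
            xr.open_mfdataset(os.path.join(DATA_ROOT, "*.nc")).close()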

rabernat commented 6 years ago

@jreadey - the point of the example that I wrote was more about showing how to use ASV and how one might structure the benchmark suite. All the things you say are absolutely correct. For more "realistic" benchmarks involving cloud datasets, the setup and teardown functions would involve setting up cloud credentials and connecting to the appropriate resources. The simple example I made will probably have to be refactored considerably in order to accommodate a more general approach. It's up to you, @kaipak, and whoever else wants to participate to sort out these questions. I just wanted to put something up there to get the ball rolling.

FWIW, I think the benchmark suite should include on-disk performance tests (in addition to cloud datasets). As long as we are going to go to the trouble of doing this, let's make it comprehensive of all use cases.

jreadey commented 6 years ago

@rabernat - yes this is very helpful.

Agree we want to measure on-disk performance. For HPC data centers I imagine this would mean having the data on a shared filesystem. For the cloud we could either copy files to an attached drive (which would be a bit of a hassle to manage) or use AWS EFS (sort of a managed NFS). Does GC have something equivalent?

In addition to the performance, we should document the cost model. E.g. extra storage for zarr data, containers to run HSDS, EFS costs, etc.

jhamman commented 6 years ago

It would actually be very interesting to quantify how much overhead xarray introduces, so I think maybe the answer is just "all of the above"?

So it sounds like we need a tiered approach:

  1. Numpy computations backed by storage API (netcdf4-python, h5netcdf, hsds, zarr, etc)
  2. Dask computations backed by storage APIs.
  3. Xarray/numpy computations
  4. Xarray/dask computations

I'm happy to see us start with (1) and move forward only once we know more about how this is working. I suspect we have lots of other decisions to make regarding file systems, etc. that should take precedence in the near term.
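For concreteness, the four tiers applied to the same (hypothetical) Zarr store might look roughly like this (store path and variable name are placeholders):

    import dask.array as da
    import numpy as np
    import xarray as xr
    import zarr

    STORE, VAR = "benchmark_data.zarr", "tasmax"   # placeholders

    z = zarr.open(STORE, mode="r")[VAR]
    mean_1 = np.asarray(z).mean()                                 # (1) numpy on the raw storage API
    mean_2 = da.from_array(z, chunks=z.chunks).mean().compute()   # (2) dask on the raw storage API

    ds = xr.open_zarr(STORE)
    mean_3 = ds[VAR].load().mean()                                # (3) xarray backed by numpy
    mean_4 = ds[VAR].mean().compute()                             # (4) xarray backed by dask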

@jreadey / @kaipak - let me know if you need any help getting started with the LOCA dataset, xarray, or zarr.

rabernat commented 6 years ago

@jhamman I like the approach you defined here.

It would be nice to design the benchmark suite using a composable set of classes which make it easy to plug the different elements together.
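For example, something vaguely like this (entirely illustrative; the interface is made up):

    import numpy as np
    import zarr

    class ZarrStore:
        """One possible storage element; other backends would implement the
        same two methods (the interface here is invented for illustration)."""

        def create(self, shape=(1024, 1024)):
            self._z = zarr.zeros(shape, chunks=(256, 256), dtype="f8")
            self._z[:] = np.random.rand(*shape)

        def open(self):
            return self._z

    class ReadSuite:
        """Benchmark element that only knows about the store interface above."""

        store_class = ZarrStore   # swap in an Hdf5Store, HsdsStore, ... here

        def setup(self):
            self.store = self.store_class()
            self.store.create()

        def time_full_read(self):
            self.store.open()[:]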

jakebolewski commented 6 years ago

After looking at the XRay backend interface I would have some concerns (at the moment) about the level of effort it would take (on the TileDB side):

I think documentation would be the first issue, but making it easier to develop a backend would have large payoffs for this project.

rabernat commented 6 years ago

@jakebolewski - thanks for this valuable feedback! We are all in agreement that the xarray backend API and test suite could use some improvement. It is basically accessible only to seasoned xarray devs. I don't think that anyone expects TileDB devs to take this on at this point.

(FYI, the name "XRay" was abandoned quite a while ago; the project is now called "xarray".)

cc @shoyer

shoyer commented 6 years ago

I'm not even sure xarray's backend API is accessible to seasoned xarray devs :).

The API has grown organically as our needs have gotten more complex, and we haven't taken the necessary time to clean it up.

jhamman commented 6 years ago

@shoyer - do we have an issue in the xarray repository that should be used to discuss this? I can probably take on a backend API refactor later this year as part of the pangeo project but this is not the right issue to hash out the details.

shoyer commented 6 years ago

@jhamman I don't think we have an open issue, just my stalled PR (https://github.com/pydata/xarray/pull/1087). Please open one!

kaipak commented 6 years ago

So it sounds like we need a tiered approach:

  1. Numpy computations backed by storage API (netcdf4-python, h5netcdf, hsds, zarr, etc)
  2. Dask computations backed by storage APIs.
  3. Xarray/numpy computations
  4. Xarray/dask computations

Sounds like a good plan. I just started working on this and diving into all the pieces that need to fit together for the benchmarks.

@jreadey / @kaipak - let me know if you need any help getting started with the LOCA dataset, xarray, or zarr.

As the total newbie here, I would gladly accept any assistance on getting up to speed on all this. I've been playing around a bit with Xarray, Dask, and Zarr so I shouldn't be completely useless at this juncture ;).