pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io
699 stars · 188 forks

tiledb #120

Closed rabernat closed 4 years ago

rabernat commented 6 years ago

As long as we are discussing cloud storage formats, it seems like we should be looking at TileDB: https://www.tiledb.io/ https://github.com/TileDB-Inc/TileDB

TileDB manages data that can be represented as dense or sparse arrays. It can support any number of dimensions and store in each array element any number of attributes of various data types.

If it lives up to the promised goals, TileDB could really be the solution to our storage problems.

cc @kaipak, @sirrice

jhamman commented 6 years ago

@rabernat - I've seen this before too and it does look quite promising. I get the impression from reading the documentation that the Python API is not quite ready to go yet. In theory though, this could be yet another backend for xarray.

stavrospapadopoulos commented 6 years ago

Hello folks. We just released a new version of TileDB, which has optimized support for AWS S3 and a Python API that uses NumPy (there is a lot of info and examples in our docs: https://docs.tiledb.io/). We would love to get some feedback from you.

cc @jakebolewski, @tdenniston

mrocklin commented 6 years ago

Hi @stavrospapadopoulos ! Thanks for chiming in.

Do you happen to have a demo TileDB deployment somewhere publicly available that we could hit to try things out?

rabernat commented 6 years ago

Thanks @stavrospapadopoulos for sharing this information! TileDB looks like a promising possible storage layer for xarray / pangeo.

I have read through the documentation and have a couple of questions that I hope you can clarify whenever you might have the time:

Technical Questions

  1. Does TileDB use a client / server model? This was my impression when I saw the word "DB", but, looking more closely at the documentation, it appears that it is more like a [virtual] file format + an API for accessing those [virtual] files (similar to HDF5 or zarr).
  2. What sort of indexing is supported on the array-like objects returned by tiledb (e.g. the dense_array object in this example)?
  3. Does TileDB support distributed reads / writes on the same array (either via a shared filesystem or an S3 bucket) from multiple processes on a distributed cluster? If locking is required, does the python API provide a way for the user to pass a shared lock object?
  4. Does the python API support asynchronous I/O?
  5. The docs contain lots of examples, but can you point me to the complete python API documentation?
  6. Any estimate on when the conda-forge package will be available? (This greatly lowers the bar for the python community to try something new.)

Social Questions

  1. Can you point us towards some projects that are currently using TileDB?
  2. How does TileDB compare to zarr, our current library of choice for chunked / compressed array storage?
  3. Is anyone from your team interested in collaborating more directly on testing TileDB in the context of dask / xarray?
rabernat commented 6 years ago

Oh and one more technical question

  1. Is there a way to store array metadata in TileDB (e.g. units or other arbitrary attributes), as in HDF5, netCDF, zarr, etc?
stavrospapadopoulos commented 6 years ago

@jakebolewski, @tdenniston will chime in (probably next week), but here is my take.

Responses to Technical Questions

1) TileDB is an embeddable library (written in C and C++, more similar to HDF5 than zarr) that comes with a Python wrapper. You should really think of TileDB as an alternative to HDF5, with (among other important things) the extra capability of storing sparse arrays in addition to dense ones, and support for a wide variety of backends beyond POSIX filesystems (currently HDFS and S3, with more in the future).

- About “DB” in the name: when we started working on TileDB at MIT, we thought that we would be an alternative to SciDB and follow its client/server architecture. We eventually decided to make TileDB a lightweight embeddable library (although we are exploring implementing some client/server functionality along with a REST API). We kept the name because people already knew the software as TileDB and because, well, we liked it. :) 

2) The Python API currently supports only positional indexing. We assume, though, that you want this, so this is where we are going. Please stay tuned.

3) Absolutely, this is the big strength of TileDB.

4) Our C and C++ APIs do, but we haven't wrapped this for Python yet. It is trivial to do so for a NULL callback, but we need more time to make it work for arbitrary Python functions (e.g., lambdas). This is certainly on our roadmap.

5) There is a menu item API Reference in the docs. Here are the Python API docs.

6) We have already started working on this; we will publish it very soon. For now, the easiest way to start playing is

docker pull tiledb/tiledb:1.2.0
docker run -it tiledb/tiledb:1.2.0
python3
import tiledb

@mrocklin we will put together a demo very soon as well.

7) Yes. Please take a look at our key-value store functionality and a Python example. You can attach a key-value store to an array (say, with URI /path/to/array/), by storing it in URI, say, /path/to/array/meta. The C/C++ implementation is very flexible (and much more powerful than HDF5's attributes), as it allows you to store any type of keys (not just strings) and any number of attributes of arbitrary types as values (not just strings), inheriting all the benefits from TileDB arrays. Currently the Python API supports only string keys/values, but we will extend it to support equivalent functionality to C/C++ very soon.
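The attach-a-key-value-store-under-the-array-URI pattern described above can be illustrated with a toy standard-library sketch. To be clear, this is NOT the TileDB API (its key-value store is its own on-disk format); it just shows the concept of metadata living at a sub-URI of the array, similar in spirit to zarr's `.zattrs` JSON file:

```python
import json, os, tempfile

# Toy sketch of metadata attached under an array's URI -- not TileDB's
# actual API, just the concept: a key-value file inside the array dir.
array_uri = os.path.join(tempfile.mkdtemp(), "array")
os.makedirs(array_uri)

meta = {"units": "kelvin", "long_name": "sea surface temperature"}
with open(os.path.join(array_uri, "meta"), "w") as f:
    json.dump(meta, f)

# Readers recover the attributes by convention from the same sub-URI.
with open(os.path.join(array_uri, "meta")) as f:
    assert json.load(f)["units"] == "kelvin"
```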

Responses to Social Questions

1) TileDB is used at the core of GenomicsDB, which is the product of a collaboration between Intel HLS and the Broad Institute (which we started when I was still at Intel). GenomicsDB is officially a part of GATK 4.0. We are currently working with the Oak Ridge National Lab on another genomics project. We built a LiDAR data adaptor for Naval Postgraduate School (which we will release pretty soon). We have also started a POC with the JGI on yet another genomics project. We would love to see what value TileDB can bring to a use case like pangeo :).

2) TileDB is very similar to zarr in that respect. Some of the key differences: (i) TileDB is built exclusively in C and C++, which allows us to bind it to other high-level languages beyond Python (e.g., our Java bindings are coming up soon, and we are starting to work on R, which is very popular in genomics); (ii) TileDB natively supports sparse arrays with a powerful new format; and (iii) TileDB builds its own tight integration with the storage backends, rather than relying on generic libraries like s3fs, which allows us to do some nice low-level optimizations. We would be very interested to compare TileDB vs. zarr on your workloads, though.

3) Absolutely! We love what you guys do and we would benefit enormously from your feedback. Please let me know if you would like to start an email thread ({stavros,jake,tyler}@tiledb.io).

mrocklin commented 6 years ago

The Python API currently supports only positional indexing. We assume though that you want this, so this is where we are going. Please stay tuned.

I suspect that positional indexing may not be required. xarray is accustomed to dealing with data stores that don't support this (like numpy itself) and has logic to deal with it. I suspect that the desired answer is instead "anything numpy can provide, including integers, slices (of all varieties), lists, or boolean numpy arrays", or some subset of that.
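The numpy-style indexing varieties listed above can be spelled out concretely; integers and slices are what TileDB-Py supported at the time, while lists and boolean arrays are the "fancy" indexing modes still to come:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# integer and slice indexing
assert a[1, 2] == 6
assert a[0:2, ::2].shape == (2, 2)

# "fancy" indexing: integer lists and boolean arrays
assert a[[0, 2], :].shape == (2, 4)
mask = a.sum(axis=1) > 10          # boolean selection over rows
assert a[mask].shape == (2, 4)
```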

@mrocklin we will put together a demo very soon as well.

The ideal solution from our perspective would be some publicly-available data on a GCS bucket (we could probably front the cost for this) and an easy way to install TileDB, ideally with a conda package. We could also set up a smaller example to run from S3, but are unlikely to host.

Conda

I'm biased here (I am employed by Anaconda) but I strongly recommend creating a conda package if you're looking for adoption within the broader numeric Python community.

One way to do this is through conda-forge, which operates a build farm with linux/windows/mac support. This is a community group that is very friendly and supportive. As an example, here are some links for HDF5 and H5Py which I'm guessing is similar-to but more-complex than your situation:

stavrospapadopoulos commented 6 years ago

I suspect that positional indexing may not be required. xarray is accustomed to dealing with data stores that don't support this (like numpy itself) and has logic to deal with it. I suspect that the desired answer is instead "anything numpy can provide, including integers, slices (of all varieties), lists, or boolean numpy arrays", or some subset of that.

Correct. Currently we support integers and slices, whereas we are working on lists and boolean numpy arrays.

The ideal solution from our perspective would be some publicly-available data on a GCS bucket (we could probably front the cost for this) and an easy way to install TileDB, ideally with a conda package. We could also set up a smaller example to run from S3, but are unlikely to host.

As I mentioned above, TileDB provides tight integration with the backends, which means that we implement our own low-level I/O functionality using the backend's C or C++ SDK. The new release has AWS S3 support. GCS is on our roadmap, but it will require quite some work to integrate tightly with it using its SDK. Do you prefer GCS to S3? Please note that we have no bias - we will eventually support both.

As a first step, we can certainly provide you with access to an S3 bucket + a conda package.

I'm biased here (I am employed by Anaconda) but I strongly recommend creating a conda package if you're looking for adoption within the broader numeric Python community.

As mentioned above, we are almost there. :)

mrocklin commented 6 years ago

Do you prefer GCS to S3?

The software packages have no preference.

But the particular distributed deployment of http://pangeo.pydata.org/ runs on GCP, which would make it trivial for folks to try running scalability and performance tests without paying data transfer costs (though short term I don't think that these will amount to much).

On a personal/OSS note I'd like to push people to support more than the dominant cloud vendor.

mrocklin commented 6 years ago

Having data on some cloud accessible system would make it easy to repeat this experience: https://youtu.be/rSOJKbfNBNk

mrocklin commented 6 years ago

Also cc @llllllllll who has been interested in comparing array storage systems

stavrospapadopoulos commented 6 years ago

OK, all this sounds good. We will set something up on S3 so that we can get some feedback, and we can run detailed benchmarks once we have the GCS integration. Thanks for taking a look!

mrocklin commented 6 years ago

One can also run dask/xarray feasibility studies and benchmarks on a single machine. It's just a bit less compelling due to the wealth of options with a local disk.


stavrospapadopoulos commented 6 years ago

On a side note, this is our up-to-date repo, which we maintain and further develop at TileDB-Inc: https://github.com/TileDB-Inc/TileDB

Not to be confused with the one that I used to develop at Intel Labs: https://github.com/Intel-HLS/TileDB

mrocklin commented 6 years ago

Good to know. I suspect that this is due to https://github.com/Intel-HLS/TileDB/issues/72 . Want me to resubmit?


stavrospapadopoulos commented 6 years ago

No worries, fixed (with thanks).

rabernat commented 6 years ago

Thanks so much for your quick and detailed replies to our questions. At this point, it is clear to me that we definitely need to evaluate TileDB as a potential backend for xarray.

I imagine the first step will be to evaluate TileDB with dask. Perhaps @mrocklin can suggest the best approach to that.

The main practical obstacle for testing with xarray will be the need for a new xarray backend. I see two main options:

  1. Duplicate / modify the zarr backend. This involves about 500 lines of code plus another 200 of tests. (Note that the backend code is due for a refactor, pydata/xarray#1087.)
  2. Create a tiledbnetcdf project, similar to h5netcdf. This would essentially wrap TileDB with the same API as the netcdf4-python library, allowing xarray to use it as a storage backend with less (but still some) new backend code.

I think option 2 is attractive for several reasons. Geoscience is potentially one of the biggest sources of TileDB-type data (e.g. NASA is putting > 300 PB of data into the cloud over the next 5 years). NetCDF is already a familiar interface for our community, so this might significantly lower the bar for the broader community.

The pangeo project is stretched pretty thin right now in terms of developer time. So the main challenge will be to find someone to work on this. Maybe @sirrice can help us get a Columbia CS student interested.

mrocklin commented 6 years ago

Testing locally with dask is probably pretty easy, assuming that TileDB exposes a Python object that supports slicing (which I think we've already established that it does).

Relevant dask docs here: http://dask.pydata.org/en/latest/array-creation.html#numpy-slicing
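The "exposes a Python object that supports slicing" requirement is small: anything with `.shape`, `.dtype`, `.ndim`, and numpy-style `__getitem__` can be handed to `dask.array.from_array`. A minimal sketch with a toy stand-in for a TileDB-backed array (the class name and shapes are illustrative, not from any real backend):

```python
import numpy as np

class SliceableStore:
    """Toy stand-in for a TileDB-backed array. A real backend would
    translate the __getitem__ key into a tile read."""
    def __init__(self, data):
        self._data = data
        self.shape = data.shape
        self.dtype = data.dtype
        self.ndim = data.ndim

    def __getitem__(self, key):
        return self._data[key]

store = SliceableStore(np.arange(100.0).reshape(10, 10))

# With dask installed, this object becomes a lazy chunked array:
#   import dask.array as da
#   x = da.from_array(store, chunks=(5, 5))
assert store[2:4, 0:3].shape == (2, 3)
```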


rabernat commented 6 years ago

cc @kaipak, who may be interested in this discussion

kaipak commented 6 years ago

Yep, I've been tagging along for the discussion :).

stavrospapadopoulos commented 6 years ago

Before we consider integrating TileDB with xarray or building a netcdf wrapper for TileDB, we suggest we perform some simple experiments to evaluate TileDB's performance on parallel writes and reads on nd-arrays using dask.

We need to start with the following questions:

1) What are we comparing with (e.g., zarr, hdf5, netcdf)?

2) What is the format of the input data to be ingested to TileDB and the other approaches? One idea would be to generate some nd-array data and store them in some simple CSV or flat binary format. We are open to any suggestions here.

3) Where will the input and ingested data be stored? We suggest two experiments: (i) both locally, (ii) both on S3. In the future we can test with more backends.
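The flat-binary option floated in question 2 can be sketched with numpy's raw I/O; the shape and dtype below are arbitrary illustrative choices, not part of any agreed plan:

```python
import os
import tempfile
import numpy as np

# Generate a small nd-array and round-trip it through a flat binary file.
shape, dtype = (100, 50), np.float64
rng = np.random.default_rng(0)
data = rng.standard_normal(shape).astype(dtype)

path = os.path.join(tempfile.mkdtemp(), "input.dat")
data.tofile(path)   # raw bytes, no header

# Flat binary carries no metadata: shape and dtype must be known
# out of band to reconstruct the array.
loaded = np.fromfile(path, dtype=dtype).reshape(shape)
assert (loaded == data).all()
```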

mrocklin commented 6 years ago

What are we comparing with (e.g., zarr, hdf5, netcdf)?

On normal file systems NetCDF4 on HDF5 is probably the thing to beat, at least for the science domains that interest the community active on this issue tracker (which is sizable). On cloud-based systems there is no obvious contender. I wrote about cloud options and concerns here: http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud

What is the format of the input data to be ingested to TileDB and the other approaches? One idea would be to generate some nd-array data and store them in some simple CSV or flat binary format. We are open to any suggestions here.

I don't think that this matters much. This community's workloads are typically write-once read-many. Data tends to be delivered in the form of NetCDF4 files. I don't think that people care strongly about the cost-to-convert. I think that the broader point is that you have to convert, which is a negative for any format other than NetCDF.

Where will the input and ingested data be stored? We suggest two experiments: (i) both locally, (ii) both on S3. In the future we can test with more backends.

Those are both common. So too are large parallel POSIX file systems.

mrocklin commented 6 years ago

Also, to be clear, if your goal is to gain adoption then I suspect that demonstrating modest factor speed improvements is not sufficient. Disk IO isn't necessarily a pain point, and the inertia to existing data formats is very high. Instead I think you would need to go a bit more broadly and demonstrate that various workflows now become feasible where they were not feasible before. Parallel and random access into cloud-storage is one such workflow, but there are likely others.

rabernat commented 6 years ago

If you are looking for an appropriate test dataset, I would recommend something from NASA. For example: GHRSST Level 4 MUR Global Foundation Sea Surface Temperature Analysis (v4.1)

A single file (e.g. ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2018/057/20180226090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc) is 377 MB, and there is one every day for the past 16 years, ~2.2 TB total. Any reasonable subset of this will make a good test dataset.
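The quoted total is easy to sanity-check with back-of-the-envelope arithmetic (decimal units, ignoring leap days):

```python
# One ~377 MB file per day for ~16 years.
mb_per_file = 377
days = 365 * 16
total_tb = mb_per_file * days / 1e6   # MB -> TB (decimal)
assert round(total_tb, 1) == 2.2
```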

Simple yet representative things our users might want to do are:

There are of course more complicated workflows, but these are representative of the I/O bound ones.

mrocklin commented 6 years ago

Dask.array can do all of these computations. The question would be how things feel when Dask accesses data from TileDB when doing these computations.

stavrospapadopoulos commented 6 years ago

@mrocklin we are on the same page about adoption and TileDB data access using Dask. @rabernat this is very helpful, we can start with that.

jreadey commented 6 years ago

@rabernat - I like the strategy you outlined. Would it make sense to create a repo with a set of performance benchmarks based on this data collection? Ideally it would enable various I/O backends (e.g. HDF5 files, FUSE to S3, zarr, hsds, etc.) to be plugged in. This seems like a good framework to evaluate performance for these common use cases.

jhamman commented 6 years ago

@jreadey et al. - I think a formal benchmarking repository would be a great contribution to the community (see also #45 and #5). This is something I'd be happy to see move forward and would eagerly participate in getting set up.

jreadey commented 6 years ago

@rabernat - I was planning on doing some benchmarking with hdf5lib vs s3vfd vs hsds. I can start with some of the simpler codes and then ask you for help when I get stuck!

For now I'm thinking to keep it to just a single client (i.e. without dask distributed).

Is the Sea Temperature data on S3? Alternatively, we could use the DCP-30 dataset: s3://nasanex/NEX-DCP30. Any issues with that?

rabernat commented 6 years ago

Would it make sense to create a repo with a set of performance benchmarks based on this data collection?

Yes, I really like this idea. A modular system would be ideal and would save us from writing a lot of boilerplate.

Ideally it would enable various I/O backends (e.g. HDF5 files, Fuse to S3, zarr, hsds, etc.) to be pluggedin.

This is kind of what xarray already does! 😏 Unfortunately, we do not have xarray backends for some of these storage layers (e.g. TileDB), so we can't just use xarray. We will end up reinventing some of xarray's backend logic, but I suppose that's acceptable.

Does it make sense to try to integrate airspeed velocity (asv), or is that overkill?

@kaipak, a Pangeo intern here at Columbia, has some time to contribute to this benchmark project. @jreadey, it would be great if you two could work together.
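If asv were adopted, a benchmark is just a class whose `time_*`-prefixed methods asv discovers and times. A minimal sketch, with a plain numpy array standing in for the zarr / TileDB / HDF5 backends that a real suite would swap in:

```python
import numpy as np

class IOSuite:
    """Sketch of an airspeed velocity (asv) benchmark class.
    asv calls setup() before timing each time_* method."""

    def setup(self):
        # stand-in for opening a store (zarr, TileDB, HDF5, ...)
        self.data = np.zeros((1000, 1000))

    def time_full_read(self):
        self.data[:, :].sum()

    def time_strided_read(self):
        self.data[::10, ::10].sum()
```

In a real repo, `asv run` would execute these against each configured backend and track timings across commits.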

rabernat commented 6 years ago

I created a repo for this purpose under the pangeo org https://github.com/pangeo-data/storage-benchmarks

jreadey commented 6 years ago

@rabernat - great. I created a couple of issues under that repo to get the ball rolling.

rabernat commented 6 years ago

@stavrospapadopoulos: this discussion has motivated us to get a little more organized about this benchmarking process. I imagine you and your team will want to work on things internally for a while. Whenever you are ready to sync up and integrate the TileDB tests into this benchmark suite, just stop over at https://github.com/pangeo-data/storage-benchmarks.

stavrospapadopoulos commented 6 years ago

@rabernat I am very happy about this. I am following the benchmarking discussions closely. We will certainly integrate our TileDB tests when we are ready. Thanks again!

dopplershift commented 6 years ago

@WardF @DennisHeimbigner The pangeo team is interested in engaging with you guys regarding some of this storage benchmark work. You would probably also be interested in TileDB.

DennisHeimbigner commented 6 years ago

A question: do you have use cases for your sparse arrays? We (netcdf) have use cases for ragged arrays, which are a form of sparse array.

rabernat commented 6 years ago

None of the current xarray functionality makes use of sparse arrays. But there are definitely applications in development that would benefit from this. For example JiaweiZhuang/xESMF#11 is discussing how to store weight files for regridding, which are very large, very sparse arrays. So TileDB's sparse capabilities could be useful there.

@mrocklin also has a sparse array library (https://github.com/pydata/sparse/) which has various applications. Serializing sparse arrays well is probably of interest there.

mrocklin commented 6 years ago

https://github.com/pydata/sparse/issues/1

jakebolewski commented 6 years ago

To add to this, TileDB also supports "ragged array" values for every cell. You can define a variable number of values per cell whether the array is dense or sparse. I haven't wrapped this functionality in the Python API yet.

Another planned extension is to densify regions of a sparse array automatically: after a tile/region has been filled in beyond some capacity, it is densified. This helps with block-sparse applications (I believe the simulation data you posted has this structure (landmasses); at least 30% of the data was NA).

@mrocklin I just came across that today :) https://github.com/TileDB-Inc/TileDB-Py/issues/23 It doesn't quite fit because I don't want to duplicate the coordinates for every attribute, but I think maybe we can extend that library to support coordinates with more than one value.

mrocklin commented 6 years ago

I'm glad to hear it. There is a fair amount of activity around that package now. I encourage you to reach out when you have issues.

hameerabbasi commented 6 years ago

@jakebolewski You can use record arrays or np.void with pydata/sparse to store multiple values. We are also considering adding block formats.

jakebolewski commented 6 years ago

I am a little wary of using record arrays because it would involve a lot of copying (TileDB is a columnar store, every attribute is stored/compressed separately), but it would be useful to support column -> record conversion on the fly for a read query (to avoid as many copies as possible).
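The columnar-versus-record trade-off above can be made concrete with numpy structured arrays: each attribute lives in its own contiguous buffer (as in a columnar store), and building a record view copies each column once. Attribute names and values below are purely illustrative:

```python
import numpy as np

# Columnar layout: one contiguous array per attribute.
temp = np.array([280.1, 281.4, 279.9])   # attribute 1
salt = np.array([34.9, 35.1, 35.0])      # attribute 2

# Column -> record conversion: one copy per column into the
# structured buffer, which is the cost noted above.
records = np.empty(3, dtype=[("temp", "f8"), ("salt", "f8")])
records["temp"] = temp
records["salt"] = salt

assert records[1]["temp"] == 281.4
```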

hameerabbasi commented 6 years ago

If you're willing to flip your logic, row indexing can be made a lot faster than column indexing. You can just do X[0] to get a row.

jacobtomlinson commented 6 years ago

I've only just caught up on this thread. Really interesting and I'm looking forward to seeing results.

@mrocklin if you need to use an AWS based cluster to do some experimentation then feel free to use ours.

hameerabbasi commented 6 years ago

It doesn't quite fit because I don't want to duplicate the coordinates for every attribute, but I think maybe we can extend that library to support coordinates with more than one value.

You can pass along the same internal coords array to multiple constructors, it will point to the same object (provided some weak guarantees). If you want, I could build that functionality into a test in pydata/sparse. Duplication of coords won't be an issue then. We actually use this property internally for some things already.

Also, block storage support is upcoming in pydata/sparse, so (provided all your columns are the same dtype), you could use that as well.

rsignell-usgs commented 6 years ago

It appears that the conda package for TileDB is now available! https://anaconda.org/conda-forge/tiledb https://anaconda.org/conda-forge/tiledb-py

mrocklin commented 6 years ago

Woot


jakebolewski commented 6 years ago

Yep conda forge is working well for us 👍.

The current issue with the tiledb conda package is that it does not build the S3 / HDFS backends by default.

I have a PR for packaging the aws-cpp-sdk which is required for the S3 backend, but I have had no luck getting it approved as a conda forge recipe.

rsignell-usgs commented 6 years ago

@jakebolewski https://github.com/conda-forge/staged-recipes/pull/5469 looks like you didn't ping anyone for a review? I asked @ocefpaf to take a look.

jakebolewski commented 6 years ago

@rsignell-usgs thanks for the bump.

Our team has been a bit tied up with R / Java interfaces and working on some bio (genomics) use cases for clients. We hope to re-engage with the pangeo project once multi-threading support for TileDB is merged (it is actively being worked on now). We have experimented with adding a parallel VFS backend and we can get excellent throughput, so when the time comes I think we will be able to post some very good benchmark numbers.