rabernat closed this issue 4 years ago.
@rabernat - I've seen this before too and it does look quite promising. I get the impression from reading the documentation that the Python API is not quite ready to go yet. In theory though, this could be yet another backend for xarray.
Hello folks. We just released a new version of TileDB, which has optimized support for AWS S3 and a Python API that uses NumPy (there is a lot of info and examples in our docs: https://docs.tiledb.io/). We would love to get some feedback from you.
cc @jakebolewski, @tdenniston
Hi @stavrospapadopoulos ! Thanks for chiming in.
Do you happen to have a demo TileDB deployment somewhere publicly available that we could hit to try things out?
Thanks @stavrospapadopoulos for sharing this information! TileDB looks like a promising possible storage layer for xarray / pangeo.
I have read through the documentation and have a couple of questions that I hope you can clarify whenever you might have the time:
(like the dense_array object in this example)?

Oh, and one more technical question:
@jakebolewski, @tdenniston will chime in (probably next week), but here is my take.
1) TileDB is an embeddable library (written in C and C++, more similar to HDF5 than zarr) that comes with a Python wrapper. You should truly think of TileDB as an alternative to HDF5, with (among some other important things) the extra capability of storing sparse arrays in addition to dense and also supporting a wide variety of backends in addition to posix filesystems (currently HDFS and S3, more in the future).
- About “DB” in the name: when we started working on TileDB at MIT, we thought that we would be an alternative to SciDB and follow its client/server architecture. We eventually decided to make TileDB a lightweight embeddable library (although we are exploring implementing some client/server functionality along with a REST API). We kept the name because people already knew the software as TileDB and because, well, we liked it. :)
2) The Python API currently supports only positional indexing. We assume though that you want this, so this is where we are going. Please stay tuned.
3) Absolutely, this is the big strength of TileDB.
First, TileDB supports both concurrent reads and writes (even interleaved) without any locking. Please check our concurrency and consistency model. For any operation where locking is required (e.g., creating an array), we implement our own locking (via mutexes, as well as file locks for the backends that support them) on the C++ side.
Second, note that TileDB implements its own tight integration with the various backends via C/C++ SDKs (e.g., libhdfs for HDFS and AWS C++ SDK for S3), and its own LRU cache. We do not rely on another package (e.g, s3fs like zarr). This enables us to do some pretty cool stuff (performance-wise) under the covers (e.g., we use our own async IO internally, we are currently implementing another parallel functionality, etc). You may also want to understand how our updates work, which is ideal for append-only backends like S3. So, TileDB is designed and continuously optimized for extreme parallelism for distributed backends, even beyond posix.
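The lock-free concurrent-read model described above can be illustrated with a toy sketch — plain NumPy and threads, nothing TileDB-specific — where several readers consume disjoint slices of shared data without any locking:

```python
import threading
import numpy as np

# Toy illustration: concurrent readers over disjoint slices of a shared
# array need no locking, because no reader mutates shared state.
data = np.arange(1_000_000, dtype=np.float64)
results = [None] * 4

def read_chunk(i, n_chunks=4):
    chunk = len(data) // n_chunks
    results[i] = data[i * chunk:(i + 1) * chunk].sum()

threads = [threading.Thread(target=read_chunk, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(results)  # equals data.sum(), with no reader coordination
```

TileDB's actual reader/writer coordination happens in C++ and is far more involved; this only shows why read-only access composes without locks.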
4) Our C and C++ APIs do, but we haven't wrapped this for Python yet. It is trivial to do so for a NULL callback, but we need more time to make it work for arbitrary Python functions (e.g., lambdas). This is certainly on our roadmap.
5) There is a menu item API Reference in the docs. Here are the Python API docs.
6) We have already started working on this, we will publish it very soon. For now, the easiest way to start playing is
docker pull tiledb/tiledb:1.2.0
docker run -it tiledb/tiledb:1.2.0
python3
import tiledb
@mrocklin we will put together a demo very soon as well.
7) Yes. Please take a look at our key-value store functionality and a Python example. You can attach a key-value store to an array (say, with URI /path/to/array/) by storing it at a URI like /path/to/array/meta. The C/C++ implementation is very flexible (and much more powerful than HDF5's attributes), as it allows you to store any type of keys (not just strings) and any number of attributes of arbitrary types as values (not just strings), inheriting all the benefits of TileDB arrays. Currently the Python API supports only string keys/values, but we will extend it to match the C/C++ functionality very soon.
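Conceptually, attaching metadata at a sibling URI looks like the following plain-Python sketch. This is only an illustration of the layout idea (the file names and the JSON encoding are placeholders, not the TileDB key-value API):

```python
import json
import os
import tempfile

# Sketch: array data lives at <root>/array/, metadata at <root>/array/meta.
# TileDB's key-value store handles this natively with typed keys/values.
root = tempfile.mkdtemp()
array_uri = os.path.join(root, "array")
os.makedirs(array_uri)
meta_uri = os.path.join(array_uri, "meta")

meta = {"units": "kelvin", "long_name": "sea surface temperature"}
with open(meta_uri, "w") as f:
    json.dump(meta, f)

with open(meta_uri) as f:
    loaded = json.load(f)
```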
1) TileDB is used at the core of GenomicsDB, which is the product of a collaboration between Intel HLS and the Broad Institute (which we started when I was still at Intel). GenomicsDB is officially a part of GATK 4.0. We are currently working with the Oak Ridge National Lab on another genomics project. We built a LiDAR data adaptor for Naval Postgraduate School (which we will release pretty soon). We have also started a POC with the JGI on yet another genomics project. We would love to see what value TileDB can bring to a use case like pangeo :).
2) TileDB is very similar to zarr in that respect. Some of the key differences: (i) TileDB is built exclusively in C and C++, which allows us to bind it to other high-level languages beyond Python (e.g., our Java bindings are coming up soon, and we are starting work on R, which is very popular in genomics); (ii) TileDB natively supports sparse arrays with a powerful new format; and (iii) TileDB builds its own tight integration with the storage backends, rather than relying on generic libraries like s3fs, which allows us to do some nice low-level optimizations. We would be very interested to compare TileDB vs. zarr on your workloads, though.
3) Absolutely! We love what you guys do and we would benefit enormously from your feedback. Please let me know if you would like to start an email thread ({stavros,jake,tyler}@tiledb.io).
The Python API currently supports only positional indexing. We assume though that you want this, so this is where we are going. Please stay tuned.
I suspect that positional indexing may not be required. Xarray is accustomed to dealing with data stores that don't support this (like numpy itself) and has logic to deal with it. I suspect that the desired answer is instead "anything numpy can provide, including integers, slices (of all varieties), lists, or boolean numpy arrays" or some subset of that.
@mrocklin we will put together a demo very soon as well.
The ideal solution from our perspective would be some publicly-available data on a GCS bucket (we could probably front the cost for this) and an easy way to install TileDB, ideally with a conda package. We could also set up a smaller example to run from S3, but are unlikely to host.
I'm biased here (I am employed by Anaconda) but I strongly recommend creating a conda package if you're looking for adoption within the broader numeric Python community.
One way to do this is through conda-forge, which operates a build farm with linux/windows/mac support. This is a community group that is very friendly and supportive. As an example, here are some links for HDF5 and H5Py which I'm guessing is similar-to but more-complex than your situation:
I suspect that positional indexing may not be required. Xarray is accustomed to dealing with data stores that don't support this (like numpy itself) and has logic to deal with it. I suspect that the desired answer is instead "anything numpy can provide, including integers, slices (of all varieties), lists, or boolean numpy arrays" or some subset of that.
Correct. Currently we support integers and slices, whereas we are working on lists and boolean numpy arrays.
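Concretely, the "anything numpy can provide" menu breaks down into four kinds of index, all shown here on a plain NumPy array:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

row = a[1]            # integer index -> one row
block = a[0:2, ::2]   # slices, including steps
picked = a[[0, 2]]    # list of integers (fancy indexing)
masked = a[a > 5]     # boolean numpy array (mask)
```

Per the reply above, the first two are supported by the TileDB Python API today; the last two are the ones still in progress.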
The ideal solution from our perspective would be some publicly-available data on a GCS bucket (we could probably front the cost for this) and an easy way to install TileDB, ideally with a conda package. We could also set up a smaller example to run from S3, but are unlikely to host.
As I mentioned above, TileDB provides tight integration with the backends, which means that we implement our own low-level IO functionality using the backend's C or C++ SDK. The new release has AWS S3 support. GCS is in our roadmap, but it will require quite some work to tightly integrate with it using its SDK. Do you prefer GCS to S3? Please note that we have no bias - we will eventually support both.
As a first step, we can certainly provide you with access to an S3 bucket + a conda package.
I'm biased here (I am employed by Anaconda) but I strongly recommend creating a conda package if you're looking for adoption within the broader numeric Python community.
As mentioned above, we are almost there. :)
Do you prefer GCS to S3?
The software packages have no preference.
But the particular distributed deployment of http://pangeo.pydata.org/ runs on GCP, which would make it trivial for folks to try running scalability and performance tests without paying data transfer costs (though short term I don't think that these will amount to much).
On a personal/OSS note I'd like to push people to support more than the dominant cloud vendor.
Having data on some cloud accessible system would make it easy to repeat this experience: https://youtu.be/rSOJKbfNBNk
Also cc @llllllllll who has been interested in comparing array storage systems
OK, all this sounds good. We will set something up on S3 so that we can get some feedback, and we can run detailed benchmarks once we have the GCS integration. Thanks for taking a look!
One can also run dask/xarray feasibility studies and benchmarks on a single machine. It's just a bit less compelling due to the wealth of options with a local disk.
On a side note, this is our up-to-date repo, which we maintain and further develop at TileDB-Inc: https://github.com/TileDB-Inc/TileDB
Not to be confused with the one that I used to develop at Intel Labs: https://github.com/Intel-HLS/TileDB
Good to know. I suspect that this is due to https://github.com/Intel-HLS/TileDB/issues/72 . Want me to resubmit?
No worries, fixed (with thanks).
Thanks so much for your quick and detailed replies to our questions. At this point, it is clear to me that we definitely need to evaluate TileDB as a potential backend for xarray.
I imagine the first step will be to evaluate TileDB with dask. Perhaps @mrocklin can suggest the best approach to that.
The main practical obstacle for testing with xarray will be the need for a new xarray backend. I see two main options:
1) Duplicate / modify the zarr backend. This involves about 500 lines of code plus another 200 of tests. (Note that the backend code is due for a refactor, pydata/xarray#1087.)
2) Create a tiledbnetcdf project, similar to h5netcdf. This would essentially wrap TileDB with the same API as the netcdf4-python library, allowing xarray to use it as a storage backend with less (but still some) new backend code.

I think option 2 is attractive for several reasons. Geosciences are potentially one of the biggest sources of TileDB-type data (e.g. NASA is putting > 300 PB of data into the cloud over the next 5 years). NetCDF is already a familiar interface for our community, so this might significantly lower the bar for the broader community.
The pangeo project is stretched pretty thin right now in terms of developer time. So the main challenge will be to find someone to work on this. Maybe @sirrice can help us get a Columbia CS student interested.
Testing locally with dask is probably pretty easy, assuming that TileDB exposes a Python object that supports slicing (which I think we've already established that it does).
Relevant dask docs here: http://dask.pydata.org/en/latest/array-creation.html#numpy-slicing
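The protocol dask needs for this is minimal: an object exposing `shape`, `dtype`, and numpy-style `__getitem__`. A toy store (the class name is purely illustrative) that dask could wrap:

```python
import numpy as np

class ToyStore:
    """Minimal slicable store: just shape, dtype, and __getitem__."""

    def __init__(self, data):
        self._data = data
        self.shape = data.shape
        self.dtype = data.dtype

    def __getitem__(self, key):
        # A real TileDB-backed store would read only the requested region
        # from disk or object storage here.
        return self._data[key]

store = ToyStore(np.arange(100).reshape(10, 10))
# dask would wrap this as: da.from_array(store, chunks=(5, 5))
chunk = store[0:5, 0:5]
```

Any object satisfying this protocol — which, per the discussion above, TileDB's arrays do — can be handed to `dask.array.from_array` with a chunk specification.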
cc @kaipak, who may be interested in this discussion
Yep, I've been tagging along for the discussion :).
Before we consider integrating TileDB with xarray or building a netcdf wrapper for TileDB, we suggest we perform some simple experiments to evaluate TileDB's performance on parallel writes and reads on nd-arrays using dask.
We need to start with the following questions:
1) What are we comparing with (e.g., zarr, hdf5, netcdf)?
2) What is the format of the input data to be ingested to TileDB and the other approaches? One idea would be to generate some nd-array data and store them in some simple CSV or flat binary format. We are open to any suggestions here.
3) Where will the input and ingested data be stored? We suggest two experiments: (i) both locally, (ii) both on S3. In the future we can test with more backends.
What are we comparing with (e.g., zarr, hdf5, netcdf)?
On normal file systems NetCDF4 on HDF5 is probably the thing to beat, at least for the science domains that interest the community active on this issue tracker (which is sizable). On cloud-based systems there is no obvious contender. I wrote about cloud options and concerns here: http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud
What is the format of the input data to be ingested to TileDB and the other approaches? One idea would be to generate some nd-array data and store them in some simple CSV or flat binary format. We are open to any suggestions here.
I don't think that this matters much. This community's workloads are typically write-once read-many. Data tends to be delivered in the form of NetCDF4 files. I don't think that people care strongly about the cost-to-convert. I think that the broader point is that you have to convert, which is a negative for any format other than NetCDF.
Where will the input and ingested data be stored? We suggest two experiments: (i) both locally, (ii) both on S3. In the future we can test with more backends.
Those are both common. So too are large parallel POSIX file systems.
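For the local-disk case, a first-cut timing harness could be as simple as the numpy-only sketch below. The array size and the .npy format are placeholders for whichever formats end up being compared; this measures nothing TileDB-specific:

```python
import os
import tempfile
import time

import numpy as np

def time_roundtrip(shape=(256, 256, 64)):
    """Time a write-then-read of a random nd-array to local disk (.npy)."""
    data = np.random.rand(*shape)
    path = os.path.join(tempfile.mkdtemp(), "bench.npy")

    t0 = time.perf_counter()
    np.save(path, data)
    write_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    loaded = np.load(path)
    read_s = time.perf_counter() - t0

    assert np.array_equal(loaded, data)  # sanity check the roundtrip
    return write_s, read_s
```

Swapping the save/load pair for zarr, HDF5/netCDF, or TileDB calls would give a like-for-like local comparison before moving to S3.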
Also, to be clear, if your goal is to gain adoption then I suspect that demonstrating modest factor speed improvements is not sufficient. Disk IO isn't necessarily a pain point, and the inertia to existing data formats is very high. Instead I think you would need to go a bit more broadly and demonstrate that various workflows now become feasible where they were not feasible before. Parallel and random access into cloud-storage is one such workflow, but there are likely others.
If you are looking for an appropriate test dataset, I would recommend something from NASA. For example: GHRSST Level 4 MUR Global Foundation Sea Surface Temperature Analysis (v4.1)
A single file (e.g. ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2018/057/20180226090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc) is 377 MB, and there is one every day for the past 16 years, ~2.2 TB total. Any reasonable subset of this will make a good test dataset.
Simple yet representative things our users might want to do are:
There are of course more complicated workflows, but these are representative of the I/O bound ones.
Dask.array can do all of these computations. The question would be how things feel when Dask accesses data from TileDB when doing these computations.
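Whatever the exact workflow list, typical I/O-bound reductions on a dataset like the MUR SST collection look like the following (a small random stand-in array is used here purely for illustration; the real arrays are orders of magnitude larger and chunked across files):

```python
import numpy as np

# Toy (time, lat, lon) stand-in for the SST data.
sst = np.random.rand(365, 18, 36)

time_mean = sst.mean(axis=0)        # mean map over the time axis
area_mean = sst.mean(axis=(1, 2))   # area-mean timeseries
anomaly = sst - time_mean           # broadcast: deviation from time mean
```

With dask.array the same three lines run chunk-by-chunk against the store, which is exactly where the storage layer's read performance shows up.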
@mrocklin we are on the same page about adoption and TileDB data access using Dask. @rabernat this is very helpful, we can start with that.
@rabernat - I like the strategy you outlined. Would it make sense to create a repo with a set of performance benchmarks based on this data collection? Ideally it would enable various I/O backends (e.g. HDF5 files, FUSE to S3, zarr, hsds, etc.) to be plugged in. This would seem a good framework for evaluating performance on these common use cases.
@jreadey et al. I think a formal benchmarking repository would be a great contribution to the community (see also #45 and #5). This is something I'd be happy to see move forward and would eagerly participate in getting set up.
@rabernat - I was planning on doing some benchmarking with hdf5lib vs s3vfd vs hsds. I can start with some of the simpler codes and then ask you for help when I get stuck!
For now I'm thinking to keep it to just a single client (i.e. without dask distributed).
Is the Sea Temperature data on S3? Alternatively, we could use the DCP-30 dataset: s3://nasanex/NEX-DCP30. Any issues with that?
Would it make sense to create a repo with a set of performance benchmarks based on this data collection?
Yes, I really like this idea. A modular system would be ideal and would save us from writing a lot of boilerplate.
Ideally it would enable various I/O backends (e.g. HDF5 files, FUSE to S3, zarr, hsds, etc.) to be plugged in.
This is kind of what xarray already does! 😏 Unfortunately, we do not have direct xarray support for some of these storage layers (e.g. TileDB), so we can't just use xarray. We will end up reinventing some of xarray's backend logic, but I suppose that's acceptable.
Does it make sense to try to integrate airspeed velocity (asv), or is that overkill?
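For reference, an airspeed velocity benchmark is just a class following the `setup` / `time_*` naming convention, so the integration overhead is small. The class and method names below are illustrative, not from any existing suite:

```python
import numpy as np

class IOSuite:
    """Illustrative asv-style benchmark: asv times each time_* method."""

    def setup(self):
        # asv calls setup() before timing; a real suite would open a store here.
        self.data = np.random.rand(100, 100)

    def time_full_read(self):
        self.data[:, :].sum()

    def time_strided_read(self):
        self.data[::2, ::2].sum()

# Plain Python: the suite also runs outside asv for quick sanity checks.
suite = IOSuite()
suite.setup()
suite.time_full_read()
```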
@kaipak, a Pangeo intern here at Columbia, has some time to contribute to this benchmark project. @jreadey, it would be great if you two could work together.
I created a repo for this purpose under the pangeo org https://github.com/pangeo-data/storage-benchmarks
@rabernat - great. I created a couple of issues under that repo to get the ball rolling.
@stavrospapadopoulos: this discussion has motivated us to get a little more organized about this benchmarking process. I imagine you and your team will want to work on things internally for a while. Whenever you are ready to sync up and integrate the TileDB tests into this benchmark suite, just stop over at https://github.com/pangeo-data/storage-benchmarks.
@rabernat I am very happy for this. I am following closely the benchmarking discussions. We will certainly integrate our TileDB tests when we are ready. Thanks again!
@WardF @DennisHeimbigner The pangeo team are interested to engage with you guys regarding some of this storage benchmark work. You would probably also be interested in TileDB.
A question: do you have use cases for your sparse arrays? We (netcdf) have use cases for ragged arrays, which are a form of sparse array.
None of the current xarray functionality makes use of sparse arrays. But there are definitely applications in development that would benefit from this. For example JiaweiZhuang/xESMF#11 is discussing how to store weight files for regridding, which are very large, very sparse arrays. So TileDB's sparse capabilities could be useful there.
@mrocklin also has a sparse array library (https://github.com/pydata/sparse/) which has various applications. Serializing sparse arrays well is probably of interest there.
To add to this, TileDB also supports "ragged array" values for every cell: you can define a variable number of values per cell whether the array is dense or sparse. I haven't wrapped this functionality in the Python API yet.
Another planned extension is to densify regions of a sparse array automatically: after a tile/region has been filled beyond some capacity, it is densified. This helps with block-sparse applications (I believe the simulation data you posted has this structure (landmasses); at least 30% of the data was NA).
@mrocklin I just came across that today :) https://github.com/TileDB-Inc/TileDB-Py/issues/23 It doesn't quite fit because I don't want to duplicate the coordinates for every attribute, but I think maybe we can extend that library to support coordinates with more than one value.
I'm glad to hear it. There is a fair amount of activity around that package now. I encourage you to reach out when you have issues.
@jakebolewski You can use rec arrays or np.void with pydata/sparse, which can be used to store multiple values. We are also considering adding block formats.
I am a little wary of using record arrays because it would involve a lot of copying (TileDB is a columnar store, every attribute is stored/compressed separately), but it would be useful to support column -> record conversion on the fly for a read query (to avoid as many copies as possible).
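The column-to-record conversion under discussion is what NumPy record arrays do, and the copying concern is real: gathering separately-stored columns into records materializes a new interleaved buffer. A minimal sketch:

```python
import numpy as np

# Two columns stored separately (columnar layout), gathered into records.
col_a = np.array([1.0, 2.0, 3.0])
col_b = np.array([10, 20, 30])

# fromarrays interleaves the columns into one record buffer -- a copy.
records = np.rec.fromarrays([col_a, col_b], names="a,b")
first_row = records[0]  # row-wise access across both columns
```

For a columnar store like TileDB, doing this conversion lazily at read time (only for the requested rows) keeps the copies bounded.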
If you're willing to flip your logic, row indexing can be made a lot faster than column indexing. You can just do X[0] to get a row.
I've only just caught up on this thread. Really interesting and I'm looking forward to seeing results.
@mrocklin if you need to use an AWS based cluster to do some experimentation then feel free to use ours.
It doesn't quite fit because I don't want to duplicate the coordinates for every attribute, but I think maybe we can extend that library to support coordinates with more than one value.
You can pass along the same internal coords array to multiple constructors; it will point to the same object (provided some weak guarantees). If you want, I could build that functionality into a test in pydata/sparse. Duplication of coords won't be an issue then. We actually use this property internally for some things already.
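The shared-coords idea comes down to object identity: several sparse "attributes" reference one coordinates array instead of each carrying a copy. A plain-NumPy sketch of a COO-style layout (not the pydata/sparse API itself):

```python
import numpy as np

# One coords array shared by two attribute columns of a COO-style layout.
coords = np.array([[0, 1, 3],
                   [2, 0, 1]])          # (ndim, nnz) nonzero positions
attr_a = np.array([1.0, 2.0, 3.0])
attr_b = np.array([9.0, 8.0, 7.0])

sparse_a = {"coords": coords, "data": attr_a}
sparse_b = {"coords": coords, "data": attr_b}  # same object, no duplication
shared = sparse_a["coords"] is sparse_b["coords"]
```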
Also, block storage support is upcoming in pydata/sparse, so (provided all your columns are the same dtype), you could use that as well.
It appears that the conda package for TileDB is now available! https://anaconda.org/conda-forge/tiledb https://anaconda.org/conda-forge/tiledb-py
Woot
Yep conda forge is working well for us 👍.
The current issue with the tiledb conda package is that it does not build the S3 / HDFS backends by default.
I have a PR for packaging the aws-cpp-sdk which is required for the S3 backend, but I have had no luck getting it approved as a conda forge recipe.
@jakebolewski https://github.com/conda-forge/staged-recipes/pull/5469 looks like you didn't ping anyone for a review? I asked @ocefpaf to take a look.
@rsignell-usgs thanks for the bump.
Our team has been a bit tied up with R / Java interfaces and working on some bio (genomics) use-cases for clients. We hope to re-engage with the pangeo project once multi-threading support for TileDB is merged (it is actively being worked on now). We have experimented with adding a parallel VFS backend and we can get excellent throughput so when the time comes I think we will be able to post some very good benchmark numbers.
As long as we are discussing cloud storage formats, it seems like we should be looking at TileDB: https://www.tiledb.io/ https://github.com/TileDB-Inc/TileDB
If it lives up to the promised goals, TileDB could really be the solution to our storage problems.
cc @kaipak, @sirrice