pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

why not Parquet? #285

Closed yamad closed 5 years ago

yamad commented 6 years ago

I'm looking for ways to make our HDF5 and some related file formats more cloud-capable, and found Pangeo while researching. I'm interested in working together and contributing where I can, especially experimenting with biomedical use cases.

I had been exploring Parquet (and Arrow) as an underlying format that could work well, and was going to pull together an exploratory benchmark. But I don't want to reinvent the wheel, so I'm trying to understand better why Pangeo decided Parquet wasn't a good fit. I noticed @mrocklin's post http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud which mentions that Parquet's layout would be inefficient for chunked n-dimensional arrays.

But if, say, chunks were stored by column in a Parquet file, wouldn't that scheme be similar to the Zarr object store model where each chunk is stored as its own object? Perhaps then you derive little benefit from the Parquet data model, but you still have other benefits (e.g. existing Parquet and Arrow support).

I'm not arguing that Pangeo should move to another underlying format, I'm just hoping you can help me understand why Parquet doesn't work. Thanks!

shoyer commented 6 years ago

I think the short answer is that Parquet doesn't have any notion of n-dimensional arrays -- we would need to layer notions of chunks and shape. There are multiple ways we could imagine doing this, but in principle I think it could be pretty reasonable.

guillaumeeb commented 6 years ago

That really interests me. I am not really familiar with n-dimensional arrays, but much more with Parquet. I always assumed these were two separate worlds and that Parquet was limited to tabular data, but you seem to say that there may be a way of putting them together. If so, we could indeed benefit from the existing Parquet support and its cloud-ready underlying structure.

rabernat commented 6 years ago

If someone can figure out how to cram a netcdf file into parquet, that could be very useful. My impression from reading the spec is that @shoyer is correct: it was not designed with n-dimensional arrays in mind.


mrocklin commented 6 years ago

To build on @shoyer's comment ("we would need to layer notions of chunks and shape"): we can definitely place chunks of array data into Parquet, much as we currently place them into GCS/S3 buckets. We can also add metadata to the Parquet file describing the shape and chunking of the data. I think the challenge will come when we try to support chunk-wise random access into that data: we'll have to know exactly which row group contains each particular chunk. We can probably put this into the per-chunk metadata distributed throughout the Parquet dataset, and then ensure that each row group contains only a single chunk.
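
As a rough illustration of that write path (just a sketch, not an established convention; the column names `chunk_i`/`chunk_j`/`data` and the use of pyarrow are assumptions), each chunk could be written as one row in its own row group, with the array shape and chunk scheme stashed in the file-level metadata:

```python
# Sketch only: one chunk per row, one row per row group, array-level
# metadata kept in the Parquet key/value metadata. Column names are hypothetical.
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

shape, chunks = (100, 200), (50, 100)            # example array layout
rows = []
for i in range(shape[0] // chunks[0]):
    for j in range(shape[1] // chunks[1]):
        block = np.random.rand(*chunks)          # stand-in for real chunk data
        rows.append((i, j, block.tobytes()))

table = pa.table({
    "chunk_i": [r[0] for r in rows],
    "chunk_j": [r[1] for r in rows],
    "data":    [r[2] for r in rows],
})
table = table.replace_schema_metadata({
    b"array_shape":  json.dumps(shape).encode(),
    b"array_chunks": json.dumps(chunks).encode(),
    b"dtype":        b"float64",
})

# row_group_size=1 puts each chunk in its own row group, so per-row-group
# statistics on chunk_i / chunk_j can support chunk-wise random access
pq.write_table(table, "array.parquet", row_group_size=1)
```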

This would work fine, and would indeed let us leverage the existing momentum behind the Parquet format. It would be some work, though. If someone wanted to make progress on this issue I would encourage them to first try it out in a couple of situations, see what issues arise, iron those out, and eventually shop around a convention that could be layered on top of the Parquet format, to see if you can build consensus among a few different groups. Technically speaking I don't immediately anticipate any particular challenge, but I do think this process would be a moderate time investment to do well.

yamad commented 6 years ago

Yes, as @shoyer notes, we'd have to provide our own description of the N dimensions held in any chunk. But this doesn't seem all that different to me from defining a new standard format like the Zarr spec, which specifies chunks with metadata as a layer over objects in an object store. For Parquet, it would be a spec for how to store the metadata as a layer over the Parquet format.

I was more worried about Parquet being fundamentally poorly suited as an on-disk layout for n-dim data, but I don't see anything in the spec that looks totally unworkable from that angle. With chunks, we are jumping between bits of disk anyway; the trick will be finding a way to minimize that.


edit: just saw @mrocklin's comment after writing this.

The details in your first paragraph outline roughly the design I was thinking might work. I was under the impression that Parquet handled the column chunk locations for you, so with extra chunk/shape metadata it ought to be straightforward to find the right chunk. @guillaumeeb might have experience with that?

The implementation plan sounds good too. I agree it will take time. Hopefully I can find some. :-)

mrocklin commented 6 years ago

I was under the impression that Parquet handled the column chunk locations for you

I may be misreading this, but just to be sure: I don't think that we would be using the columns intelligently here at all. I think that each chunk would be in a separate row. My guess is that for an n dimensional array we would have something like n + 2 columns: n columns holding the chunk index of each chunk, one column for the name of the dataset (assuming we want to support multiple arrays in a single Parquet dataset), and one column for the actual binary chunk data. We would keep Parquet statistics on all of the n + 1 columns that don't hold the binary chunk data.
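
A hypothetical read path for that layout (again only a sketch, assuming pyarrow, the column names used above, and one chunk per row group so that row-group statistics can prune the scan):

```python
# Sketch only: fetch one chunk by its chunk indices using Parquet filters,
# which pyarrow can evaluate against row-group statistics.
import numpy as np
import pyarrow.parquet as pq

def read_chunk(path, i, j, chunk_shape, dtype="float64"):
    table = pq.read_table(
        path,
        columns=["data"],
        filters=[("chunk_i", "=", i), ("chunk_j", "=", j)],
    )
    raw = table.column("data")[0].as_py()        # the chunk's binary blob
    return np.frombuffer(raw, dtype=dtype).reshape(chunk_shape)

# e.g. chunk = read_chunk("array.parquet", 0, 1, (50, 100))
# a "name" column for multiple arrays in one dataset would just add one more filter term
```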

There are probably several ways to do this, though. I suspect that trying to faithfully round-trip a non-trivial dataset from a NetCDF file would force a lot of learning.

shoyer commented 6 years ago

My guess is that for an n dimensional array we would have something like n + 2 columns.

This sounds like the COO format for sparse arrays, which would be inefficient for dense multi-dimensional arrays -- though I suppose parquet's compression would probably be extremely effective for indexes.

Instead, I would represent a dask array in a single column, with each column chunk corresponding to a flattened dask array chunk. Overall shape and chunks would be stored as column or file level metadata. Within each column, chunks would either (1) be laid out in some regular order (e.g., C-contiguous, which would facilitate memory-mapping without even reading in the full metadata) or (2) would have metadata identifying their chunk-indices in the overall array.
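
A minimal sketch of that single-column layout (assumed names, pyarrow; every chunk is flattened and written as its own row group of one `values` column, in C order of chunk indices, with shape and chunk sizes in the file-level metadata):

```python
# Sketch only: flattened chunks concatenated into a single column, one row
# group per chunk, array geometry stored as file-level metadata.
import itertools
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def write_array(arr, chunks, path):
    grid = [range(0, s, c) for s, c in zip(arr.shape, chunks)]
    pieces = []
    for origin in itertools.product(*grid):      # C-contiguous chunk order
        sel = tuple(slice(o, o + c) for o, c in zip(origin, chunks))
        pieces.append(pa.array(arr[sel].ravel()))
    table = pa.table({"values": pa.concat_arrays(pieces)})
    table = table.replace_schema_metadata({
        b"shape":  json.dumps(arr.shape).encode(),
        b"chunks": json.dumps(chunks).encode(),
    })
    # one row group per chunk keeps chunk boundaries recoverable on read
    pq.write_table(table, path, row_group_size=int(np.prod(chunks)))

write_array(np.arange(24.0).reshape(4, 6), chunks=(2, 3), path="single_column.parquet")
```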

pbranson commented 6 years ago

I would like to add this (recent) blog post that some of you may be aware of as I think it bears some relevance to this discussion:

http://wesmckinney.com/blog/apache-arrow-pandas-internals/

I have frequently run into memory issues processing data with xarray where the underlying Python container is pandas, I think due to many of the issues outlined in the blog post above. So whilst there are various options to resolve and optimise the n-dimensional data layout on disk for different cloud platforms, the Python container that the data is loaded into might be a significant bottleneck, depending on its in-memory data layout.

Is that why the recent work on reading and writing Iris in xarray has been implemented in convert.py rather than as a backend? Does this circumvent using pandas by enforcing the use of dask arrays on top of NumPy n-dimensional arrays?

Instead, I would represent a dask array in a single column, with each column chunk corresponding to a flattened dask array chunk. Overall shape and chunks would be stored as column or file level metadata. Within each column, chunks would either (1) be laid out in some regular order (e.g., C-contiguous, which would facilitate memory-mapping without even reading in the full metadata) or (2) would have metadata identifying their chunk-indices in the overall array.

I agree with this approach, and if I have interpreted Wes McKinney's blog post correctly, this layout maps more naturally to the approach being applied in Arrow.


mrocklin commented 6 years ago

This sounds like the COO format for sparse arrays

Each row contains a chunk of data, not a single element. Parquet can hold nested arrays.

shoyer commented 6 years ago

Each row contains a chunk of data, not a single element. Parquet can hold nested arrays.

Nevermind, I think I was just mixing up terminology. We essentially have the same proposal.

shoyer commented 6 years ago

Does this circumvent using pandas by enforcing the use of Dask Arrays on top of Numpy n-dimensional arrays?

I’m not quite sure I follow here. This is the general approach we currently use in xarray. Pandas objects are only preserved if you use them as indexes.

In the long term for xarray, I think we’ll want to switch this up into some mechanism that layers n-dimensional arrays (with an API like NumPy) over 1d pandas or arrow arrays (to handle dtypes not well handled by numpy), in a way that is independent of whether the array is used as an index or not. But this is not the current state of things.

pbranson commented 6 years ago

(with an API like NumPy) over 1d pandas

Is that sort of what dask is providing?

The idea that I am trying to grapple with is that data access patterns for many geoscience use cases are fundamentally different from tabular data analytics, which have row-based data entities. Optimisation of on-disk and in-memory performance for columnar indexing and vectorised user-defined function evaluation only translates to a few specific geoscience use cases where scans along one dimension (labelled with some categorical index) are important (e.g. creating climatologies).

A fundamental concept in geoscience is correlation in n-dimensional space, due to the discretisation of continuous functions onto an n-dimensional grid (vector quantities in three-dimensional space and time). Most analysis is then a Taylor-series-type operation, which results in a distance-weighted kernel with data lookup that inherently crosses dimensions. If the data is modelled/observed on a dense grid, the dense array metadata provides memory-mapped access that can considerably speed up analysis algorithms, provided the library is able to use the metadata to map directly to disk. If the data is modelled on a more complicated unstructured grid, then some form of neighbour lookup is applied, e.g. the Pangeo regridding use case and the performance difference between scipy.interpolate.RegularGridInterpolator and scipy.interpolate.LinearNDInterpolator with some added physical conservation constraints.
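
A toy illustration of that access-pattern difference, comparing structured (regular grid) interpolation with unstructured neighbour-based interpolation (the grid and function here are arbitrary examples):

```python
# Sketch only: the same field sampled via a regular-grid interpolator
# (direct indexing into the dense array layout) and via an unstructured
# interpolator (which must build a triangulation / neighbour lookup first).
import numpy as np
from scipy.interpolate import RegularGridInterpolator, LinearNDInterpolator

x = np.linspace(0.0, 1.0, 50)
y = np.linspace(0.0, 1.0, 50)
xx, yy = np.meshgrid(x, y, indexing="ij")
values = np.sin(xx) * np.cos(yy)

regular = RegularGridInterpolator((x, y), values)

points = np.column_stack([xx.ravel(), yy.ravel()])
unstructured = LinearNDInterpolator(points, values.ravel())

query = np.array([[0.25, 0.5], [0.7, 0.1]])
print(regular(query), unstructured(query))
```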

Is this the crux of the problem that needs to be translated across to the distributed geoscience scenario and justifies converting from HDF files to Zarr if the goal is to perform machine learning/data analytics?

(I think I have been repeating things the Pangeo participants already know!)

jacobtomlinson commented 6 years ago

The idea that I am trying to grapple with is that data access patterns for many geoscience use cases are fundamentally different from tabular data analytics, which have row-based data entities.

This is one of the most interesting challenges we've been working on over the last few years at the Met Office. It looks like many people are starting to realize this problem and things are maturing in this area. Many projects like zarr and s3-netcdf are trying to address this.

vector quantities in three-dimensional space and time

Often but not always. For example we also work with multiple future realities (ensemble forecasting).

(I think I have been repeating things the Pangeo participants already know!)

You're probably right but you have stated them in a very concise and understandable way!

pbranson commented 6 years ago

I’m not quite sure I follow here. This is the general approach we currently use in xarray. Pandas objects are only preserved if you use them as indexes.

After just watching @mrocklin's great PyCon 2018 Pangeo presentation, I realise where my misunderstanding was: I was conflating the arrays and objects used for the in-memory index with those used by the underlying data.

These are two different things, and pandas is used to store the index, which is generated (at runtime?) based on the metadata stored in the netCDF etc. -- and @shoyer's comment suggests these indexes can then be persisted to memory/disk?

I have occasionally attempted to apply chunking when building netCDF and HDF files, and I understood that this process constructs a b-tree-like data structure that can also be persisted to speed up retrieval of chunks from the file. When the data is large and multi-dimensional, the number of leaves on the tree and the file size can both grow considerably. When working directly on in-memory C arrays, these indexes are coded algorithmically to make the analysis efficient without the index storage overhead (and the compiler does its best to optimize).

If a netCDF file has already been chunked, does xarray directly load the index from the file into a pandas DataFrame?

Does dask also help in the index construction, or only in composing and applying the computational graph on the data, which in turn heavily relies on a performant index for mapping to disk?

alimanfoo commented 6 years ago

I don't know if this is helpful, but FWIW different array storage systems use different mechanisms to organise and keep track of how chunks are stored on an underlying storage system like a file system or cloud object store. HDF5 happens to keep all chunks inside a single file, and I believe it uses a b-tree to keep track of where chunks are within that file. Zarr has a number of different storage options, most of which do not need any additional data structures (like b-trees) to keep track of chunks. How these array storage systems are tuned and configured will affect performance, particularly if you're doing any kind of distributed computing. But that is all orthogonal to the indexing that xarray is doing, which is about enabling users to query and make selections from the data [1]. And all dask is doing is composing and running task graphs over the data, so it is orthogonal to both array storage and xarray indexing. Hope that helps.

[1] http://xarray.pydata.org/en/stable/indexing.html
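
To illustrate the Zarr side of that (a Zarr v2 directory store; the path and chunking here are just an example): each chunk ends up as a separate file/object named by its chunk indices, so no b-tree or other index structure is needed to locate it.

```python
# Sketch only: write a small chunked array and list the resulting store keys.
import os
import numpy as np
import zarr

z = zarr.open("example.zarr", mode="w", shape=(100, 200), chunks=(50, 100),
              dtype="f8")
z[:] = np.random.rand(100, 200)

print(sorted(os.listdir("example.zarr")))
# ['.zarray', '0.0', '0.1', '1.0', '1.1']  -- one object per chunk plus metadata
```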


stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

elgalu commented 5 years ago

Would https://github.com/uber/petastorm help with this?

dazza-codes commented 5 years ago

Consider any N-dim array as a 2D columnar matrix with indexes for each of the N-dims, e.g.

dim1, dim2, dim3, ..., dimN, value

Arrow/Parquet should be able to optimize all facets of this, with efficient compression of each of the dimX indexes and of the value data (to the extent it is easily compressed), along with partitioned storage. (In this example there is only one value column, but it should be possible to have any number of values for vector or tensor data, and additional columns for, say, GIS lat/lon values at every "grid" point.)
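
A small sketch of that long-form layout (column names and coordinates are made up; pandas plus a Parquet engine such as pyarrow is assumed):

```python
# Sketch only: every grid point becomes one row of (dim coordinates..., value);
# Parquet's dictionary/RLE encodings compress the repetitive dim columns well.
import numpy as np
import pandas as pd

time = pd.date_range("2018-01-01", periods=4, freq="D")
lat = np.linspace(-30.0, -28.0, 3)
lon = np.linspace(114.0, 116.0, 3)
values = np.random.rand(len(time), len(lat), len(lon))

tt, la, lo = np.meshgrid(time, lat, lon, indexing="ij")
df = pd.DataFrame({
    "time": tt.ravel(),
    "lat": la.ravel(),
    "lon": lo.ravel(),
    "value": values.ravel(),
})
df.to_parquet("grid_long_form.parquet")
```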

The addition of semantic metadata is not as clear. e.g. how to store GIS-CRS data that applies to all grid points.

See also:

jameshalgren commented 2 years ago

TL;DR: Where is this conversation now? Where can we help with better cloud optimization of geodata, especially forecast model output?

When I learned about Apache Arrow, my first question was the same as the OP's -- why aren't we using this sort of intelligent queueing for earth datasets? I think the conversation has answered that question with respect specifically to Parquet/Arrow (IIUC, multidimensional tables are not yet optimal).

We're facing this problem specifically with the output data from the National Water Model (the input is difficult as well). Universally, anyone using the data at all is forced to restructure it in order to make it usable for querying and inspection. @jmccreight (and others) experimented with shredding the output into Zarr format.

As we explore this, one of the challenges specific to the NWM is the additional cardinality introduced by the forecast cycle -- the data need to preserve not only a timestamp for the valid time, but also some sense of forecast time or issue time.

The whole data community seems to be working out even the terminology for these questions: data cubes/lakes/meshes? (This is just one link from my modest attempt to survey the panoply.) I've seen a teaser here regarding the work by @cgentemann. The open data cubes concept also seems to head in specifically this direction.

So... If anyone on the thread knows where this conversation on cloud-optimized earth datasets is focused, we would be happy to join and see what we can contribute.

Tagging a few folks interested in the question: @aaraney, @hellkite500, @ZacharyWills, @whitelightning450, @groutr, @karnesh

TomAugspurger commented 2 years ago

Just an FYI, the Pangeo discourse might be a better place for this conversation these days. A few quick notes though:

jameshalgren commented 2 years ago

Thank you @TomAugspurger. This Discourse thread, with a recommendation for hosting cloud-optimized data, seemed particularly useful.

In particular, from @rabernat in that thread:

I’m leaning strongly toward the following recommendation: the primary data store on the cloud should be the original netCDF files + a kerchunk index. From this, it should be really easy to create copies / views / transformations of the data that are optimized for other use cases. For example, you could easily use rechunker to transform [a subset of] the data to support a different access pattern (e.g. timeseries analysis). Or you could put xpublish on top of the dataset to provide on-the-fly computation of derived quantities.

As I was digging around for useful nuggets, I found this help page on the ECMWF website about optimizing requests to make sure not to split tapes in a single request: https://confluence.ecmwf.int/display/WEBAPI/Retrieval+efficiency. What we're talking about here is very similar, just without the tapes (usually). Perhaps we are heading towards a joint application of intelligent request methods like those, together with wisely indexed archival methods like kerchunk.
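
For reference, a rough sketch of the netCDF-plus-kerchunk pattern quoted above (the bucket, file names and options are placeholders, not a tested recipe):

```python
# Sketch only: build a kerchunk reference index over an existing netCDF/HDF5
# object, then open it through fsspec's reference filesystem with xarray/zarr.
import json
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/nwm/output_001.nc"        # hypothetical object
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

with open("output_001.json", "w") as f:
    json.dump(refs, f)

mapper = fsspec.get_mapper("reference://", fo="output_001.json",
                           remote_protocol="s3", remote_options={"anon": True})
ds = xr.open_zarr(mapper, consolidated=False)
```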

rok commented 2 years ago

There is also https://issues.apache.org/jira/browse/ARROW-8714 for variable size tensors in case there is interest.