Investigate performance of FixedSizedList/List types in Parquet files

sjperkins commented 1 year ago

We use FixedSizeListArrays and ListArrays to represent tensor and variably shaped data, respectively. In the Apache Arrow columnar format, these structures simply establish a view over a flat buffer of values, with additional offset arrays for each dimension in the ListArray case.

Arrow doesn't map 1-1 to Parquet and this means that reading (and writing?) these nested structures can be inefficient, compared to I/O on primitive types. Relevant issue and comments:

To be honest parquet's tag line could be "It's good enough". You can almost certainly do 2-3x better than parquet for any given workload, but you really need orders of magnitude improvements to overcome ecosystem inertia. I suspect most workloads will also mix in byte arrays and/or object storage or block compression, at which point those will easily be the tall pole in decode performance.

https://github.com/apache/arrow/issues/34510#issuecomment-1464463384

Arrow based fixed size lists of primitive values (eg. tensors) shouldn't be converted to nested parquet data, but instead they are better as BYTE_ARRAY in parquet (while I think it'd be important sadly there is no fixed size BYTE_ARRAY in the parquet spec so it'll be still slightly slower than possible). Also some fast paths for never null data - which was not marked as non-nullable when the data was saved - can be useful too, but that's all.

So if optimal performance was desired for performing parquet i/o for nested, tensor type data, it sounds as if casting between the List types and Fixed Size Binary types (pyarrow.binary/Fixed Size Primitives) might be an easy fix to solve this, if performance proves to be a problem.

sjperkins commented 1 year ago

/cc @bennahugo @orliac for informational purposes

bennahugo commented 1 year ago

I think in first order, from my standpoint, we should compare a parquet-backed dataset and queries to what can be achieved with casacore for a reasonably sizable simulated MK+ database - 8k channels, 4s dumprate with dask accumulation and filtering operators. My idea is that we start working on incorporating this into something like ratt-ru/shadems and a set of notebooks for plotting MK+ data sets so that we can more easily commission the telescope.

ratt-ru / arcae

Investigate performance of FixedSizedList/List types in Parquet files #40