sjperkins opened 1 year ago
/cc @bennahugo @orliac for informational purposes
I think in first order, from my standpoint, we should compare a parquet-backed dataset and queries to what can be achieved with casacore for a reasonably sizable simulated MK+ database - 8k channels, 4s dumprate with dask accumulation and filtering operators. My idea is that we start working on incorporating this into something like ratt-ru/shadems and a set of notebooks for plotting MK+ data sets so that we can more easily commission the telescope.
We use FixedSizeListArrays and ListArrays to represent tensor and variably shaped data, respectively. In the Apache Arrow columnar format, these structures simply establish a view over a flat buffer of values, with additional offset arrays for each dimension in the ListArray case.
Arrow doesn't map one-to-one onto Parquet, which means that reading (and writing?) these nested structures can be inefficient compared to I/O on primitive types. Relevant issue and comments:
So if optimal Parquet I/O performance is desired for nested, tensor-type data, it sounds as if casting between the List types and Fixed Size Binary types (pyarrow.binary / fixed-size primitives) could be a straightforward workaround, should performance prove to be a problem.