Open jlefevre opened 5 years ago
There is already a placeholder for the parquet type here.
Then, when reading off disk, check this value and extract the Parquet data as Arrow where needed before calling a processArrow method, as noted below. However, when returning data to the client (over the wire), the data should remain Arrow, so no changes are required after extracting the data from disk and passing it to a processArrow method.
Anywhere there is a case/switch statement that uses the Arrow format for processing steps, you can likely also add the SFT_PARQUET type.
For example, here, when extracting the meta info from the fbmeta wrapper.
To convert a data blob from Parquet to Arrow here, before passing it to the processArrow method, you will need to take the meta.blob_data (char* data) and convert it first.
After transforming from the Flatbuffer to the Arrow data format, we should include an option to save the table in either Arrow or Parquet format. This should probably go here. Our tests have shown that converting the LINEITEM table from Arrow to Parquet does save disk space due to compression.
Similarly, when reading Arrow-format data from disk, we should check whether the data is stored in Parquet format and, if so, convert it to Arrow before processing. Probably here, after unpacking the blob from fbmeta, we should convert the Parquet blob to an in-memory Arrow table before calling processArrow(). We do not need a processParquet() function at this time; since the blob data will already be in memory, we should always convert it and treat it as Arrow once it is read from disk into memory.
We have already included the Parquet libs in our install_deps.sh alongside Arrow, so the required Parquet API (if needed) is already present and can be included similarly to Arrow, as here.