
Dataformat to work with extensible tabular datasets #2299


schwingkopf commented 1 year ago

Hello,

I'm developing an application to log and live-plot tabular data, potentially larger than RAM, on a local machine. I'm looking for help/suggestions on choosing an appropriate data storage format.

I want to use vaex as my backend data-processing workhorse and would like to use a data format that allows fully parallel read-while-write: one process records the data while another process (or several) runs vaex analysis in parallel. In general, new data is always appended to existing data and, once appended, stays immutable. The main motivation for multiprocess support is to keep a stable logging bandwidth/performance in an isolated process that is independent of the analysis/plotting tasks. I expect to log data rates as high as 1-10 Mrows/s, e.g. from USB-connected sensors.

I considered the following data format/protocol options:

Numpy memory mapped arrays:

Does vaex fully support numpy memory-mapped arrays? How does performance compare to HDF5? (A minimal sketch of what I have in mind follows this section.)

Storage:

Analysis:

Advantage:
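A minimal sketch of what I have in mind, assuming vaex treats an `np.memmap` like any other numpy array passed to `vaex.from_arrays` (file name, dtype, sizes and column name are placeholders):

```python
# Minimal sketch (names and sizes are placeholders): expose a numpy
# memory-mapped array to vaex.
import numpy as np
import vaex

n_rows = 1_000_000

# Writer side: pre-allocate a file-backed array and fill it incrementally.
writer = np.memmap("signal.dat", dtype=np.float64, mode="w+", shape=(n_rows,))
writer[:1000] = np.random.random(1000)
writer.flush()

# Reader side: map the same file read-only and hand it to vaex.
reader = np.memmap("signal.dat", dtype=np.float64, mode="r", shape=(n_rows,))
df = vaex.from_arrays(signal=reader)   # vaex sees a regular numpy array
print(df.signal.mean())                # computation streams over the mmap
```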

HDF5

I don't think there is really an elegant way to do this without SWMR (single-writer/multiple-reader mode). Does vaex support SWMR?

In general, how would I partially load data from HDF5 (sliced to the valid region) into vaex without race conditions? Is slicing on a "per file" basis even possible?

Operating without SWMR, I could imagine writing data to file batches (similar to the protocol above), keeping an in-memory copy of the batches whose files have not been completely written, and working only on the in-memory data for those batches. I think that is doable, but it requires quite a lot of bookkeeping.
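To make the batching idea concrete, here is a rough sketch; the batch size, file naming, and the handling of the in-memory tail are all assumptions on my part:

```python
# Rough sketch of the file-batch protocol without SWMR (all names are made up).
import glob
import numpy as np
import vaex

def write_batch(batch_index, chunk):
    """Writer process: flush a completed chunk into its own immutable HDF5 file.

    In practice the writer would export to a temporary name and rename it
    atomically once complete, so readers never see a half-written file.
    """
    vaex.from_arrays(signal=chunk).export_hdf5(f"log_batch_{batch_index:06d}.hdf5")

def open_completed_batches():
    """Reader process: combine only the batches that have been fully written."""
    files = sorted(glob.glob("log_batch_*.hdf5"))
    return vaex.open_many(files) if files else None

# The writer keeps the current, incomplete batch in memory; readers
# periodically re-run open_completed_batches() to pick up new files.
write_batch(0, np.random.random(1_000_000))
df = open_completed_batches()
print(len(df))
```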

Arrow

As far as I understand, Arrow files cannot be appended to, so the data stream can only be flushed to disk once it is fully written. This prevents parallel read-while-write, and it also means data cannot be "live persisted" to disk but only in discrete batches/files, so it is lost if the program crashes partway through.

Lightning Memory-Mapped Database (LMDB)

I guess there is no support for this in vaex?

Thank you in advance for any suggestions, hints, or answers to my embedded questions (my current favourite is numpy memory-mapped arrays, if they're supported).

maartenbreddels commented 1 year ago

Hi,

I think this use case fits vaex well. We don't have built-in support for SWMR, but it shouldn't be too hard to build this on top of vaex.

Might be useful to read https://github.com/vaexio/vaex/issues/2078

Does vaex fully support numpy memory mapped arrays?

The HDF5 data is exposed as regular numpy arrays, but they are backed by a read-only or read/write buffer that is memory-mapped to the HDF5 file.

Does vaex support SWMR?

https://github.com/vaexio/vaex/issues/2078 gives an idea of how to implement it. If you are ok with the readers polling, it should be simple to do.
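Roughly, a polling reader could look like this (a sketch only, loosely following the idea in #2078; the column name, file names, and the side channel for the valid row count are placeholders, not something vaex provides):

```python
# Hedged sketch: the writer appends rows into a pre-allocated HDF5-backed
# column and separately publishes how many rows are valid; the reader polls
# that counter and restricts the DataFrame to it.
import time
import vaex

def read_valid_row_count():
    # Side channel maintained by the writer (an assumption, not a vaex feature).
    with open("valid_rows.txt") as f:
        return int(f.read())

df_full = vaex.open("log.hdf5")   # memory-mapped, cheap to keep open
while True:
    n = read_valid_row_count()
    df = df_full[:n]              # only look at rows that are fully written
    print(n, df.signal.mean())    # analysis on the valid slice
    time.sleep(1.0)
```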

Is slicing on a "per file" basis even possible?

https://github.com/vaexio/vaex/issues/2078 gives an example of this.

This will indeed be more challenging to do with Arrow. HDF5 contiguous arrays are just beautifully simple.

I'm pretty sure I did not answer all your questions, but once you've digested #2078, feel free to ask more.

Regards,

Maarten