
Dataformat to work with extensible tabular datasets #2299


schwingkopf commented 1 year ago

Hello,

I'm developing an application to log and live-plot tabular data, potentially larger than RAM, on a local machine. I'm looking for help/suggestions on choosing an appropriate data storage format.

I want to use vaex as my backend data-processing workhorse and would like to use a data format that allows fully parallel read-while-write: one process records the data while another process (or several) runs vaex analysis in parallel. In general, new data is always appended to existing data and, once appended, stays immutable. The main motivation for multiprocess support is to keep a stable logging bandwidth/performance in an isolated process that is independent of the analysis/plotting tasks. I expect to log data rates as high as 1-10 Mrows/s, e.g. from USB-connected sensors.

I considered the following data format/protocol options:

Numpy memory mapped arrays:

Does vaex fully support numpy memory-mapped arrays? How does performance compare to HDF5? (A minimal sketch of what I have in mind follows this section.)

Storage:

Analysis:

Advantage:
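A minimal sketch of what I have in mind, assuming vaex treats an `np.memmap` like any other numpy array passed to `vaex.from_arrays` (file name, dtype, sizes and column name are placeholders):

```python
# Minimal sketch (names and sizes are placeholders): expose a numpy
# memory-mapped array to vaex.
import numpy as np
import vaex

n_rows = 1_000_000

# Writer side: pre-allocate a file-backed array and fill it incrementally.
writer = np.memmap("signal.dat", dtype=np.float64, mode="w+", shape=(n_rows,))
writer[:1000] = np.random.random(1000)
writer.flush()

# Reader side: map the same file read-only and hand it to vaex.
reader = np.memmap("signal.dat", dtype=np.float64, mode="r", shape=(n_rows,))
df = vaex.from_arrays(signal=reader)   # vaex sees a regular numpy array
print(df.signal.mean())                # computation streams over the mmap
```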

HDF5

I don't think there is really an elegant way to do this without SWMR (single-writer/multiple-reader mode). Does vaex support SWMR?

In general, how would I partially load data from HDF5 (sliced to the valid region) into vaex without race conditions? Is slicing on a "per file" basis even possible?

Operating without SWMR, I could imagine writing data to file batches (similar to the protocol above), keeping an in-memory copy of the batches whose files have not been completely written, and working only on the in-memory data for those batches. I think that is doable, but it requires quite a lot of bookkeeping.
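To make the batching idea concrete, here is a rough sketch; the batch size, file naming, and the handling of the in-memory tail are all assumptions on my part:

```python
# Rough sketch of the file-batch protocol without SWMR (all names are made up).
import glob
import numpy as np
import vaex

def write_batch(batch_index, chunk):
    """Writer process: flush a completed chunk into its own immutable HDF5 file.

    In practice the writer would export to a temporary name and rename it
    atomically once complete, so readers never see a half-written file.
    """
    vaex.from_arrays(signal=chunk).export_hdf5(f"log_batch_{batch_index:06d}.hdf5")

def open_completed_batches():
    """Reader process: combine only the batches that have been fully written."""
    files = sorted(glob.glob("log_batch_*.hdf5"))
    return vaex.open_many(files) if files else None

# The writer keeps the current, incomplete batch in memory; readers
# periodically re-run open_completed_batches() to pick up new files.
write_batch(0, np.random.random(1_000_000))
df = open_completed_batches()
print(len(df))
```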

Arrow

As far as I understand, Arrow files cannot be appended to, so the data stream can only be flushed to disk once it is fully written. This prevents parallel read-while-write, and it also means data cannot be "live persisted" to disk but only in discrete batches/files, so it is lost if the program crashes partway through.

Lightning Memory-Mapped Database (LMDB)

I guess there is no support for this in vaex?

Thank you in advance for any suggestions, hints, or answers to my embedded questions (my current favourite is numpy memory-mapped arrays, if they're supported).

maartenbreddels commented 1 year ago

Hi,

I think this use case fits vaex well. We don't have built-in support for SWMR, but it shouldn't be too hard to build this on top of vaex.

Might be useful to read https://github.com/vaexio/vaex/issues/2078

Does vaex fully support numpy memory mapped arrays?

The HDF5 data is exposed as regular numpy arrays, but they are backed by a read-only or read/write buffer that is memory-mapped to the HDF5 file.

Does vaex support SWMR?

https://github.com/vaexio/vaex/issues/2078 gives an idea of how to implement it. If you are ok with the readers polling, it should be simple to do.
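Roughly, a polling reader could look like this (a sketch only, loosely following the idea in #2078; the column name, file names, and the side channel for the valid row count are placeholders, not something vaex provides):

```python
# Hedged sketch: the writer appends rows into a pre-allocated HDF5-backed
# column and separately publishes how many rows are valid; the reader polls
# that counter and restricts the DataFrame to it.
import time
import vaex

def read_valid_row_count():
    # Side channel maintained by the writer (an assumption, not a vaex feature).
    with open("valid_rows.txt") as f:
        return int(f.read())

df_full = vaex.open("log.hdf5")   # memory-mapped, cheap to keep open
while True:
    n = read_valid_row_count()
    df = df_full[:n]              # only look at rows that are fully written
    print(n, df.signal.mean())    # analysis on the valid slice
    time.sleep(1.0)
```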

Is slicing on a "per file" basis even possible?

https://github.com/vaexio/vaex/issues/2078 gives an example of this.

This will indeed be more challenging to do with Arrow. HDF5 contiguous arrays are just beautifully simple.

I'm pretty sure I did not answer all your questions, but once you've digested #2078, feel free to ask more.

Regards,

Maarten