larray-project / larray

N-dimensional labelled arrays in Python
https://larray.readthedocs.io/
GNU General Public License v3.0

implement Feather-based format for arrays #1016

Open gdementen opened 1 year ago

gdementen commented 1 year ago

Old HDF format

>>> from larray import Session
>>> %timeit Session('demo.h5')
2.09 s

Faster/current HDF format

>>> %timeit Session('demo_fast.h5')
1.25 s

Pure Pandas

This gives an approximate lower bound on the load time we could achieve via #724. Pandas may do a bit more work than strictly necessary, but I doubt we would get below 500 ms.

>>> import pandas as pd
>>> sto = pd.HDFStore('demo_fast.h5')
>>> %timeit {k: sto[k] for k in sto.keys()}
781 ms

My working proof of concept for a format based on Feather files & PyArrow

This is about 8x as fast as the current best format and at least 3x as fast as what I think we could achieve using raw PyTables, as of now (*).
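As a quick sanity check, the ratios can be recomputed from the timings reported in this comment. The 500 ms figure is only the estimated floor for raw PyTables mentioned above, not a measurement, and the variable names are purely illustrative.

```python
# Timings reported in this comment, in seconds.
timings = {
    "fast_hdf": 1.25,     # Session('demo_fast.h5')
    "pure_pandas": 0.781, # pd.HDFStore, reading every key
    "feather_poc": 0.152, # Session('demo4.laf')
}
# Assumption: the "I doubt we would get below 500 ms" estimate, not a measurement.
pytables_floor = 0.5

speedups = {
    "fast_hdf": timings["fast_hdf"] / timings["feather_poc"],
    "pure_pandas": timings["pure_pandas"] / timings["feather_poc"],
    "pytables_floor": pytables_floor / timings["feather_poc"],
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
# fast_hdf: 8.2x
# pure_pandas: 5.1x
# pytables_floor: 3.3x
```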

>>> %timeit Session('demo4.laf')
152 ms
>>> Session('demo4.laf').equals(Session('demo_fast.h5'))
True

(*) There is an in-progress project to use a new HDF mechanism in PyTables to provide (much) faster I/O, but it is still a WIP and there is no guarantee it will be completed and integrated "soon" (the project is supposed to end by the end of the year).

https://groups.google.com/g/pytables-dev/c/8Y95Us1bJNo