Higher performance dataframe serialization

paddymul commented 11 months ago

Currently, serialization of dataframes is very slow, particularly on the Python side. Dataframes are serialized as a list of dicts, with odd, slow handling of NaNs.

By default Buckaroo downsamples so that only 10k rows (unlimited columns) are serialized and sent to the frontend. As a baseline, for a dataframe that was 1.17 GB in memory (5,000,000 rows total), computing stats on and displaying the first 10k rows (no downsampling, 0.2% of the total df) took 460ms. Computing stats on and displaying the first 500k rows took 891ms; the whole 5M rows took 5 seconds. Note that in all of these cases only 10k rows are serialized. From this we can tell that summary stats are generally fast, and serialization is a high constant factor.

For comparison:

df[:10_000].to_numpy() -> 4ms
df.to_numpy() -> 4 seconds
df[:10_000].to_csv() -> 42ms
df.to_csv() -> lost patience
df.to_parquet('foo.parq') -> 1.6 seconds
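A rough way to reproduce this kind of comparison is sketched below. It is not the harness used for the numbers above: the `best_of` and `time_serializers` helpers are hypothetical, the dataframe is synthetic, and `orient="records"` is an assumption chosen to match the list-of-dicts layout. Timings will vary by machine.

```python
import time

import numpy as np
import pandas as pd


def best_of(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def time_serializers(df, head=10_000):
    """Time a few pandas export paths on the first `head` rows of `df`."""
    sample = df[:head]
    return {
        "to_numpy": best_of(sample.to_numpy),
        "to_csv": best_of(sample.to_csv),
        # orient="records" is an assumption matching the list-of-dicts layout.
        "to_json(records)": best_of(lambda: sample.to_json(orient="records")),
    }


# Synthetic stand-in data; the original 1.17 GB dataframe is not available.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.normal(size=(50_000, 8)), columns=[f"c{i}" for i in range(8)]
)
results = time_serializers(demo_df)
for name, seconds in results.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Taking the minimum over a few repeats reduces noise from one-off effects such as cold caches.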

Off the top of my head, at around 300k rows, JS sorting in ag-grid becomes slow (more than 1 second).

How to speed it up?

Remove the json.loads(df.to_json(...)) step. Build the same dict object layout in memory and let ipywidgets convert that back into JSON for comms with the frontend. This step avoids some type conversion errors. Roughly a 1.5x improvement, off the top of my head.
Make the JSON a string property of the widget and call JSON.parse in the frontend.
Move to polars for df.to_json. Off the top of my head this is 2-4x faster than pandas for the same serialization.
Figure out base64 serialization, based on ES6 typed arrays. Probably the fastest.
Investigate Arrow-js for binary serialization? The downside is package size.
polars-js? The downside is package size.
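Two of the options above can be sketched with only the standard library. This is a minimal illustration, not Buckaroo's implementation: the function names are hypothetical, a dict of plain Python lists stands in for a dataframe, and the binary path assumes a little-endian client (the common case) that would read the decoded bytes into a Float64Array.

```python
import base64
import json
import math
import struct


def records_json(columns):
    """Build the list-of-dicts layout directly and dump it to one JSON string.

    `columns` maps column name -> list of values (a stand-in for a dataframe).
    NaN is mapped to None because a bare NaN token is not valid JSON.
    """
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    records = []
    for i in range(n_rows):
        row = {}
        for name in names:
            value = columns[name][i]
            if isinstance(value, float) and math.isnan(value):
                value = None
            row[name] = value
        records.append(row)
    # allow_nan=False makes json.dumps raise if a NaN slipped through.
    return json.dumps(records, allow_nan=False)


def encode_float64_column(values):
    """Pack floats as little-endian float64 bytes, then base64 them.

    NaN survives unchanged, so no per-value special casing is needed.
    """
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_float64_column(payload):
    """Inverse of encode_float64_column, used here only to check the round trip."""
    raw = base64.b64decode(payload)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))
```

With numpy-backed columns, `ndarray.tobytes()` would produce the raw bytes without the per-value packing done here.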

paddymul commented 11 months ago

Look at:

https://arrow.apache.org/docs/js/
https://github.com/vega/falcon
https://github.com/pola-rs/nodejs-polars
https://github.com/uwdata/arquero
https://github.com/kylebarron/parquet-wasm

parquet-wasm looks best suited (per the author)

https://github.com/kylebarron/parquet-wasm

paddymul commented 7 months ago

Look at this bit for ag-grid integration. https://www.ag-grid.com/react-data-grid/infinite-scrolling/
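With infinite scrolling the grid requests rows in blocks instead of receiving everything up front, so the Python side only has to serialize the block that is asked for. A minimal sketch of such a handler follows; `get_row_block` and its return shape are hypothetical, and the inclusive start / exclusive end convention is modeled on the startRow and endRow values that ag-grid's infinite row model passes to its datasource.

```python
def get_row_block(rows, start_row, end_row):
    """Return one block of rows plus the total row count.

    `start_row` is inclusive and `end_row` is exclusive, mirroring the
    startRow/endRow values an infinite-scrolling grid sends per request.
    `rows` is any sliceable sequence of already-built row dicts.
    """
    if start_row < 0 or end_row < start_row:
        raise ValueError("expected 0 <= start_row <= end_row")
    block = rows[start_row:end_row]
    # last_row tells the grid where the data ends so it can stop scrolling.
    return {"rows": block, "last_row": len(rows)}
```

A request past the end of the data simply returns a shorter (or empty) block, which is the signal the frontend would use to stop asking for more.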

paddymul / buckaroo

Higher performance dataframe serialization #76