paddymul / buckaroo

Buckaroo - the data wrangling assistant for pandas. Quickly explore dataframes, and run pandas commands via a GUI. Works inside the jupyter notebook.
https://buckaroo-data.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
199 stars, 8 forks

Higher performance dataframe serialization #76

Open paddymul opened 11 months ago

paddymul commented 11 months ago

Serialization of dataframes is currently very slow, particularly on the Python side. Dataframes are serialized as a list of dicts, with odd, slow handling of NaNs.
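
For illustration, the list-of-dicts layout looks roughly like the sketch below. This is a stdlib-only approximation, not Buckaroo's actual code: `columns_to_records` and the sample columns are hypothetical names, and mapping NaN to `None` (JSON `null`) is just one assumed way of handling it.

```python
import json
import math


def columns_to_records(columns):
    """Turn {"col": [values...]} into a list of row dicts, mapping NaN to None."""
    names = list(columns)
    records = []
    for row in zip(*(columns[name] for name in names)):
        record = {}
        for name, value in zip(names, row):
            # JSON has no NaN literal, so float NaN is sent as null here (assumption)
            if isinstance(value, float) and math.isnan(value):
                value = None
            record[name] = value
        records.append(record)
    return records


# Hypothetical two-row, two-column sample
sample = {"a": [1.0, float("nan")], "b": ["x", "y"]}
records = columns_to_records(sample)

# allow_nan=False makes json.dumps raise if any NaN slipped through
payload = json.dumps(records, allow_nan=False)
```

Every row repeats every column name and every value passes through a Python-level check, which is one plausible reason this layout is slow for wide or long frames.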

By default, Buckaroo downsamples so that only 10k rows (with unlimited columns) are serialized and sent to the frontend. As a baseline, I used a dataframe of 5,000,000 rows that took 1.17 GB in memory. Computing stats on and displaying the first 10k rows (no downsampling, 0.2% of the total df) took 460ms. Computing stats on and displaying the first 500k rows took 891ms, and the whole 5M rows took 5 seconds. Note that in all of these cases only 10k rows are serialized. From this we can tell that summary stats are generally fast, and that serialization contributes a high constant cost.
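
A rough back-of-the-envelope check of that conclusion, using only the three timings quoted above. The straight-line cost model (fixed cost plus a per-row stats cost) is an assumption, the "5 seconds" figure is approximate, and the fixed part lumps serialization together with any other per-call overhead.

```python
# Timings quoted above: (rows used for stats, total milliseconds).
# In every case only 10k rows are serialized.
timings = [(10_000, 460.0), (500_000, 891.0), (5_000_000, 5000.0)]


def marginal_ms_per_row(a, b):
    """Extra milliseconds per extra row between two measurements."""
    (rows_a, ms_a), (rows_b, ms_b) = a, b
    return (ms_b - ms_a) / (rows_b - rows_a)


slope_small = marginal_ms_per_row(timings[0], timings[1])  # 10k -> 500k rows
slope_large = marginal_ms_per_row(timings[1], timings[2])  # 500k -> 5M rows

# Extrapolate back to zero rows to estimate the fixed cost per display
fixed_ms = timings[0][1] - slope_small * timings[0][0]

print(f"stats cost: ~{slope_small * 1000:.2f} us/row and ~{slope_large * 1000:.2f} us/row")
print(f"estimated fixed cost: ~{fixed_ms:.0f} ms")
```

Both slopes come out just under 1 microsecond per row, while the estimated fixed cost is roughly 450ms, so almost all of the 460ms for the 10k-row case is independent of how many rows the stats are computed on.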

For comparison:

df[:10_000].to_numpy() -> 4ms
df.to_numpy() -> 4 seconds
df[:10_000].to_csv() -> 42ms
df.to_csv() -> lost patience
df.to_parquet('foo.parq') -> 1.6 seconds

Off the top of my head, at around 300k rows, JS sorting in ag-grid becomes slow (over 1 second).

How to speed it up?

  1. Remove the json.loads(df.to_json(...)) step. Build the same dict object layout in memory and let ipywidgets convert it to JSON for comms with the frontend. This also avoids some type conversion errors. Roughly a 1.5x improvement, off the top of my head.
  2. Make the JSON a string property of the widget and call JSON.parse in the frontend.
  3. Move to polars for df.to_json. Off the top of my head, this is 2-4x faster than pandas for the same serialization.
  4. Figure out base64 serialization, based on ES6 typed arrays. Probably the fastest.
  5. Investigate Arrow-js for binary serialization? The downside is package size.
  6. polars-js? The downside is package size.
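
As a rough idea of what option 4 could look like on the Python side, here is a minimal stdlib-only sketch. The helper names, the message layout (`name`, `dtype`, `data`) and the choice of little-endian float64 are all assumptions for illustration, not an existing Buckaroo or ipywidgets API.

```python
import base64
import json
import struct


def encode_float64_column(values):
    """Pack floats as little-endian float64 bytes, then base64-encode them."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_float64_column(encoded):
    """Inverse of encode_float64_column, used here only to check the round trip."""
    raw = base64.b64decode(encoded)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# Hypothetical column; NaN needs no special casing in a binary float layout
column = [1.5, 2.0, float("nan"), -3.25]
encoded = encode_float64_column(column)

# The column travels as one string instead of one JSON number per cell
message = json.dumps({"name": "price", "dtype": "float64", "data": encoded})
decoded = decode_float64_column(json.loads(message)["data"])
```

On the frontend, the decoded bytes could presumably be viewed through a `Float64Array`, which uses the platform's byte order (little-endian on typical browser platforms). With numpy available, `ndarray.tobytes()` would replace the `struct.pack` call and avoid the per-value Python overhead.
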

paddymul commented 11 months ago

Look at:

https://arrow.apache.org/docs/js/
https://github.com/vega/falcon
https://github.com/pola-rs/nodejs-polars
https://github.com/uwdata/arquero
https://github.com/kylebarron/parquet-wasm

parquet-wasm looks best suited (per the author)

https://github.com/kylebarron/parquet-wasm

paddymul commented 7 months ago

Look at this bit for ag-grid integration. https://www.ag-grid.com/react-data-grid/infinite-scrolling/
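
A rough sketch of what the Python side might need to provide for that. ag-grid's infinite row model asks its datasource for blocks of rows via `getRows(params)` with `params.startRow` and `params.endRow`, and expects the rows plus a `lastRow` value (the total row count once the end is known, otherwise -1). The function name `get_row_block` and the returned dict layout are hypothetical, and how the request would travel between frontend and kernel is left out.

```python
def get_row_block(records, start_row, end_row):
    """Return one block of rows for the half-open range [start_row, end_row)."""
    block = records[start_row:end_row]
    # Report the total row count only once the requested range reaches the end
    last_row = len(records) if end_row >= len(records) else -1
    return {"rows": block, "lastRow": last_row}


# Hypothetical 25-row table, requested in blocks of 10
records = [{"idx": i, "value": i * i} for i in range(25)]
first_block = get_row_block(records, 0, 10)
final_block = get_row_block(records, 20, 30)
```

With this approach only the visible block is serialized per request, so the per-message cost stays bounded regardless of the dataframe size.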