Open paddymul opened 11 months ago
Look at:

- https://arrow.apache.org/docs/js/
- https://github.com/vega/falcon
- https://github.com/pola-rs/nodejs-polars
- https://github.com/uwdata/arquero
- https://github.com/kylebarron/parquet-wasm

parquet-wasm looks best suited (per the author).
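If the Parquet/Arrow route is taken, a rough sketch of what the Python side could hand to a parquet-wasm (or Arrow JS) frontend might look like the following, assuming pandas with the pyarrow engine is available; how the bytes travel over the widget comm is left open here:

```python
import io

import pandas as pd


def df_to_parquet_bytes(df: pd.DataFrame) -> bytes:
    """Serialize a DataFrame to an in-memory Parquet payload.

    The bytes could then be decoded in the browser with parquet-wasm or
    Arrow JS (illustrative only, not Buckaroo's current code path).
    """
    buf = io.BytesIO()
    df.to_parquet(buf, engine="pyarrow", index=False)
    return buf.getvalue()
```

The appeal over the list-of-dicts path is that column types (including NaN handling) stay inside the Arrow/Parquet layer instead of being rebuilt as Python objects.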
Look at this bit for ag-grid integration: https://www.ag-grid.com/react-data-grid/infinite-scrolling/
Currently, serialization of dataframes is very slow, on the Python side in particular. Dataframes are serialized as a list of dicts, with odd, slow handling of `NaN`s. By default Buckaroo downsamples so that only 10k rows (unlimited columns) are serialized and sent to the frontend.

For a baseline dataframe that was 1.17 GB in memory (5,000,000 rows total):

- computing stats on and displaying the first 10k rows (no downsampling, 0.2% of the total df) took 460 ms
- computing stats on and displaying the first 500k rows took 891 ms
- computing stats on and displaying the whole 5M rows took 5 seconds

Note that in all of these cases only 10k rows are serialized. From this we can tell that summary stats are generally fast, and serialization is a high constant factor.
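For reference, a hypothetical harness for timing the list-of-dicts path (the 30 float64 columns are just a guess to land near 1.2 GB at 5M rows; the real benchmark setup isn't shown in this issue):

```python
import time

import numpy as np
import pandas as pd

# ~5,000,000 rows x 30 float64 columns ~= 1.2 GB in memory (assumed shape).
df = pd.DataFrame(np.random.randn(5_000_000, 30))

start = time.perf_counter()
# A list-of-dicts payload for the first 10k rows, similar in spirit to the
# current serialization path (the actual Buckaroo code may differ).
payload = df[:10_000].to_dict(orient="records")
print(f"list-of-dicts for 10k rows: {time.perf_counter() - start:.3f}s")
```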
For comparison:

```python
df[:10_000].to_numpy()     # -> 4 ms
df.to_numpy()              # -> 4 seconds
df[:10_000].to_csv()       # -> 42 ms
df.to_csv()                # -> lost patience
df.to_parquet('foo.parq')  # -> 1.6 seconds
```

Off the top of my head, at around 300k rows JS sorting in ag-grid becomes slow (>1 second).
How to speed it up?
- Add a `json.loads(df.to_json())` step: build the same dict object layout in memory and let ipywidgets convert it back into JSON for comms with the frontend. This step avoids some type conversion errors; roughly a 1.5x improvement, off the top of my head.
- `JSON.parse` the `df.to_json()` output in the frontend; off the top of my head this is 2-4x faster for the same serialization than pandas.
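A minimal sketch of those two options, assuming a records-oriented payload and a hypothetical 10k-row view (not necessarily Buckaroo's actual code path):

```python
import json

import pandas as pd

df = pd.DataFrame({"a": [1.0, float("nan"), 3.0], "b": ["x", "y", "z"]})
view = df[:10_000]  # whatever slice/downsample of the frame is being shown

# Option 1: to_json handles NaN/dates in C, json.loads rebuilds the dict
# layout, and ipywidgets re-serializes that structure over the comm.
payload_dicts = json.loads(view.to_json(orient="records"))

# Option 2: skip the Python-side round trip entirely; ship the JSON string
# and let the frontend call JSON.parse on it.
payload_str = view.to_json(orient="records")
```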