Improve data serialization

martinRenou commented 9 months ago

Make ipydatagrid more performant, achieving two things:

Using binary buffers in data serialization, which greatly improves communication between the back-end and the front-end (e.g. a million-cell datagrid used to take 6 seconds to show up with the old approach, using a local Jupyter server on my laptop; it now takes half a second).
Reducing the memory footprint in the front-end by improving the data structure.
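
To illustrate why a binary buffer can be cheaper than a text encoding, here is a minimal standard-library sketch. It is not the code in this PR: the helper names (`column_to_json`, `column_to_buffer`, `buffer_to_column`) are hypothetical, and it assumes a column of float64 values in native byte order.

```python
import json
from array import array


def column_to_json(values):
    """Encode a numeric column as JSON text (the text-based path)."""
    return json.dumps(values).encode("utf-8")


def column_to_buffer(values):
    """Encode a numeric column as a raw float64 buffer (8 bytes per cell)."""
    return array("d", values).tobytes()


def buffer_to_column(buffer):
    """Decode a raw float64 buffer back into a list of Python floats."""
    column = array("d")
    column.frombytes(buffer)
    return column.tolist()


# Hypothetical sample column, used only for illustration.
values = [i * 0.5 for i in range(1000)]
as_json = column_to_json(values)
as_buffer = column_to_buffer(values)

print(len(as_json), "bytes as JSON")
print(len(as_buffer), "bytes as a binary buffer")
```

The size difference is only part of the picture: the receiving side can wrap a binary buffer in a typed array without parsing any text, which is likely where most of the time is saved.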

What's remaining to make the PR ready to review:

[x] update JS test code
[x] update Python test code
[x] cell editing seems broken, needs data serialization in the front-end and deserialization in the back-end
[x] filtering transform is not completely done yet
[x] Fix support for heterogeneous data types in columns (do not use binary buffers in that case)
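
The fallback for heterogeneous columns could look roughly like the sketch below. This is an assumption about the general shape of such a check, not the PR's implementation: `is_homogeneous_numeric`, `serialize_column`, and the `encoding`/`dtype`/`data` keys are all hypothetical names.

```python
from array import array


def is_homogeneous_numeric(values):
    """True if every value is an int, or every value is a float.

    Booleans, None, strings and mixed int/float columns are rejected.
    """
    if not values:
        return False
    first_type = type(values[0])
    if first_type not in (int, float):
        return False
    return all(type(v) is first_type for v in values)


def serialize_column(values):
    """Use a binary buffer when the column allows it, plain JSON data otherwise."""
    if is_homogeneous_numeric(values):
        typecode = "q" if type(values[0]) is int else "d"
        try:
            data = array(typecode, values).tobytes()
        except OverflowError:
            # Integers that do not fit in 64 bits fall back to the JSON path.
            return {"encoding": "json", "data": list(values)}
        return {"encoding": "buffer", "dtype": typecode, "data": data}
    return {"encoding": "json", "data": list(values)}


print(serialize_column([1, 2, 3])["encoding"])
print(serialize_column([1, "a", None])["encoding"])
```

For simplicity, this sketch sends a mixed int/float column down the JSON path; a real implementation might promote it to float64 instead.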

In follow-up PRs, the following items should be resolved:

use binary buffer for _visible_rows attribute
use binary buffer for schema and fields attribute?
improve the transforms/view logic to avoid copying the original data. Views should be a way to "view" the original data; they should avoid copying it wherever possible.
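
One way to picture a copy-free view is a list of row indices layered over the shared columns. The sketch below is a hypothetical illustration in Python (the actual view logic lives in the front-end), and `RowView` and its methods are made-up names.

```python
class RowView:
    """A filtered/sorted "view": it stores row indices, never cell data."""

    def __init__(self, columns, row_indices=None):
        self._columns = columns  # shared reference to the original data
        n_rows = len(next(iter(columns.values()))) if columns else 0
        self._rows = list(range(n_rows)) if row_indices is None else row_indices

    def __len__(self):
        return len(self._rows)

    def cell(self, row, column):
        """Look up a cell through the index indirection."""
        return self._columns[column][self._rows[row]]

    def filter(self, column, predicate):
        """Return a new view keeping only rows whose value passes the predicate."""
        kept = [i for i in self._rows if predicate(self._columns[column][i])]
        return RowView(self._columns, kept)

    def sort(self, column, descending=False):
        """Return a new view with rows reordered by the given column."""
        ordered = sorted(
            self._rows,
            key=lambda i: self._columns[column][i],
            reverse=descending,
        )
        return RowView(self._columns, ordered)


# Hypothetical sample data, used only for illustration.
data = {"a": [3, 1, 2], "b": ["x", "y", "z"]}
view = RowView(data).filter("a", lambda value: value > 1).sort("a")
print([view.cell(r, "b") for r in range(len(view))])
```

Each transform allocates only a new index list, so the cost grows with the number of rows rather than the number of cells.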

ianthomas23 commented 8 months ago

I have tried this locally and I see the same dramatic speed improvements. It would be good to continue with this as it will be a good basis for experiments in filtering and sorting on the backend that I'd like to look at.

paddymul commented 8 months ago

I have been working just this week to better understand binary serialization from pandas through ipywidgets to js. I think I'm going to use arrow-js. I'm hoping to publish a very rough early repo later today.

I'm currently fleshing out a simple ipywidgets library that lets me prototype simple examples. Because it's a small library, it should also make it easier to collaborate with other people.

Trevor Manz and Kyle Barron have been doing work in this space too.

I'd love to collaborate with others on this.

paddymul commented 8 months ago

FWIW I just pushed the first commits to the serialization playground df_cereal https://github.com/paddymul/df_cereal

I have examples of arrow-js serialization working entirely in JS. I currently can't get the Python side to send bytes or base64 to JS.
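
For reference, the two Python-side payload forms mentioned here (raw bytes and base64 text) can be sketched with the standard library alone. This covers only the encoding step, not how ipywidgets transports the result, and the helper names are hypothetical.

```python
import base64
from array import array


def encode_raw(values):
    """Pack a float64 column into raw bytes (native byte order)."""
    return array("d", values).tobytes()


def encode_base64(values):
    """Pack a float64 column, then wrap it as ASCII base64 text."""
    return base64.b64encode(encode_raw(values)).decode("ascii")


def decode_base64(text):
    """Reverse encode_base64: base64 text back to a list of floats."""
    column = array("d")
    column.frombytes(base64.b64decode(text))
    return column.tolist()


# Hypothetical sample column, used only for illustration.
values = [0.0, 1.5, -2.25]
print(len(encode_raw(values)), "raw bytes")
print(len(encode_base64(values)), "base64 characters")
```

Base64 is safe to embed in a JSON message but inflates the payload by about a third, which is one reason to prefer true binary buffers when the transport supports them.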

Benchmarks and more docs coming soon.

BTW I looked at what bqplot is doing. I suspect arrow based serialization will be much faster since it doesn't deal with json at all.

martinRenou commented 8 months ago

Thank you for reaching out @paddymul. This looks interesting!

> will be much faster

I'm a tiny bit skeptical about this. The JSON message bqplot sends is minimal in the end.

I feel like we should go ahead with this PR once it's passing all tests. After that, I'm 💯 for continuing the discussion about a common place for better binary serialization that we can use across widgets. I don't like depending on bqplot for this, but it was already a dependency for some reason (probably a legacy dependency left over from removed code), so it's convenient to just use it for now.

jupyter-widgets / ipydatagrid

Improve data serialization #483