themrmax opened this issue 7 years ago
I can't reproduce. Please provide a link to an example notebook.
@gnestor did you try my example code? Maybe you just have a faster computer than me?
Yes, I did. Initially, it was throttled by the default data rate limit (which will ship with 5.0):
After overriding it, I was able to fetch the data. It was slow to respond to user events initially (because it's a lot of data and a lot of DOM nodes), but then I was able to change focus with no issue (I tried the `j` and `k` keyboard shortcuts too):
There isn't much we can do to resolve the notebook being slow when a ton of data is rendered, except for implementing some form of infinite scrolling that can paginate through data and ideally fetch new data from the kernel vs. keeping it in memory in the browser.
@gnestor yes, this makes sense. Although it might be a lot of work, I think this is definitely a feature we should have; I've renamed the issue accordingly.
@Carreau @takluyver Has there been any discussion around infinite scrolling for output areas? I'm not sure how much (if any) performance improvements we would get without kernel interaction (e.g. fetching new pages of data from the kernel).
@themrmax If kernel interaction is required, then this feature would not be implemented in notebook but rather in ipywidgets or another "infinite scrolling output area" extension that can interact with ipython and ideally other kernels. This could be a good experiment for a kernel-connected mimerender extension.
> Has there been any discussion around infinite scrolling for output areas?
Yes, and more generally about the notebook itself. The issue is with widgets, or other elements that attach events to the DOM: you can't skip rendering them (or remove them from the DOM), as some of their events would then fail to register. If it's a "safe" mime type then that would be fine.
Another question would be how to "recycle" things that are high on the page. That could be done by replacing elements with fixed-height divs (measuring them just after they scroll far enough out of sight). Annoying if the window resizes, but doable.
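A minimal sketch of that trick, assuming a browser `IntersectionObserver` and a "safe" mime output (the helper name `recycleWhenHidden` is made up, widget outputs would still be a problem, and the cached height would need re-measuring on window resize):

```ts
// Sketch: recycle off-screen outputs by swapping their children out for a
// fixed-height empty shell, then restoring the same nodes on scroll-back.
function recycleWhenHidden(output: HTMLElement): IntersectionObserver {
  let detached: ChildNode[] | null = null;

  const observer = new IntersectionObserver(
    (entries) => {
      for (const entry of entries) {
        if (!entry.isIntersecting && detached === null) {
          // Measure just before the content leaves the viewport, then pin
          // the container to that height so the page doesn't reflow.
          output.style.height = `${output.offsetHeight}px`;
          detached = Array.from(output.childNodes);
          output.replaceChildren();
        } else if (entry.isIntersecting && detached !== null) {
          // Re-attach the *same* nodes when scrolling back in, so any
          // listeners bound directly to them survive the round trip.
          output.replaceChildren(...detached);
          output.style.height = '';
          detached = null;
        }
      }
    },
    { rootMargin: '200px' } // only recycle well outside the viewport
  );

  observer.observe(output);
  return observer;
}
```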
In the end it would be easier if the model lived server-side and we did lazy loading. But that's far away.
Beyond infinite scrolling, we could also just collapse large outputs and ask "would you like to display everything?".
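As a sketch of that collapse-and-ask idea (the `renderRows` callback and `ROW_LIMIT` threshold are placeholders, not existing notebook API):

```ts
// Above some row threshold, render only a preview plus an opt-in button.
const ROW_LIMIT = 1000; // illustrative cutoff

function renderCollapsible(
  container: HTMLElement,
  rows: string[][],
  renderRows: (el: HTMLElement, r: string[][]) => void
): void {
  if (rows.length <= ROW_LIMIT) {
    renderRows(container, rows);
    return;
  }
  // Show a truncated preview by default.
  renderRows(container, rows.slice(0, ROW_LIMIT));

  const button = document.createElement('button');
  button.textContent = `Would you like to display all ${rows.length} rows?`;
  button.onclick = () => {
    // Only pay the full rendering cost once the user opts in.
    container.replaceChildren();
    renderRows(container, rows);
  };
  container.appendChild(button);
}
```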
@gnestor I really like the idea of the JupyterLab extension; I think this is key to achieving feature parity with RStudio/Spyder etc. @Carreau do you think it would make sense to use something like https://github.com/NeXTs/Clusterize.js? It looks like they are using the fixed-height divs trick you suggest, and maybe it would be a fairly lightweight integration? EDIT: I'm going to try to update jupyterlab_table to use Clusterize; I'll open a PR there if I can get it to work.
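For reference, a rough sketch of what the Clusterize.js wiring might look like, using the options from its README (the element ids and the `toRowHtml` helper are assumptions about the host markup):

```ts
// Assumes the `clusterize.js` npm package; the library can also be loaded
// as a global <script>.
import Clusterize from 'clusterize.js';

// Convert one data row to the HTML string Clusterize expects.
const toRowHtml = (row: string[]): string =>
  `<tr>${row.map((cell) => `<td>${cell}</td>`).join('')}</tr>`;

function renderClusterized(data: string[][]): void {
  // Clusterize keeps only the visible "cluster" of rows in the DOM and pads
  // the rest with fixed-height spacer rows -- the fixed-height-div trick,
  // automated.
  new Clusterize({
    rows: data.map(toRowHtml),
    scrollId: 'scrollArea',  // outer element with overflow: auto
    contentId: 'contentArea' // tbody that receives the visible cluster
  });
}
```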
We are having a related discussion in JupyterLab: https://github.com/jupyterlab/jupyterlab/issues/1587
@blink1073 Thanks!
@themrmax I started implementing react-virtualized in jupyterlab_table! It will be good to compare the performance of both 👍
@themrmax Here is an initial implementation using react-virtualized's table: https://github.com/gnestor/jupyterlab_table/tree/react-virtualized
I also have an example using its Grid, which allows for horizontal scrolling but actually uses 2 different grids (one for the header) and attempts to keep their x scroll positions in sync (not very well).
They're both pretty performant but could be optimized:
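On the header-sync problem: react-virtualized also ships a `ScrollSync` component that broadcasts one grid's scroll offsets to its siblings, which might be more reliable than syncing the two grids manually. A sketch (the column widths, heights, and renderers are placeholders):

```tsx
import * as React from 'react';
import { Grid, GridCellRenderer, ScrollSync } from 'react-virtualized';

interface Props {
  columnCount: number;
  rowCount: number;
  cellRenderer: GridCellRenderer;   // renders one body cell
  headerRenderer: GridCellRenderer; // renders one header cell
}

// Header and body are separate Grids; ScrollSync pushes the body's
// scrollLeft into the header so their x positions can't drift apart.
// (A real version would also hide the header grid's own scrollbar.)
const SyncedTable = ({ columnCount, rowCount, cellRenderer, headerRenderer }: Props) => (
  <ScrollSync>
    {({ onScroll, scrollLeft }) => (
      <div>
        <Grid
          cellRenderer={headerRenderer}
          columnCount={columnCount}
          columnWidth={150}
          height={40}
          rowCount={1}
          rowHeight={40}
          scrollLeft={scrollLeft} // follow the body's horizontal scroll
          width={600}
        />
        <Grid
          cellRenderer={cellRenderer}
          columnCount={columnCount}
          columnWidth={150}
          height={400}
          onScroll={onScroll} // the scroll source
          rowCount={rowCount}
          rowHeight={30}
          width={600}
        />
      </div>
    )}
  </ScrollSync>
);
```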
Nice!
Excellent!
@gnestor very nice! I've gotten basically the same thing working with Clusterize (I haven't done the headers, but I think I would need a similar trick to yours): https://github.com/themrmax/jupyterlab_table/tree/clusterize

A big problem I can see with both of our solutions is that when the data gets large (i.e. over a few thousand rows), it takes a very long time to load the component. As far as I can tell it's not a problem with the frameworks; the demo on https://clusterize.js.org/ instantly loads 500K rows when they're generated by JavaScript. Is it a limitation of us loading the data into the browser as a single JSON payload, and is there a way we could work around this?
EDIT: Just noticed the `inferSchema` function is very slow; I can get a pretty good speedup by just running it over the first few rows (10? 100?).
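Something like this sampling approach, sketched here (the `ColumnType` union and the widening heuristic are illustrative, not jupyterlab_table's actual `inferSchema`):

```ts
type ColumnType = 'number' | 'boolean' | 'string';

// Infer a column -> type map from only the first `sampleSize` rows, so the
// cost is independent of the total row count.
function inferSchemaSampled(
  rows: Record<string, unknown>[],
  sampleSize = 10
): Record<string, ColumnType> {
  const schema: Record<string, ColumnType> = {};
  for (const row of rows.slice(0, sampleSize)) {
    for (const [key, value] of Object.entries(row)) {
      const type: ColumnType =
        typeof value === 'number' ? 'number' :
        typeof value === 'boolean' ? 'boolean' : 'string';
      // On conflicting samples, widen to 'string' instead of scanning more rows.
      schema[key] = schema[key] && schema[key] !== type ? 'string' : type;
    }
  }
  return schema;
}
```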
@themrmax Yes. There is observable latency for 10,000+ cells for me. There is no latency when rendering as a pandas HTML table, and this is partly because the pandas HTML is rendered kernel-side. A few thoughts:

react-virtualized supports lazily loading rows via a callback:

```js
function loadMoreRows({ startIndex: number, stopIndex: number }): Promise
```

and theoretically this callback would request a new range of rows from the kernel (e.g. `df[startIndex:stopIndex]`). The only way that I can imagine this working across kernels (and beyond pandas dataframes) is if `loadMoreRows` is defined by the user (or the display function) and provided to jupyterlab_table via JSON:

```json
{
  "resources": [],
  "metadata": {
    "startIndex": 0,
    "stopIndex": 1000,
    "loadMoreRows": "df[startIndex:stopIndex]"
  }
}
```

Assuming that the extension can communicate with the kernel (which I know is possible but don't know how to implement), the extension could parse this `loadMoreRows` string and execute `df[1000:2000]` on the kernel, and either return the rows to the `loadMoreRows` callback or asynchronously update the display using the new `update_display` feature. This is pretty hacky... @rgbkrk Any thoughts about how to accomplish infinite scrolling across kernels?
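For what it's worth, a sketch of what that kernel round-trip might look like from a JupyterLab extension using `@jupyterlab/services` (the JSON wrapping, the naive string templating, and the pandas assumption are all part of the hack described above):

```ts
import { Kernel, KernelMessage } from '@jupyterlab/services';

// Execute the user-supplied range expression on the kernel and resolve with
// the parsed rows. Error handling is omitted for brevity.
function loadMoreRows(
  kernel: Kernel.IKernelConnection,
  expression: string, // e.g. "df[startIndex:stopIndex]" from the metadata
  startIndex: number,
  stopIndex: number
): Promise<unknown> {
  // Naive templating of the indices into the expression (first match only).
  const code = expression
    .replace('startIndex', String(startIndex))
    .replace('stopIndex', String(stopIndex));

  return new Promise((resolve, reject) => {
    // Wrap the expression so the kernel prints JSON we can parse; this
    // assumes an IPython kernel and a pandas DataFrame.
    const future = kernel.requestExecute({
      code: `import json; print(json.dumps((${code}).to_dict('records')))`
    });
    future.onIOPub = (msg: KernelMessage.IIOPubMessage) => {
      if (KernelMessage.isStreamMsg(msg)) {
        resolve(JSON.parse(msg.content.text));
      }
    };
    future.done.catch(reject);
  });
}
```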
`inferSchema` will attempt to infer a schema if one is not provided in the payload. I just optimized that by taking a sample of 10 rows vs. iterating through all of them, which improved performance by about another order of magnitude (10,000x10 took about 8s before, and now 100,000x10 takes about the same 👍). pandas will soon support JSON Table, in which case `inferSchema` will no longer be necessary for pandas DataFrames.

> Any thoughts about how to accomplish infinite [paging] [of tables] across kernels?
I'd certainly like to see it.
I'm hopeful that we can make some simplified mechanics with the `setIn`-based models approach, coupled with the VDom stuff.
Sometimes I would like to view a large dataset inside the output of a cell. If the cell contains a lot of data, it's slow to change focus to/from the cell. For example, if I run a cell with the following, once it's completed it takes around 2 seconds to focus in or out of the cell (tested on Chrome and Firefox on a MacBook Air). This is especially frustrating when I'm trying to quickly navigate around the notebook with "J" and "K".
A related issue: if I try to display an equivalently large HTML table (i.e. the display output from a pandas dataframe), the whole browser window seizes up. Maybe this is a limitation of using HTML to display tables; however, I think we need to have a way to browse/view large tabular datasets to avoid the need for an external dependency on something like Excel.
EDIT: Just realized that if I toggle the output of the cell, there is no problem navigating across the cell, so maybe it's not such a big problem.