jupyter / notebook

Jupyter Interactive Notebook
https://jupyter-notebook.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Better scrolling for large tabular/text output #2049

Open · themrmax opened 7 years ago

themrmax commented 7 years ago

Sometimes I would like to view a large dataset inside the output of a cell. If the cell contains a lot of data, it's slow to change focus to or from the cell. For example, if I run a cell with the following, once it has completed it takes around 2 seconds to focus in or out of the cell (tested on Chrome and Firefox on a MacBook Air). This is especially frustrating when I'm trying to quickly navigate around the notebook with "J" and "K".

%%bash
# Stream the full CSV to stdout so it all lands in the cell's output area
curl http://en.openei.org/doe-opendata/dataset/3e440383-a146-49b5-978a-e699334d2e1f/resource/3f00482e-8ea0-4b48-8243-a212b6322e74/download/iouzipcodes2011.csv

A related issue: if I try to display an equivalently large HTML table (i.e. the display output of a pandas DataFrame), the whole browser window seizes up. Maybe this is a limitation of using HTML to display tables; however, I think we need a way to browse/view large tabular datasets to avoid needing an external dependency like Excel.

EDIT: Just realized that if I toggle the output of the cell, there is no problem navigating across it, so maybe it's not such a big problem.

gnestor commented 7 years ago

I can't reproduce. Please provide a link to an example notebook.

themrmax commented 7 years ago

@gnestor Did you try my example code? Maybe you just have a faster computer than me?

gnestor commented 7 years ago

Yes, I did. Initially, it was throttled by the default data rate limit (which will ship with 5.0):

[screenshot]

After overriding it (via the iopub_data_rate_limit config option), I was able to fetch the data. It was slow to respond to user events initially (because it's a lot of data and a lot of DOM nodes), but then I was able to change focus with no issue (I tried the j and k keyboard shortcuts too):

[screenshot]

There isn't much we can do about the notebook being slow when a ton of data is rendered, except implement some form of infinite scrolling that can paginate through the data and ideally fetch new pages from the kernel rather than keeping everything in memory in the browser.

themrmax commented 7 years ago

@gnestor Yes, this makes sense. Although it might be a lot of work, I think this is definitely a feature we should have; I've renamed the issue accordingly.

gnestor commented 7 years ago

@Carreau @takluyver Has there been any discussion around infinite scrolling for output areas? I'm not sure how much (if any) performance improvement we would get without kernel interaction (e.g. fetching new pages of data from the kernel).

@themrmax If kernel interaction is required, then this feature would not be implemented in notebook but rather in ipywidgets or another "infinite scrolling output area" extension that can interact with IPython and ideally other kernels. This could be a good experiment for a kernel-connected mimerender extension.

Carreau commented 7 years ago

Has there been any discussion around infinite scrolling for output areas?

Yes, and more generally about the notebook itself. The issue is with widgets, or other elements that attach events to the DOM: you can't skip displaying them (or remove them), as their events would then fail to register. If it's a "safe" mime type, that would be fine.

Another question is how to "recycle" things that are high up on the page. That could be done by replacing elements with fixed-height divs (measuring them just after they scroll far enough out of sight). Annoying when the window resizes, but doable.
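
A minimal sketch of this fixed-height-placeholder idea, assuming a plain DOM approach; all names and thresholds here are illustrative, not the notebook's actual code:

// Sketch: swap far-off-screen output nodes for fixed-height placeholders.
// Only safe for "static" mime outputs; widget outputs would lose their
// event listeners, which is exactly the caveat described above.
const OFFSCREEN_MARGIN = 2000; // px beyond the viewport before recycling

function recycleOffscreen(outputs: HTMLElement[]): void {
  for (const el of outputs) {
    const rect = el.getBoundingClientRect();
    const farAbove = rect.bottom < -OFFSCREEN_MARGIN;
    const farBelow = rect.top > window.innerHeight + OFFSCREEN_MARGIN;
    if ((farAbove || farBelow) && !el.dataset.recycled) {
      // Measure first, then collapse to an empty div of the same height
      // so the scrollbar geometry doesn't change.
      el.dataset.recycledHeight = String(rect.height);
      el.dataset.recycled = 'true';
      el.style.height = `${rect.height}px`;
      el.textContent = ''; // drop the heavy DOM subtree
    }
  }
}

Restoring an element when it scrolls back into range would be the symmetric operation: re-render the output and clear the stored height.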

In the end, it would be easier if the model lived on the server side and we did lazy loading, but that's far away.

Beyond infinite scrolling, we could just collapse anything too big and show a "would you like to display everything?" prompt (see the sketch below).
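
A tiny sketch of that collapse-and-ask fallback; the threshold and helper names are hypothetical:

// Sketch: render large output collapsed behind an opt-in button.
const MAX_EAGER_ROWS = 1000; // illustrative cutoff

function renderMaybeCollapsed(container: HTMLElement, rows: string[]): void {
  if (rows.length <= MAX_EAGER_ROWS) {
    container.textContent = rows.join('\n');
    return;
  }
  const button = document.createElement('button');
  button.textContent = `Output has ${rows.length} rows. Display everything?`;
  button.onclick = () => {
    // Replaces the button with the full output on demand.
    container.textContent = rows.join('\n');
  };
  container.textContent = '';
  container.appendChild(button);
}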

themrmax commented 7 years ago

@gnestor I really like the idea of the JupyterLab extension; I think this is key to achieving feature parity with RStudio/Spyder etc. @Carreau do you think it would make sense to use something like https://github.com/NeXTs/Clusterize.js? It looks like they are using the fixed-height divs trick you suggest, and maybe it would be a fairly lightweight integration? EDIT: I'm going to try to update jupyterlab_table to use Clusterize; I'll open a PR there if I can get it to work.
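
For context, Clusterize.js keeps only the visible cluster of rows in the DOM and pads the rest with fixed-height spacer rows. A minimal sketch of its documented usage (the import style, typings, and column names are assumptions for illustration):

import Clusterize from 'clusterize.js';

// Markup assumed from Clusterize's docs:
// <div id="scrollArea" class="clusterize-scroll">
//   <table><tbody id="contentArea" class="clusterize-content"></tbody></table>
// </div>

// Illustrative data; a real extension would use the parsed CSV rows.
const data = [
  { zip: '35218', utility: 'Alabama Power Co' },
  { zip: '35219', utility: 'Alabama Power Co' },
];

// Clusterize takes rows as HTML strings and manages which are in the DOM.
const rows: string[] = data.map(
  r => `<tr><td>${r.zip}</td><td>${r.utility}</td></tr>`
);

const clusterize = new Clusterize({
  rows,
  scrollId: 'scrollArea',
  contentId: 'contentArea',
});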

blink1073 commented 7 years ago

We are having a related discussion in JupyterLab: https://github.com/jupyterlab/jupyterlab/issues/1587

gnestor commented 7 years ago

@blink1073 Thanks!

@themrmax I started implementing react-virtualized in jupyterlab_table! It will be good to compare the performance of both 👍

gnestor commented 7 years ago

@themrmax Here is an initial implementation using react-virtualized's table: https://github.com/gnestor/jupyterlab_table/tree/react-virtualized

I also have an example using its grid, which allows for horizontal scrolling but actually uses 2 different grids (one for the header) and attempts to keep their x scroll positions in sync (not very well).
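
For reference, react-virtualized also ships a ScrollSync helper aimed at exactly this header-sync problem. A minimal sketch, with placeholder dimensions and a dummy cell renderer:

import * as React from 'react';
import { Grid, GridCellProps, ScrollSync } from 'react-virtualized';

// Placeholder counts; real values would come from the data schema.
const COLUMN_COUNT = 20;
const ROW_COUNT = 10000;

const cell = ({ columnIndex, rowIndex, key, style }: GridCellProps) => (
  <div key={key} style={style}>{`r${rowIndex} c${columnIndex}`}</div>
);

export const SyncedTable = () => (
  <ScrollSync>
    {({ onScroll, scrollLeft }) => (
      <div>
        {/* Header grid: one row, x position driven by the body's scrollLeft */}
        <Grid
          cellRenderer={cell}
          columnCount={COLUMN_COUNT}
          columnWidth={100}
          height={30}
          rowCount={1}
          rowHeight={30}
          width={600}
          scrollLeft={scrollLeft}
          style={{ overflow: 'hidden' }}
        />
        {/* Body grid: reports its scroll position back to ScrollSync */}
        <Grid
          cellRenderer={cell}
          columnCount={COLUMN_COUNT}
          columnWidth={100}
          height={400}
          rowCount={ROW_COUNT}
          rowHeight={30}
          width={600}
          onScroll={onScroll}
        />
      </div>
    )}
  </ScrollSync>
);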

They're both pretty performant but could be optimized:

[screenshot]

blink1073 commented 7 years ago

Nice!

rgbkrk commented 7 years ago

Excellent!

themrmax commented 7 years ago

@gnestor Very nice! I've gotten basically the same thing working with Clusterize (I haven't done the headers, but I think I would need a similar trick to yours): https://github.com/themrmax/jupyterlab_table/tree/clusterize

A big problem I can see with both of our solutions is that when the data gets large (i.e. over a few thousand rows), it takes a very long time to load the component. As far as I can tell it's not a problem with the frameworks; the demo on https://clusterize.js.org/ instantly loads 500K rows if they're generated by JavaScript. Is it a limitation of us loading the data into the browser as a single JSON payload, and is there a way we could work around this?

EDIT: Just noticed the inferSchema function is very slow; I can get a pretty good speedup by just running it over the first few rows (10? 100?).
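
A minimal sketch of that sampling idea; the helper names and type lattice are hypothetical, not jupyterlab_table's actual API:

// Sketch: infer column types from a small sample instead of every row,
// widening to 'string' whenever the sampled values disagree.
type ColumnType = 'number' | 'boolean' | 'string';

function inferType(value: unknown): ColumnType {
  if (typeof value === 'number') return 'number';
  if (typeof value === 'boolean') return 'boolean';
  return 'string';
}

function inferSchemaSampled(
  rows: Record<string, unknown>[],
  sampleSize = 100
): Record<string, ColumnType> {
  const schema: Record<string, ColumnType> = {};
  // Only scan the first `sampleSize` rows.
  for (const row of rows.slice(0, sampleSize)) {
    for (const [key, value] of Object.entries(row)) {
      const t = inferType(value);
      if (!(key in schema)) {
        schema[key] = t;
      } else if (schema[key] !== t) {
        schema[key] = 'string';
      }
    }
  }
  return schema;
}

The trade-off is that a column whose first N values happen to look numeric but later contain strings would be mis-typed, so the sample size is a tuning knob.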

gnestor commented 7 years ago

@themrmax Yes. There is observable latency for 10,000+ cells for me. There is no latency when rendering as a pandas HTML table, partly because the pandas HTML is rendered kernel-side. A few thoughts: one option is for the output metadata to carry pagination hints that a renderer could act on, something like:

{
    "resources": [],
    "metadata": {
        "startIndex": 0,
        "stopIndex": 1000,
        "loadMoreRows": "df[startIndex:stopIndex]"
    }
}

Assuming that the extension can communicate with the kernel (which I know is possible but don't know how to implement), the extension could parse this loadMoreRows string and execute df[1000:2000] on the kernel and either return the rows to the loadMoreRows callback or asynchronously update the display using the new update_display feature. This is pretty hacky... @rgbkrk Any thoughts about how to accomplish infinite scrolling across kernels?
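
To make the kernel round-trip concrete, here is a hedged sketch using @jupyterlab/services. The interface names match that package, but shipping the slice as JSON over a stdout stream message is just one illustrative transport, not an established protocol:

import { Kernel, KernelMessage } from '@jupyterlab/services';

// Sketch: fetch rows [start, stop) of a kernel-side dataframe as JSON.
// `expr` names the dataframe in the kernel, e.g. 'df'.
function fetchRows(
  kernel: Kernel.IKernelConnection,
  expr: string,
  start: number,
  stop: number
): Promise<object[]> {
  const code =
    `import json; ` +
    `print(json.dumps(${expr}[${start}:${stop}].to_dict(orient='records')))`;
  return new Promise((resolve, reject) => {
    // silent: true skips history/execute_result but still emits the
    // stream message produced by print().
    const future = kernel.requestExecute({ code, silent: true });
    future.onIOPub = (msg: KernelMessage.IIOPubMessage) => {
      if (msg.header.msg_type === 'stream') {
        // stdout carries the serialized slice
        resolve(JSON.parse((msg.content as { text: string }).text));
      }
    };
    future.done.then(() => undefined, reject);
  });
}

// Usage inside a loadMoreRows-style callback (react-virtualized style):
// loadMoreRows = ({ startIndex, stopIndex }) =>
//   fetchRows(kernel, 'df', startIndex, stopIndex);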

rgbkrk commented 7 years ago

Any thoughts about how to accomplish infinite [paging] [of tables] across kernels?

I'd certainly like to see it.

I'm hopeful that we can build some simplified mechanics with the setIn-based models approach, coupled with the VDom stuff.