jupyter / notebook

Jupyter Interactive Notebook
https://jupyter-notebook.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Poor performance when saving a notebook over SSH #939

Open sparseinference opened 8 years ago

sparseinference commented 8 years ago

I use the Jupyter notebook on a local Linux system (Ubuntu 15.10) and run the server and kernel on a remote Linux VPS (also Ubuntu) over an SSH connection. The server is Python 3.5 (Anaconda).

The connection's upload speed is much slower than its download speed, and I've noticed that saving a notebook takes very long (several minutes). During a save, the upload rate spikes to about half the available bandwidth, and the browser (Chrome 47) slows down so much that normal browsing (for documentation, for example) becomes unusable.

I use Bokeh plots in the notebook, and I've only noticed this happening when the notebook contains Bokeh plot output.

Could the notebook save function also be sending all the Bokeh plot output back to the server? That seems unnecessary to me.

Can anyone confirm that this is happening? It could be related to issue #650, where large plots are also present in the notebook output. In my case I don't really want the plot output to be saved, because I can always regenerate the plots, and if I want a permanent record I can save an image.

takluyver commented 8 years ago

At the moment, saving is quite naive, and it does indeed send all the data back to the server. This is because only the frontend, not the server, keeps track of the notebook state. Some of the work @Carreau is doing for real time collaboration should make smarter saving possible by holding document state in the server.
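To make the cost concrete, a frontend save is essentially a single PUT of the full notebook model to the server's contents REST API. A minimal sketch of the equivalent request from a script (the URL, token, and notebook path below are placeholder assumptions):

```python
import json
import requests

base_url = "http://localhost:8888"   # assumed server address
token = "REPLACE_WITH_TOKEN"         # assumed auth token
path = "example.ipynb"               # assumed notebook path

with open(path) as f:
    content = json.load(f)

# The full notebook model -- every cell, including all outputs -- travels
# in this single request body, which is why large embedded plots make each
# save proportionally slower over a slow uplink.
resp = requests.put(
    base_url + "/api/contents/" + path,
    headers={"Authorization": "token " + token},
    json={"type": "notebook", "format": "json", "content": content},
)
resp.raise_for_status()
```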

sparseinference commented 8 years ago

OK thanks - I suspected that was the case.

Until the collaboration work is ready, then, would it be a bad idea to simply avoid sending any cell output back to the server when a flag is enabled somewhere?

Maybe it could work like clearing all outputs before saving - but without requiring any computed data in variables to be recomputed, of course, since that would defeat the purpose.

minrk commented 8 years ago

An extension can clear outputs automatically on save, or even exclude output from every save if desired. I don't think we should bake this in as a configurable option by default, though, as there is a cost to every configurable option.
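For example, a pre_save_hook in jupyter_notebook_config.py can scrub outputs whenever a notebook is written. A minimal sketch (note this hook runs on the server after the upload, so it shrinks what lands on disk rather than the upload itself):

```python
# In jupyter_notebook_config.py -- `c` is the config object the notebook
# server provides when it loads this file.
def scrub_output_pre_save(model, **kwargs):
    """Remove outputs and execution counts before a notebook is written."""
    if model['type'] != 'notebook':
        return
    if model['content']['nbformat'] != 4:
        return  # only handle nbformat v4 here
    for cell in model['content']['cells']:
        if cell['cell_type'] == 'code':
            cell['outputs'] = []
            cell['execution_count'] = None

c.FileContentsManager.pre_save_hook = scrub_output_pre_save
```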

sparseinference commented 8 years ago

Amazing ... 6 months after I asked a simple question, the issue is closed without even attempting to address the original problem or to enter into any dialog at all about it.

ellisonbg commented 8 years ago

Apologies for that - I'm not sure why it was closed. Can someone provide some background?

I do know that we are waaaay behind on issue triage though


Carreau commented 8 years ago

Amazing ... 6 months after I asked a simple question, the issue is closed without even attempting to address the original problem or to enter into any dialog at all about it.

Sorry about that; the "Close and comment" button is next to the "Comment" button, and it's common to misclick, especially when navigating quickly with the keyboard. It's annoying enough that there are even extensions to move it.

It's also true that we are a small team and literally get hundreds of notifications each day – just today, ~50 issues were opened, with multiple notifications per issue on GitHub alone. We try our best, though, and we'd appreciate being given the benefit of the doubt.

The comment that we did not attempt "to address the original problem or to enter into any dialog at all about it" seems a bit harsh, as two developers from the team did write comments explaining the current situation and proposing to write an extension.

Anyway, looking again at the issue and the comments already given: yes, this is an issue we are aware of; we need incremental saving/synchronization. That is not going to happen soon, but it is being worked on. Like the second comment above, I agree that a configuration option is likely not the right way to tackle it, but yes, a custom extension for the notebook can perfectly well handle this case and strip Bokeh plots at save time.

We are likely not going to special-case Bokeh, and I know of a few people who would pursue us to the ends of the earth with chainsaws if we stripped large output by default.

So here is the status quo: there is likely nothing that is going to be done for this specific case right now, at least not in core, and it will improve with time. custom.js and various extensions can achieve this if needed, but we don't provide a stable API for it. So your best chance is to go with a custom extension; resources online are plentiful, and if you need a hand we'll be happy to provide more information.
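If stripping outputs after the fact is enough, nbconvert's ClearOutputPreprocessor can also do it from a small script, independent of the frontend. A sketch, with the notebook path as a placeholder:

```python
import nbformat
from nbconvert.preprocessors import ClearOutputPreprocessor

# Read the notebook, drop every cell's outputs, and write it back.
nb = nbformat.read("example.ipynb", as_version=4)
ClearOutputPreprocessor().preprocess(nb, {})
nbformat.write(nb, "example.ipynb")
```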

Thanks,

sparseinference commented 8 years ago

@Carreau @ellisonbg Thanks for responding.

I understand that you are overwhelmed with issues. When I first asked the question, I guess you missed that I was asking for advice on how I could help to solve the problem. Since then, I rearranged things to work locally instead of remotely, because it was very difficult to be productive on a remote server. Since I don't know the code-base, I was looking for a way to work around the problem without introducing any new ones. I didn't know anything about extensions, or even that they existed. I also wasn't trying to fix a special case with Bokeh; the problem exists wherever any cell produces output in large quantities.

OK, I will search for online resources about extensions and see how that works. A requirement to work remotely will undoubtedly appear again soon, so it would be nice to be ready.

Thanks,

minrk commented 8 years ago

Sorry for closing prematurely; it was not meant to shut down discussion, just to indicate that this does not represent a task to do on this project. I mentioned extensions as a possible solution but should have given more detail. As an example, I wrote this one, which removes output from the data that is to be saved. I hope that helps.

sparseinference commented 8 years ago

@minrk Thank you very much for the example. That helps a lot to get started.

JamiesHQ commented 7 years ago

Hi @sparseinference: just checking in to see how things went with the extension minrk suggested above, and whether there's anything else we can help you with on this issue. Thanks!

sparseinference commented 7 years ago

Hello @JamiesHQ, thanks for the reminder - I intend to resume looking into this soon.

TuranTimur commented 6 years ago

+1

It seems that it uploads 717b each time it saves. @JamiesHQ, would there be any workaround?

shahbazbaig commented 4 years ago

I am also facing the same issue. It takes very long for each operation/command to execute. Any suggestions for overcoming this problem?

LustigePerson commented 3 years ago

For some time now, notebooks have had unique cell IDs, right? Perhaps it would even be possible to sync only changed cells this way? Working on a remote server really becomes painful once you have some plots in your notebook. Is there any solution for this in JupyterHub? They must be hit by this problem even harder.