jupyterhub / binderhub

Run your code in the cloud, with technology so advanced, it feels like magic!
https://binderhub.readthedocs.io

The need for persistence #1003

Open betatim opened 4 years ago

betatim commented 4 years ago

This issue is about working on making it easier to save a currently running Binder session as well as restoring/restarting a new Binder session from that state at a later point in time.

Right now, when a user's Binder session times out they lose their work. This fits the "Binder sessions are ephemeral" goal, but a way to save/restore your work would be a great feature to have, in particular for public deployments like mybinder.org where the timeouts are set to fairly short times.

Below are some ideas that have been discussed previously, with some pros and cons. This issue is about collecting additional ideas with their pros&cons as well as zeroing in on something simple that we can implement and test drive. It "only" has to be better than what exists now to get my support (the perfect being the enemy of the good etc). We can go for more ambitious solutions in a second iteration.

Show time till timeout

The idea is to display a countdown in the UI that lets users know how long they have left until the session times out. It would also give feedback about which actions reset the timer, as people would be able to see it reset. It sounds like it should be simple to implement but I don't know if that is actually true. Can the UI access (a lower bound on) how much time is left and notice that it has been reset?

If we can access or compute how long is left this would be a nice first solution that would hopefully be simple to implement as a Jupyter notebook and Lab extension. For other UIs it would be more tricky to do.

Upload pod state to a blob store

We could execute a script via the preStop lifecycle hook of Kubernetes. This script could then upload the state of the home directory (/home/jovyan) to a blob store. We'd need to find a way to tell users where to download this blob after the binder has timed out. It is also not clear that the time window the preStop script has is enough to upload everything, or how a user would resume from such a download.

"Save as" uses notebook state in the browser

The state of a notebook that is open is available to the browser even after the server has gone away because the state of the notebook is only stored there. This means we could have a notebook extension that lets users save/download the notebook they are looking at even after the server has gone away. I am not sure what would need to happen or where to start. Drawbacks include that it would only cover Jupyter frontends and data files would be lost.

I will keep adding to this thread over time but please do post your own ideas and thoughts on any of these. I will try and dig out the relevant issues for the ideas that have been previously discussed so that we can pick up things from those discussions.

manics commented 4 years ago

Another idea: use the browser's localStorage. One downside is that it is scoped to the domain, not to the notebook / URL, so we need to be careful about automatically restoring state. Could perhaps make it a prompt (Do you want to restore your previous state?).

betatim commented 4 years ago

You'd store the current notebook state in local storage and then when someone opens the same binder (or notebook) we'd ask the user if they want to restore from the local state? Would this be something to implement as a notebook extension (via some JS)?

I like the idea, the hardest part could be recognising that a notebook is the same.

manics commented 4 years ago

Maybe this should be split into two topics:

betatim commented 4 years ago

I'd keep it as one issue to collect all the ideas and their pros&cons. Then if there is consensus on what to start with make a new issue to implement this.

At least for me what to store, for how long and where are parts of the trade-offs we can make. "Everything on my personal dropbox" being maybe one end of the extreme and "nothing, nowhere, never" at the other end.

My guess would be that by being able to recover the currently open-in-the-tab notebook we'd already make a lot of people happy. Without any special upload functionality or anything. Just don't know where to get started on that (I assume we need some JS code for this which makes it a notebook extension?)

manics commented 4 years ago

Quick proof-of-concept (tested on Firefox):

  1. Load a repository in binder, open a notebook, make some changes, save it.

  2. Open your browser's JavaScript console for that page

  3. Paste this into the JS console and run it:

    // Fetch the saved notebook from the server and copy it into localStorage.
    Jupyter.contents.get(Jupyter.notebook.notebook_path, {type: "notebook", content: true}).then(function(value) {
        console.log(value);
        localStorage.setItem(Jupyter.notebook.notebook_path, JSON.stringify(value));
    }, function(value) {
        alert("Failed to get notebook");
    });

    It should save your current notebook into localStorage.

  4. Load a new instance of the same binder repository, you must be on the same domain (if the federator directs you to a different binderhub this won't work). Open the same notebook.

  5. Open your browser's JavaScript console for that page

  6. Paste this into the JS console and run it:

    loaded = JSON.parse(localStorage.getItem(Jupyter.notebook.notebook_path));
    Jupyter.notebook.fromJSON(loaded)

If it works you should see your notebook with previous changes!
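The two console snippets above can be folded into a pair of small functions. This is a minimal sketch, not the code of any existing extension: `saveNotebook`, `restoreNotebook`, `makeMemoryStorage` and the `confirmRestore` callback are hypothetical names, and `storage` is assumed to be any object with `getItem`/`setItem` (such as `window.localStorage`).

```javascript
// Sketch only: `storage` is anything with getItem/setItem (e.g. window.localStorage).
// `confirmRestore` is a hypothetical callback that asks the user for permission.
function saveNotebook(storage, key, notebookJson) {
  storage.setItem(key, JSON.stringify(notebookJson));
}

function restoreNotebook(storage, key, confirmRestore) {
  const raw = storage.getItem(key);
  if (raw === null || raw === undefined) {
    return null; // nothing was saved under this key
  }
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return null; // the stored value is not valid JSON, so ignore it
  }
  // Only hand the saved state back if the user agrees to restore it.
  return confirmRestore(parsed) ? parsed : null;
}

// Tiny in-memory stand-in for localStorage, for trying this outside a browser.
function makeMemoryStorage() {
  const items = new Map();
  return {
    getItem: (k) => (items.has(k) ? items.get(k) : null),
    setItem: (k, v) => { items.set(k, String(v)); },
  };
}
```

In the browser this would presumably be called with `localStorage`, `Jupyter.notebook.notebook_path` as the key and `() => window.confirm("Restore your previous state?")` as the callback, passing a non-null result to `Jupyter.notebook.fromJSON` as in step 6.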

I think https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API would be a better option than localStorage though. It's more complicated, but it's designed for storing much larger amounts of data.

As you've already mentioned the biggest problem is finding a repository identifier when storing the notebook.

betatim commented 4 years ago

I am unreasonably excited by this :) Is this the moment where we make a new issue for "Browser based storage of notebooks"? (I'd say yes)

What do you think of the following: we hook into the "save" event of each notebook and store stuff to "browser storage" (IndexedDB or localStorage or ...).
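A minimal sketch of such a save hook follows. It assumes the classic notebook frontend, where `events` would be `Jupyter.notebook.events` and the save event is assumed to be named `notebook_saved.Notebook`; `installSaveHook` and its parameters are hypothetical names introduced here for illustration.

```javascript
// Sketch only. In the classic notebook, `events` would be Jupyter.notebook.events
// and the event name is assumed to be "notebook_saved.Notebook".
// `getNotebookJson` is a hypothetical callback returning the current notebook state.
function installSaveHook(events, getNotebookJson, storage, key) {
  const handler = () => {
    // Copy the notebook into browser storage every time a save completes.
    storage.setItem(key, JSON.stringify(getNotebookJson()));
  };
  events.on("notebook_saved.Notebook", handler);
  return handler; // returned so the caller can unregister it later
}
```

Passing the event source and storage in as arguments keeps the function independent of one frontend, so the same idea could be tried against IndexedDB or another UI.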

I could see two alternatives for letting users get their stuff:

In order to offer the user a "want to restore from browser storage?" option when they open a notebook it would be nice if we had a unique ID in the notebook metadata. Maybe a (big) random number that is written when the notebook is first created. You could then use that as key. I will create an issue in the notebook format repo to see if we can start a discussion on this.
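A sketch of writing such an ID when it is missing is shown here. The metadata key `persistence_id` and both function names are hypothetical and not part of the notebook format, and `Math.random` is used only for illustration; it is not a cryptographically strong source.

```javascript
// Sketch only: "persistence_id" is a hypothetical metadata key, not part of nbformat.
function randomHexId(bytes = 16) {
  let out = "";
  for (let i = 0; i < bytes; i++) {
    // Two hex digits per byte; Math.random is for illustration only.
    out += Math.floor(Math.random() * 256).toString(16).padStart(2, "0");
  }
  return out;
}

function ensureNotebookId(notebook) {
  notebook.metadata = notebook.metadata || {};
  if (!notebook.metadata.persistence_id) {
    // Written once; later calls reuse the existing value.
    notebook.metadata.persistence_id = randomHexId();
  }
  return notebook.metadata.persistence_id;
}
```

Because the ID lives in the notebook metadata, it would travel with the file, so a copy of the notebook opened from another repository would still match the stored state.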

manics commented 4 years ago

> Is this the moment where we make a new issue for "Browser based storage of notebooks"?

Sounds good to me!

> In order to offer the user a "want to restore from browser storage?" option when they open a notebook it would be nice if we had a unique ID in the notebook metadata

I think a repository identifier would be useful alongside a notebook UUID:

betatim commented 4 years ago

There are now REF_URL and REPO_URL on mybinder.org (via https://github.com/jupyterhub/mybinder.org-deploy/pull/1202) that let you know which repo you are in and the info to start a new binder again. So we could store that together with the notebook path.
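One way to combine the repository URL and the notebook path into a single storage key is sketched below. `makeStorageKey` is a hypothetical helper, and how the repository URL reaches the browser is left open; the normalisation rules are assumptions, not something mybinder.org prescribes.

```javascript
// Sketch only: builds one storage key from a repository URL and a notebook path.
function makeStorageKey(repoUrl, notebookPath) {
  // Normalise the repo URL: trim whitespace, drop trailing slashes and a trailing ".git".
  const repo = repoUrl.trim().replace(/\/+$/, "").replace(/\.git$/, "");
  // Treat "/x.ipynb" and "x.ipynb" as the same path.
  const path = notebookPath.replace(/^\/+/, "");
  // A JSON array avoids ambiguity that a plain separator character could cause.
  return JSON.stringify([repo, path]);
}
```

With a key like this, two notebooks with the same path in different repositories no longer collide in the browser storage of one domain.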

What do you think of starting with just notebooks in the browser storage? It feels like if we include arbitrary files we need to figure out something to prevent (very) large files filling up the browser storage/need a good UI to let people inspect/manage it.
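Even with notebooks only, a simple size check before writing could keep a single large notebook from filling the browser storage. This is a sketch under assumptions: `fitsBudget` and the `maxBytes` limit are hypothetical, and the size is measured as the UTF-8 length of the serialised JSON, which may differ from how a browser accounts for its quota.

```javascript
// Sketch only: returns true when the serialised notebook is within `maxBytes`.
function fitsBudget(notebookJson, maxBytes) {
  const text = JSON.stringify(notebookJson);
  // Count UTF-8 bytes rather than characters, so non-ASCII text is not undercounted.
  const bytes = new TextEncoder().encode(text).length;
  return bytes <= maxBytes;
}
```

A caller would skip the write, or warn the user, when this returns `false`.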

The domain thing is a shame but as long as we start with the "store in your browser" feature being an optional extra I feel like we can get started instead of having to find a solution to this from day one.

betatim commented 4 years ago

To show the time left until the timeout, we can query the Hub API from inside the container/notebook:

    curl -H "Authorization: bearer $JPY_API_TOKEN" http://hub:8081/hub/api/users/$JUPYTERHUB_USER

The response tells us the last activity timestamp, from which we can start a countdown of roughly 8 minutes or some such. It is a lower bound because each kernel first has to time out etc. So this is a "simple fix" only.
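The countdown itself is simple arithmetic once the timestamp is known. This sketch assumes the API response carries an ISO 8601 last activity timestamp and that the timeout is known to the caller (for example 480 seconds for roughly 8 minutes); `secondsRemaining` is a hypothetical name.

```javascript
// Sketch only: computes a lower bound on the seconds left before the session is culled.
function secondsRemaining(lastActivityIso, timeoutSeconds, nowMs = Date.now()) {
  const last = Date.parse(lastActivityIso);
  if (Number.isNaN(last)) {
    return null; // the timestamp could not be parsed
  }
  const elapsed = (nowMs - last) / 1000;
  // Never report a negative countdown.
  return Math.max(0, Math.round(timeoutSeconds - elapsed));
}
```

Polling the API periodically and recomputing this value would also make a reset of the timer visible, since a newer last activity timestamp pushes the countdown back up.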

ivan-gomes commented 4 years ago

Is there any way we could do this extensibly - Contents API comes to mind - such that one could implement persistence with a blob store like S3 in addition to browser to stretch beyond a single browser and computer?

betatim commented 4 years ago

Sounds interesting. Could you re-post/link your comment in #1007 @ivan-gomes ? Using a remote storage like S3 sounds interesting but like something I'd postpone until we have a first version that works with users. Trying to figure out auth and such could be tricky :-/

I don't see this being implemented as a contents API storage (I think there are already options there to use S3 in https://github.com/nteract/bookstore?). We'd use the Contents API to get the contents of the notebook (maybe). Anyway, something for the other issue :)

consideRatio commented 4 years ago

Questions

Is it possible, within for example JupyterLab, to download an open notebook from the UI even though we have been disconnected from the server? Is there such a JupyterLab extension already? It sounds like something useful to create, totally separate from creating it specifically for mybinder.org or similar.

manics commented 4 years ago

https://github.com/manics/jupyter-offlinenotebook lets you download an open notebook on any system running jupyter-notebook, it's only the local-storage and binder links that are restricted to BinderHub, though the local storage restriction could be relaxed to work on any system.

PRs adding JupyterLab support would be very welcome 😀

manics commented 4 years ago

@consideRatio It's not as polished as the notebook extension... https://mybinder.org/v2/gh/manics/jupyter-offlinenotebook/master?urlpath=lab%2Ftree%2Fexample.ipynb

almereyda commented 3 years ago

After https://github.com/jupyterlab/jupyterlab/issues/5382#issuecomment-837106904 had been merged, there is a SharedNotebook object now that one could use to persist a notebook, too.

The server doesn't yet use the implemented SharedNotebook because it is written in JavaScript. However, we are making good progress on the Yjs-Python port, which would allow the kernel to write messages directly to the shared notebook.