jupyterlab / jupyterlab

JupyterLab computational environment.
https://jupyterlab.readthedocs.io/

Persisting shared-models #10544

Closed dmonad closed 1 year ago

dmonad commented 3 years ago

Problem

After leaving a collaborative session in JupyterLab 3.1a*, we always start a new collaborative session without retaining the editing history for that document. Internally, we use Yjs to build a shared model (e.g. YNotebook) that can be synchronized with other peers in real time.

After all users have left a session, the editing history of the Yjs model is lost. We don't persist the YNotebook; instead, we create a new shared model. However, persisting the shared model has several advantages:

  1. By keeping the editing history, we can restore old versions of a Jupyter Notebook. Change tracking, as in Google Docs, is only possible with Yjs if we retain the editing history.
  2. For the commenting feature, it is generally advisable to use Yjs's relative positions, which allow us to "attach" meta-information to ranges of text. This too is only possible if we keep the editing history.
  3. There are edge cases in which a user who rejoins a collaborative session after being offline for a long time can overwrite the document of others with an old version. This happens because, once the document has lost its editing history, Yjs cannot tell which "fork" should be preferred and simply picks one. This issue can be prevented when the document model, including its editing history, is persisted in a database.

Keeping the editing history doesn't mean that we end up with huge documents: Yjs optimizes its internal representation and garbage-collects old edits by default. In practice, it is impossible for a human to produce an edit history that the browser wouldn't be able to parse in an acceptable amount of time.
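To make point 2 above concrete, here is a minimal sketch of the idea behind relative positions. This is *not* the Yjs API (Yjs works on binary-encoded CRDT structures); it is a toy model in which every inserted character keeps a stable ID, so a position anchored to an ID survives concurrent edits that shift numeric offsets. All names (`Doc`, `relPos`, `resolve`) are illustrative only.

```javascript
// Toy model of relative positions: a position stores a character's stable
// ID rather than a numeric offset, so it stays attached to the same
// character even after other edits shift the indices.
class Doc {
  constructor() {
    this.chars = [];   // [{ id, ch }]
    this.nextId = 0;
  }
  insert(index, text) {
    const entries = [...text].map(ch => ({ id: this.nextId++, ch }));
    this.chars.splice(index, 0, ...entries);
  }
  text() { return this.chars.map(c => c.ch).join(''); }
  // A relative position "attaches" to the ID of the character at `index`.
  relPos(index) { return { id: this.chars[index].id }; }
  // Resolve a relative position back to the current numeric index.
  resolve(pos) { return this.chars.findIndex(c => c.id === pos.id); }
}

const doc = new Doc();
doc.insert(0, 'hello world');
const anchor = doc.relPos(6);     // attached to the 'w' of "world"
doc.insert(0, '>>> ');            // a concurrent edit shifts all offsets
console.log(doc.text());          // → ">>> hello world"
console.log(doc.resolve(anchor)); // → 10, still the 'w' despite the shift
```

In real Yjs, resolving a relative position relies on the retained editing history: if the anchored content was deleted and its history garbage-collected away, the position can no longer be resolved, which is why persisting the history matters for comments.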

Proposed Solution

The RTC server should store the shared model, including its history, in a database. Any database would work: @ellisonbg suggested using SQLite, but LevelDB would also work.

There are several options:

  1. We could maintain a separate database for each document. The database would be stored alongside the file (alternatively, we could store the database instead of the .ipynb file).
  2. We maintain one database for all documents in a folder.
  3. We maintain one database for all documents on the operating system. The location for that database could be in the configuration folder for the user (e.g. ~/.jupyter/shared-models.db).

Another question is whether we want to keep the *.ipynb file if its content also lives in a database. The source of truth needs to be the entity that stores the shared model, so changes made to the .ipynb file directly on the filesystem would no longer be detected. In any case, we should probably start by adding the database as an optional feature and then, if we agree to it, move the notebook format to the database.

Additional context

This extension would work similarly to the y-leveldb adapter. It would store incremental updates in the database and automatically merge them into a single document entry when needed. This would ensure that each keystroke is immediately persisted, similar to Google Docs, where you never lose content even if you forget to hit the "save" button.
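The append-then-compact pattern described above can be sketched as follows. This is a toy in-memory version, assuming string-splice "updates" instead of the binary Yjs updates that y-leveldb actually stores (Yjs provides `Y.mergeUpdates` for the real merge step); the `UpdateStore` class and its method names are invented for illustration.

```javascript
// Sketch of the y-leveldb pattern: append one small update per edit,
// then periodically merge ("compact") them into a single document entry.
class UpdateStore {
  constructor() { this.entries = new Map(); } // docName -> array of updates
  // Called on every keystroke: append-only, never rewrite the whole doc.
  appendUpdate(docName, update) {
    if (!this.entries.has(docName)) this.entries.set(docName, []);
    this.entries.get(docName).push(update);
  }
  // Merge all pending updates into one entry by replaying them in order.
  compact(docName, applyUpdate, initialState) {
    const merged = this.entries.get(docName).reduce(applyUpdate, initialState);
    this.entries.set(docName, [merged]);
    return merged;
  }
}

// Toy "updates": text insertions at an index.
const applyUpdate = (state, u) =>
  typeof u === 'string' ? u // an already-compacted entry passes through
    : state.slice(0, u.index) + u.text + state.slice(u.index);

const store = new UpdateStore();
store.appendUpdate('a.ipynb', { index: 0, text: 'hello' });
store.appendUpdate('a.ipynb', { index: 5, text: ' world' });
const docState = store.compact('a.ipynb', applyUpdate, '');
console.log(docState); // → "hello world"
```

Appending tiny updates keeps each keystroke cheap to persist, while compaction bounds the read cost when a document is next loaded.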


@ellisonbg @SylvainCorlay @hbcarlos

fcollonval commented 1 year ago

Closing this, as we can now store the shared model in jupyter_collaboration using the Store concept.