Drop modelDB and use Yjs directly

davidbrochart commented 5 months ago

Problem

JupyterLab has an abstraction layer called modelDB in the observables package, which currently serves as a proxy for the Yjs shared types. At the time real-time collaboration was being implemented in JupyterLab using Yjs, there was the question of whether to drop modelDB or keep it as an abstraction layer on top of Yjs, and it was decided to keep it so that another CRDT backend could be used, if needed. There are a number of issues with that:

Yjs has features that modelDB doesn't have, which therefore cannot be used, for instance:
- deep observability,
- transaction support.
since data lives in both models, memory is duplicated.
since there is a double binding between modelDB and Yjs, special care must be taken in order to avoid circular dependencies.

All of this results in additional complexity, for really no benefit because JupyterLab already depends on jupyter-ydoc which depends on Yjs, so JupyterLab is not CRDT-agnostic (although I agree the CRDT-specific part is well contained in jupyter-ydoc). Additionally, as Jupyter is moving towards server-side execution, new issues start to appear because the modelDB abstraction doesn't exist in the backend, where we directly use pycrdt (Python bindings to Yrs, the Rust port of Yjs). Therefore we cannot communicate at the modelDB level between the backend and the frontend. Finally, the possibility of using another CRDT library in the future is getting smaller as Yjs is getting more and more popular. The only real alternative is Automerge that a team of JupyterLab developers evaluated and which was eventually discarded.

Proposed Solution

I propose to drop modelDB altogether from the JupyterLab code base, and directly use Yjs instead. See https://github.com/jupyterlab/jupyterlab/issues/16481 for more discussions.

@jupyterlab/jupyterlab-council Let's vote:

Jason Weill (@JasonWeill):
- [ ] yes
- [ ] no
- [x] abstain
Afshin Darian (@afshin):
- [ ] yes
- [ ] no
- [ ] abstain
Alex Bozarth (@ajbozarth):
- [ ] yes
- [ ] no
- [x] abstain
Andrii Ieroshenko (@andrii-i):
- [ ] yes
- [ ] no
- [ ] abstain
Steve Silvester (@blink1073):
- [ ] yes
- [ ] no
- [ ] abstain
Nicolas Brichet (@brichet):
- [ ] yes
- [ ] no
- [x] abstain
Damián Avila (@damianavila):
- [ ] yes
- [ ] no
- [ ] abstain
Eric Charles (@echarles):
- [ ] yes
- [x] no
- [ ] abstain
Brian Granger (@ellisonbg):
- [ ] yes
- [ ] no
- [ ] abstain
Eric Gentry (@ericsnekbytes):
- [ ] yes
- [ ] no
- [ ] abstain
Frederic Collonval (@fcollonval):
- [ ] yes
- [x] no
- [ ] abstain
Fernando Perez (@fperez):
- [ ] yes
- [ ] no
- [ ] abstain
Gabriel Fouasnon (@gabalafou):
- [ ] yes
- [ ] no
- [ ] abstain
Isabela Presedo-Floyd (@isabela-pf):
- [ ] yes
- [x] no
- [ ] abstain
Paul Ivanov (@ivanov):
- [ ] yes
- [ ] no
- [x] abstain
Jason Grout (@jasongrout):
- [ ] yes
- [ ] no
- [ ] abstain
Jeremy Tuloup (@jtpio):
- [ ] yes
- [ ] no
- [ ] abstain
Michał Krassowski (@krassowski):
- [ ] yes
- [x] no
- [ ] abstain
Martha Cryan (@marthacryan):
- [ ] yes
- [ ] no
- [x] abstain
Mehmet Bektas (@mbektas):
- [ ] yes
- [ ] no
- [ ] abstain
R Ely (@ohrely):
- [x] yes
- [ ] no
- [ ] abstain
Rosio Reyes (@RRosio):
- [ ] yes
- [ ] no
- [ ] abstain
Sylvain Corlay (@SylvainCorlay):
- [ ] yes
- [ ] no
- [ ] abstain
Zach Sailer (@Zsailer):
- [ ] yes
- [ ] no
- [ ] abstain

EDIT: I added @RRosio, @SylvainCorlay and @Zsailer to the list of voters, according to this list.

fperez commented 5 months ago

Please take my comments with very limited value as I'm not deeply knowledgeable about this specific part of the codebase, and I may well be missing an important conterpoint.

But a while back in Jupyter (before the name even existed, back when it was all IPython) we tried to really stick to a YAGNI principle, as we were bitten by premature and unnecessary generalizations. A clear example was our early adoption of a concept of "pages" in the notebook format without any UI to actually support it. Eventually there was an Emacs frontend that supported that feature, and it wasn't great when we decided to remove it and offer other simpler navigation functionality, as it meant deprecating something that (small, but still important) user community might have come to rely on.

This seems like a perfect opportunity to apply that stricter principle and remove this otherwise unnecessary layer that is having current real costs for a supposed future benefit that we may never actually realize/need.

jtpio commented 5 months ago

Thanks @davidbrochart for opening the issue.

As discussed in https://github.com/jupyterlab/jupyterlab/issues/16481, one of the main challenges with this proposal will be the breaking changes it will introduce, and the need for a new major release of the related package(s).

Posting my comment from https://github.com/jupyterlab/jupyterlab/issues/16481#issuecomment-2181085345 here for visibility:

We should be very careful about this, because extension authors already had to migrate their extensions twice in a rather short period of time with JupyterLab 3 (released in December 2020) and JupyterLab 4 (released in May 2023). For lab 3 the main change was about the packaging of extensions. But lab 4 came with a long list of breaking changes in various packages: https://jupyterlab.readthedocs.io/en/latest/extension/extension_migration.html#jupyterlab-3-x-to-4-x

Based on the various discussions in the last dev meetings, it sounded like the general feeling was to wait as long as possible before making breaking changes that would have a direct impact on extension developers. And avoid extension development fatigue and extension compatibility issues for end users (with extensions only compatible in an old version of lab).

If breaking changes can't be avoided, maybe there could be a way to improve the extension migration script to automatically perform the necessary changes on an extension code base. This would help extension authors migrate their code more easily with the less friction.

Also, it would be useful to provide some examples of the kind of changes extension authors should expect, for example in a form of a before / after or diff.

krassowski commented 5 months ago

I very much support @jtpio. I am afraid that some folks my not appreciate how dire the situation is and how exhausted extension authors are with constant breaking changes, and users with the extensions not working correctly or at all in the supported versions.

I am also afraid that this can increase the learning curve for developing extensions which interact with documents. Already with the previous migration changes were made without documenting how to do basic interactions with the documents, nor proper migration guidance:

https://github.com/jupyterlab/jupyterlab/issues/13427

Edit: and there are outstanding issues that broke things in editors severely degrading the UX for me that have not been fixed since, like:

After this change, an extension author will need to learn one more thing which is the internals of yjs, and what yjs is in the first place. Right now they are not required to do so.

I will likely be voting "No" but not because I oppose this change in principle and will oppose it always - I think that the advantages of the change do no outweigh the risks at this time, this can change if we come back to this in a years time for example.

Quoting from https://github.com/jupyterlab/jupyterlab/issues/16481#issuecomment-2173953416

For instance with Yjs, changes to a nested shared type can be observed: [...] But not with ModelDB: [...]

In principle this could be solved by extending the ModelDB, right? We could have changedDeep signal as well.

davidbrochart commented 5 months ago

In principle this could be solved by extending the ModelDB, right? We could have changedDeep signal as well.

Maybe but it's not the only missing feature.

echarles commented 5 months ago

A few years ago, we already had those discussions, and luckily we have today a ydoc package that proves that the abstraction approach makes sense. Other projects (SyncedStore...) outside of JupyterLab in the meantime have also successfully taken the abstraction approach, creating collaborative data structures, or distributed state management solutions, on top of Y.js.

To be coherent, this proposal should be extended to the ydoc package. ALL the collaborative data structure of JupyterLab should be architectured the same way. Therefor, this proposal should be simply be closed and reopened covering both the observable and ydoc packages, otherwise JupyterLab core and extensions developers will have to deal with an architectural bicephal implementation and thinking.

Also, it would be useful to provide some examples of the kind of changes extension authors should expect, for example in a form of a before / after or diff.

I was thinking to ask the same thing on the initial discussion but suddenly has been taken in this vote process... anyway, thank you @jtpio, this will help voters to make sense of the expected changes. Would it be possible to give 2 examples: a simple use case, and a complex use case?

As additional information, could you also please list the impacted packages and identify the impact on the external jupyterlab repositories (including the non-official extensions), this will help voters.

This seems like a perfect opportunity to apply that stricter principle and remove this otherwise unnecessary layer that is having current real costs for a supposed future benefit that we may never actually realize/need.

@fperez this proposal is about removing the abstraction for the observables package only, and keeping it for ydoc, which brings questions about the validity of the raised arguments. This proposal would lead to bicephal code which does not make sense to me.

In principle this could be solved by extending the ModelDB, right? We could have changedDeep signal as well.

@krassowski Like you and @jtpio I am much worried on the impacts for JupyterLab third-party consumers and extenders. You may be answered with a some crafted migration tooling (the complexity of the current tooling is btw the biggest complaints I hear), release and communication plans, by this proposal supporters.

My strongest concern is however the proposed architecture. It looks to me that instead of making better an existing good and working thing, it it proposed to drop it and replace it with something weaker and bringing others and more issues, leading to non-coherent implementations (raw yjs for observables, abstraction layer for ydoc).

I am voting -1.

(to be clear, even if this proposal would cover both observables and ydoc, I would be voting -1 as favoring abstraction for that specific aspect).

davidbrochart commented 5 months ago

we have today a ydoc package that proves that the abstraction approach makes sense

It only proves that we made it work with the modelDB abstraction, but as I said there are a number of issues with that.

To be coherent, this proposal should be extended to the ydoc package

Of course I would also drop modelDB in jupyter-ydoc.

@fperez this proposal is about removing the abstraction for the observables package only, and keeping it for ydoc

You must have misunderstood me, jupyter-ydoc must be updated accordingly, as any other extension that depends on modelDB.

it it proposed to drop it and replace it with something weaker and bringing others and more issues, leading to non-coherent implementations (raw yjs for observables, abstraction layer for ydoc).

How is Yjs weaker than modelDB? What issues does it bring? Again, it seems obvious that jupyter-ydoc won't have any abstraction layer anymore.

echarles commented 5 months ago

@davidbrochart Can you clarify. Do you want to drop YNotebook as part of this proposal ? https://github.com/jupyter-server/jupyter_ydoc/blob/b3905b3aef1e6f821819e8ed120370145b0bed28/javascript/src/ynotebook.ts#L38-L40

davidbrochart commented 5 months ago

Of course not, but YNotebook would not use modelDB anymore.

echarles commented 5 months ago

Turning round again and again... the arguments you are using to justify the change you propose don't make sense to me and this change is just driving to bicephal implementations (raw yjs in some case, and abstraction in other case). I will further abstain on this thread and your replies to avoid cluttering the discussion.

davidbrochart commented 5 months ago

Sorry if I'm not clear enough. I just want to drop the modelDB abstraction, both in JupyterLab and jupyter-ydoc. No abstraction layer anymore, I hope that this is clear.

fcollonval commented 5 months ago

teasing on

We can actually close this one and https://github.com/jupyterlab/jupyterlab/issues/16481 too. As it is solved by https://github.com/jupyterlab/jupyterlab/pull/12695 + https://github.com/jupyterlab/jupyterlab/pull/13168 + https://github.com/jupyterlab/jupyterlab/pull/13247

for those in doubt we are still using ModelDB (and we don't), just search the code.

teasing off

When refactoring the API to ease the integration of the shared models, Carlos and I after various discussions and reading of various issues and PRs decided to create ydoc to have a semantic API hiding YJs to ease the understanding of the developers. As any abstraction, it can definitely be improved and no solution are provided to deal with advanced features of YJs like transaction (although that should not be needed if the semantic API is well done) or the undo manager (that one can be tricky).

The data duplication between a JupyterLab model and a shared model is still happening for the cell output model and the attachments (search for globalModelDBMutex). This was left as is by lack of time and to not block 4.0.

If changing those parts (and the outputs is probably what you @davidbrochart have in mine) is the goal of this, then drafting a PR to see the extend of API breaking changes sound reasonable and not too expensive.

Regarding using Yjs directly, I'm strongly against it because there are lots of subtle things about it and having a semantic layer as ydoc is much better for developers.

[!NOTE] ydoc does not depend on JupyterLab objects; hence not on the observables package.

    "dependencies": {
        "@jupyterlab/nbformat": "^3.0.0 || ^4.0.0-alpha.21 || ^4.0.0",
        "@lumino/coreutils": "^1.11.0 || ^2.0.0",
        "@lumino/disposable": "^1.10.0 || ^2.0.0",
        "@lumino/signaling": "^1.10.0 || ^2.0.0",
        "y-protocols": "^1.0.5",
        "yjs": "^13.5.40"
    }

davidbrochart commented 5 months ago

When I say using Yjs directly, I don't mean getting rid of jupyter-ydoc's API, but not using the observables package as an abstraction layer on top of Yjs in JupyterLab.

ellisonbg commented 5 months ago

Overall, I am in favor of this proposed direction, but it sounds like there are still technical details to work through. However, I also agree with others comments about the need for us to be careful about breaking APIs. @davidbrochart could you update the proposal with more information about what parts of this work can be done in the 4.x series of releases (backwards compatible) and which parts would need to go into major releases (presumably 5.x). For example, can we switch to having Yjs be the single source of truth for all of our own code, but still offer a ModelDB compatibility layer for the rest of the 4.x series? I am generally aware of a number of other backward incompatible changes that are needed across different JupyterLab and Jupyter Server repos, so we can't avoid breaking APIs forever. It maybe helpful for planning purposes for us to begin mapping out upcoming breaking changes, so we can better assess the timing and tradeoffs.

fcollonval commented 4 months ago

When I say using Yjs directly, I don't mean getting rid of jupyter-ydoc's API, but not using the observables package as an abstraction layer on top of Yjs in JupyterLab.

Is it the observables or the Lumino signals you want to get ride off? If this is the Lumino signals, I'm strongly against using two notifications mechanism. Lumino signals is the JupyterLab standard way to connect code pieces since the beginning of JupyterLab and we should avoid having multiple communication mechanisms in public API.

davidbrochart commented 4 months ago

Is it the observables or the Lumino signals you want to get ride off?

Only the observables package. Yjs doesn't have the equivalent of signals. But the changes emitted through these signals would have the Yjs delta format, not the observables change format. This is already what we do in jupyter-ydoc, for instance here.

jupyterlab / frontends-team-compass

Drop modelDB and use Yjs directly #253

Problem

Proposed Solution