Real Time Collaboration Plan

saulshanabrook commented 4 years ago

Over the past few years, many folks have been working on bringing real time collaboration to JupyterLab. It would support new features like:

Allow multiple people to simultaneously work on a notebook (or other document) at once, ala Google Docs
Support working on the same document across multiple monitors, with multiple browser windows open
Undo/redo
Saving outputs in a notebook after you close the window

The current work is in the datastore package in Lumino and in a PR to JupyterLab (https://github.com/jupyterlab/jupyterlab/pull/6871).

Moving forward, we could could move this work out into two new separate repositories:

lumino-datastore: General purpose frontend library to support real time collaboration. Either depends on lumino datastore or we move it into this repo. It does not depend on anything Jupyter related. We would move Vidar's existing server side patch proxy work to this repo, from the JupyterLab PR. We would also move Ian's work on datastore helpers to this repo. The idea is that any web app could use this package to add a real time synchronized datastore to their app. Prior art: https://github.com/redux-orm/redux-orm https://github.com/grrowl/redux-scuttlebutt https://github.com/automerge/automerge https://github.com/DevResults/cevitxe. We could put this Lumino itself, but having it as a seperate repo can make it better from an external marketing perspective as well as allowing us to more freely include third party dependencies. Marketing wise, the idea is that, like those other projects, users can come to the Github repo and understand the sole use of the project.
jupyter-datastore: The Jupyter Datastore package gives you an up to date data model of the Jupyter Server data structures in your browser. We would move the other work Ian has been doing from the JupyterLab PR into this repo. It could also add a server side component that talks directly to the server. This would help us achieve the last goal of saving outputs to a notebook even if a client is not open. It is meant to be a building block for any Jupyter web UIs.

Here is a drawing I put together to try to show how these different pieces could work together:

Zach and I also started to sketch out the start of the jupyter-datastore APIs, including the REST api and the tables

Adding these new repos has some advantages:

Adds discoverability to the process, allows others to know how its going by following issues and code changes.
Makes it easier to build a community around the lumino datastore work, by creating its own repo to show off stand alone code examples and documentation. The hope would be make this play nicely with the rest of the JS frontend ecosystem, so that we can gain contributors from other projects and make it more widely used. This is important for its long term viability, because its a complicated piece of code to maintain.
Allows us to have time and space to experiment with the desired low level Juptyer datastore APIs before we figure out how they fit inside of JupyterLab. It's less work than keeping a PR up to date by constantly having to rebase.

However, it will come at the expense of more maintenance burden, having to set up our own build and testing infrastructure for each repo. And it might be potentially confusing, if folks are not sure what the scope is of the different repos. Its also harder to create cross-repo changes, because it requires coordinating pull requests.

I propose creating these two new repos on the JupyterLab organization and create issues and milestones to track what needs to be done on each. Before that can be done, we have to come up with names for each. Current are this, but we could change them:

lumino-datastore
jupyter-datatsore

cc @vidartf @ellisonbg @afshin @Zsailer

Does anyone have objections or name ideas?

vidartf commented 4 years ago

Regarding moving the work to a new repository, I agree with the intent and proposed structure, but want to mention that it might have been easier if lumino was registered as an org (under the jupyter umbrella similar to jupyterlab, jupyter-widgets, and jupyterhub orgs). Not sure how feasible this is. So other than that we now inline-namespace org names instead of having them be orgs, I agree with the names.

having it as a seperate repo [... allows] us to more freely include third party dependencies

I'm not sure why we want to include third party dependencies. One of the clear strengths of lumino is its non-exposure to leftpad. I'm also not sure why changing the repo should change the philosophy w.r.t. this.

Speaking of the client/server setup, I would argue strongly for keeping any and all Python code out of the lumino repos. I would also argue strongly for not requiring Node to run the jupyterlab server. We can discuss this to great lengths in a separate thread though.

Final note: For attracting more contributors, I think the main barrier is access to good documentation (beyond just API docs, e.g. examples on use, tutorials, architecture overview, documenting how our variant of the CRDT algorithm works). While structuring things separately might gives some advantages, I would desperately prioritize time on writing docs and examples. Such efforts also tend to highlight any pain points in the API, so it would be good to start this sooner than later.

saulshanabrook commented 4 years ago

I'm not sure why we want to include third party dependencies. One of the clear strengths of lumino is its non-exposure to leftpad. I'm also not sure why changing the repo should change the philosophy w.r.t. this.

For example, if we add integration of the datastore with react components or RXJS observables, then these become dependencies.

Speaking of the client/server setup, I would argue strongly for keeping any and all Python code out of the lumino repos. I would also argue strongly for not requiring Node to run the jupyterlab server. We can discuss this to great lengths in a separate thread though.

So maybe we call it not lumino-datastore then but my-fun-RTc-clientside-data-thing-name and it depends on @lumino/datastore which still lives in lumino.

I think it would be nice for new users, coming to whatever the repo is, to be able to use the tools to build their own web app that is RTC. And to do that, they need some sort of server which handles relaying patches.

saulshanabrook commented 4 years ago

Notes from meeting with @blink1073 and @vidartf

to figure out:
- Should we use lumino datastore or another CRDT library? like https://github.com/automerge/automerge
  - Two existing lumino datastore issues
    - unpaired surrogates unicode
    - paste issues (unicode ID lengths)
      - maybe not production blocker
  - Reason we created our own is because all existing libraries didnt have features and scalability
    - Features:
      - Mutable blob of text
      - Type safety
      - scalability of amount of clients
    - Maybe brian can give us more things we are looking for
    - We could make a profiling suite to simulate inputs and see how long things take.
- Where to put this code?
  - jupyterlab/rtc-incubator monorepo where we can put all of these parts
  - Definately seperate JS package for lumino that doesn't require Python or Jupyter
- Should we have a server side datastore client? And if so, should it be in Node or in Python?
  - Patch server is just networking, doesn't need to know about patches.
  - Should we require node?
    - A number of LSP servers use node.
    - It's not node that is causing our issues, but yarn and webpack.
Tasks/questions
- Review other libraries
  - Profiling for lumino datastore, so we can compare against other implementations
  - Should we spend time reviewing another library or getting ours working?
  - If we want to attract non JupyterLab users, we need to come up with a documentation suite and full examples.
    - And we will have larger support burden.
  - We could make a virtual layer to switch between lumino datastore and automerge.
- Can we require node to be installed?
  - Can we write server side client so that we can use it in browser as well.
  - Write it first for server side then see if there is user demand.
- Should we use another protocol mechanism instead of building our own on websockets and http.
  - RPC / pubsub
  - Replicated logs:
    - https://github.com/automerge/hypermerge (https://github.com/mafintosh/hypercore)
Potential next steps to discuss tomorrow:
- Create new repo
- Create issues on repo to correspond with next tasks
- Figure out availability

bollwyvl commented 4 years ago

Exciting!

I may have to (somewhat jokingly) take exception to a websocket server being a hard requirement in the first place. Over Thanksgiving, this hacked itself together:

https://github.com/deathbeds/jupyterlab-dat

Obviously very wip, but yeah, it pretty much does the thing: a reasonably usable notebook pub/multisub and ephemeral chat built on dat that likely could integrate into jyve and be served from GitHub pages... Or dat itself.

Alice publishes the live state of her notebook to Bob by sending her public key, Bob subscribes, they find each other in the swarm and a naive stream of nbexplode files are passed around. If Bob then reverses the process (potentially through the in-lab chat), they can copy cells back and forth between the two notebooks.

Eve can discover a derivative of the public key (discovery key), and can therefore prove that A/B were talking about... Something... At some velocity and volume... But can't determine the value.

nothing comm-based works, yet, but other mime renderers work. It can't do multi-client editing, but I think that's just one gnarly webpack away with hypermerge (by the automerge folks, who blessedly work in ts).

Ok, ok, so it does need a static file server, and usually a websocket server: It needs a peer discovery mechanism (I ship one with jupyter-server-proxy) but once connected, everything happens over webrtc.

The Dat protocol is also good at really big files, though likely not in the browser. Sadly, however, the non-node/web clients are somewhat neglected, so you'd be stuck shelling out and working with the file system in most kernels. However, the node-based tooling can be webpacked (a la jlpm) down to under 2mb.

P2p stuff aside, which would be inappropriate in a number of situations, if a novel server must be implemented, the reference server requirement being on node/v8 is fine, so long as

it can also be distributed as a single script
the specification is machine-readable enough that it can be implemented and tested for conformance in another language.

The high road, though, would be something that compiled to wasm, but that's a whole other kettle of fish.

Even our current yarn/webpack (if we were more authoritarian on bundle discipline) doesn't have to be that bad, it's the end-user npm.org connectivity that remains my biggest issue. I think if we could get to yarn pnp, it could be reasonable, as that model would be pip/conda resolvable: instead of a giant node_modules tree is indeterminable depth, we'd just be filling a flat directory of tarballs. Pika is also interesting, but probably not ready for prime time. But I haven't explored these options.

Looking forward to the developments!

vidartf commented 4 years ago

@bollwyvl I can't really tell if you are recommending something to be used for RTC, or just doing a tangential discussion.

saulshanabrook commented 4 years ago

We had another chat about this when @afshin and @jasongrout came back.

Jason said that we could start out by just having it on the client side, and having the server side state management solution there as well, so we don't need node on the server. It won't actually give RTC, but can serve as a base that then we could switch to a server based version after we implement it all client side.

saulshanabrook commented 4 years ago

@bollwyvl The dat stuff is cool, I have seen this project implement CRDT on top of it https://github.com/automerge/hypermerge

Another idea is to make the RTC backend pluggable, so you could use different transport protocols or algorithms if you want.

choldgraf commented 4 years ago

@bollwyvl @saulshanabrook FWIW I was chatting with one of the dat folks a while back about RTC. They thought it sounded quite interesting. Would it be helpful to make a connection? It'd been a few months but maybe they'd still be interested

vidartf commented 4 years ago

Could someone explain what dat is, and what problems it will solve? I'm not keeping up on the trends (:

saulshanabrook commented 4 years ago

@choldgraf I think so, but let's wait a little till we have the RTC repo set up and a better idea of how to integrate a dat backend with all our existing work.

Not an expert, but dat lets you like sync data between hosts basically, over different protocols. It's a bit like torrents? or ipfs? https://dat.foundation/

saulshanabrook commented 4 years ago

Notes from chat with on Jupyter call with @ellisonbg @vidartf @vidartf @jasongrout and others:

Should we use node?
- For now, let's experiment with this in the incubator repo. But we might wanna re-write in Python later once we have it all working.
Design goals for Lumino datastore (when thinking about other options)
- RTC
- Data models that we have in JL
  - notebook-ey like things
- Extensible with new data models
- Schema/strong type based approach
Should we use automerge?
- It is still pretty exploratory
- How can we make ID generation more flexible to support different use case? Same sort of questions come up in automerge.
Where should we put code?
- We can always move them around lets be flexible for now.
- Separating out lumino datastore RTC library into own repo might have marketing benefit. If
datastore bugs
- unpaired surrogates
- on serialization, encode as base64, then decode on de serialization or use Ian's custom serialization.
- Patch size issue

saulshanabrook commented 4 years ago

I have created a new repo on this org for our RTC work: https://github.com/jupyterlab/rtc. Please create issues on that repo for all new RTC discussions. At some point, we might wanna move parts of it into other repos once the work hast stabilized.

jupyterlab / frontends-team-compass

Real Time Collaboration Plan #30