Benchmarks and comparison to SyncedStore discussion

bogobogo commented 2 years ago

Hey @sep2, Thanks for making immer-yjs!

I have been thinking about making something similar and I am glad someone is already doing that :) (in my ideal world, one could even swap yjs with fluid framework or whatever, but thats for another time)

I have been playing with SyncedStore and I wonder how the two solutions compare in developer experience and performance. Some of my initial thoughts:

Developer Experience Advantages

immer-yjs has a plain object snapshot and translates immer patches to yjs, while syncedstore uses yjs as the store, and creates proxies around them. This means that reading the store is easy, no annoying proxies and cloning issues.
immer-yjs is also opinionated in the sense that an array set calls the yjs splice method by default - which might be not ideal in a collaborative setting. See my discussion here for reference. In any case, that is something that could potentially be configured in the future.
Wrapping everything around a produce plays nicely with transactions, which could be useful with the yjs support for transaction based undo/redo.

Sharing Granularity

The main issue I currently see with the "magic solution" is the crdt granularity. In most cases, you don't want the whole nested object to be composed of nested crdts, as this might create undesirable states. This is solved in SyncedStore with the Boxed values. This works but kind of annoying - you need to constantly wrap objects with a Box and then call .value on everything.

The ideal solution, in my opinion, is defining a schema and then updating the crdt's based on the schema. Are you open to a PR that adds support for something like that?

Performance

How does immer patches and maintaining the snapshot compare to the proxy based solution of synced store? I wonder what would be a good test for that.

Would love to hear your opinion about this, and if you are accepting PR's

sep2 commented 2 years ago

Hi @bogobogo! Thanks for your detailed feedback! 👍

I was too thinking of using fluid framework a while ago, but last time I saw their communication protocol is much more complicated than yjs. I agreed we should stick with raw js API interfaces whenever possible as it will generally last longer than a specific crdt implementation.

I have known about SyncedStore (which is a great project anyway). The main reason I created this library is because I have not much confidence in the mutation-based API used in large react projects. From the experience of using valtio (a mutation-based state library) from my previous project, everything is great and clean at the beginning when the project is small, but once I have to pass down the sub-state proxies to sub-compoents deep in the tree, I quickly get lost what is mutable and what isn't. Also mutation-based API cannot guarantee the atomicity of changes (in valtio it automatically batch updates in a black box way, which may lead to conflicts with react rendering, I am not sure what SyncedStore does regarding this). I feel like mutations should be controlled in a manageable maner, in the corner where they are really needed and stay just there, not by default be exposed to the outer world, at least in the react land this should be more compatible.

About the array set operation, that's a good catch 👍. I don't really want to ban the usage of replacing array element by index as it is a common operation. Configuration is a doable option, another option will be to print a warning message in dev environment when it detects such a usage (but that's annoying too). Another reason I want to allow this even it has strange merge behavior is that I don't think conflict can be avoided in the state API level anyway. E.g. we could easily have stale state just using insert & delete by index, if other users are also inserting/deleting by the same index concurrently. I'm not sure, the best way would probably be solving it in yjs itself.

Regarding the sharing granularity, I understand what you mean, I am just not sure what the API would be like, and how the nested object would be stored in yjs if it is not itself a Y.AbstractType. For example if I store a plain object const obj = {"k":"v"} by map.set("obj", obj), would that be possible? Will y.js serialize this plain object and deserialize it back in the other client side automatically? Or do I have to send the schema along with the object itself to the other side as well?

I feel like this could be solved by a hook API, which allows the user to hook into the YEvents & immer patches handling process and do whatever they want. Then if needed, a schema API could be implemented on top of the hook API, providing sensible defaults for untracked nested object type, Y.Text, Y.XmlElement, etc,. I am absolutely open to PR, but I would like to discuss the API first, what do you think?

About performance, the mutation side should be in par with immer itself, plus an extra round of iterating the patches generated by immer. The reading side should be in par with plain object, since the snapshot is just a plain object after all. But that's just my rough guesses, we definitely need tests on that as well.

I am glad to hear your feedback!

bogobogo commented 2 years ago

Thanks @sep2 for the quick and detailed response.

Mutability and Immutability

I see what you are saying regarding mutability and react. I have personally had positive experiences with mobx based state solutions. That being said, It does lead to confusions. Especially in hook dependencies, where sometimes two objects are considered the same even though in an immutable solution they won't.

Array

Regarding array set, I totally agree with you that this should be solved in yjs.

Schema and Granularity

For example if I store a plain object const obj = {"k":"v"} by map.set("obj", obj), would that be possible?

The answer is yes. While yjs supports nesting CRDT's, you don't have to. you can have a simple yjs map at the root and huge plain nested objects as values (and in some cases that would actually make sense).

API Discussion

I agree that starting with a hook makes sense. Let's take a simple example

const todosById = { 
     someTodoId: {
           title: "groceries",
           completed: false
      }
 }

We have a map of todos. Each todo is itself a map. Let's say we want todosById to be a shared map - users should be able to add todos concurrently. However, if a user changes a todo title, and another user changes the completed status of the same todo at the same time - only one of the changes should occur, and the last write should win. This makes sense UX wise, since the completion status change might be irrelevant after the name change.

So todosById should be a shared map, but each todo should be a plain object. In the current solution, both will be shared, which may lead to bad UX (and in some cases, corrupt states).

API Ideas

Both API suggestions are a rough sketch, and will probably need to be ironed out, but this is the basic idea.

Option 1

The simplest idea is to add a configuration object to bind where we will be able to pass hooks like so:
bind(initialDoc, { applyPatch: (target, patch) => boolean } 
applyPatch will return true if it handled the patch, and false if it hasn't. In that case immer-yjs will handle the patch on its own. The developer can then to try and figure out if the path is relevant, and then update yjs as he wants.

While this will enable what we want, this feels like a feature that is at the core of the idea behind immer-yjs. It will also be quite complex to implement for the developer.

Option 2

Define schema - figure out where each change falls on the schema, and then update based on that. This is my dream solution as a developer.

import bind, {SharedMap, SharedArray} from 'immer-yjs'
const schema = {
  todosById: new SharedMap(),
  todosOrder: new SharedArray()
}
bind(initialDoc, schema}

Both SharedMap and SharedArray could get a different schema as param for nesting.

And that's it. immer-yjs will figure out what is the relevant root CRDT for each patch, and update that behind the scenes.

I believe this can be achieved using the path property provided by the immer patch. It adds complexity to immer-yjs, but makes it a complete solution for real world needs.

Would love to hear your thoughts!

sep2 commented 2 years ago

Update 2: After reconsidered the problem, I am convinced that Option 2 should be the way to go.

Update 1: Option 1 is already implemented with a slightly variated version, please see readme for detail.

Thanks for the clarification of storing plain object in y.js.

API Discussion

It seems to me that option 1 would be trivially enough to implement. Once this is implemented, the option 2 could be implemented as follow:

// with-schema.ts
function withSchema(schema, options) {
    const applyPatch = (target, patch) => {
        // apply the patch with schema
    }

    return {
        ...options,
        applyPatch
    }
}

// usage:
bind(initialDoc, withSchema(/* schema */))

The benefits of this approach

Separating the implementation of bind and withSchema, so they could be independently implemented & tested.
Since yjs doc is inheritedly a schemaless data structure, one could use bind(initialDoc) (omit the second parameter) without worry about schema if they don't want to, which leads to a simpler default API.
1. Schemaless means that in update(fn) one can arbitrarily modify the state (though you probably don't want to as it messes up typescript).
2. One can't guarantee what will come from Y.applyUpdate(doc, update) over the wire, others could potentially mess up the schema too. For example two users running different versions of the same app, which means two different version of schemas will interfere with each other.
One could monkey patch the options object returned from withSchema and enhance it in some ways in the future.
The API is basically same as the original proposal.

Schema format

About the schema format, I would propose a rule-out schema:

const schema = [
    'todosById.*'
    'someKey.nestedKey.otherkey'
]

bind(initialDoc, withSchema(schema))

Each item in the array is a path to the y.js data where the leaf node should be ruled out to convert to Y.Map / Y.Array.
* means a path segment of one level of arbitrary key in object or arbitrary element in array.

The pros

With an empty schema, the default behavior is preserved.
Developers don't have to duplicate the typescript definition in javascript.
The schema itself is serializable so it can be easily integrated into the communication protocol.
Allow Y.XmlElement / Y.Text to be ignored in the same way without introducing extra wrapper classes.
Can we use jsonpath?

The cons

Unspecified paths are by default all converted to Y.Map / Y.Array.
Cannot have a path segment named *. (this is a special character treated differently)
Not trivial to implement.

This is just my initial thought, it contains obvious flaws.

Conclusion

I would like to add option 1 to the next release, this should unblock the impementation of a withSchema() in userland. (The explict schema or the rule-out schema should both be possible)

I'm sorry due to the complexity of schema parsing & validating & testing & maintainance burden, I probaby won't add this function to the library itself unless there are more users requesting it so that it outweighs the efforts. That being said, later I (or others) could probably share a sample implementation of withSchema() in here so everyone could just copy & paste to try it. I will update the readme to add a section about schema linked to here. Let's see if more people would agree on the usage and stablize the API so it can be integrated without much efforts.

Will this solve your needs? Please don't hesitate to share comments.

sep2 commented 2 years ago

@bogobogo After reconsidered the problem, I am convinced that Option 2 should be the way to go. If you would like to submit a PR to implement Option 2, I will be happy to accept it. My only concern is that the schema should be serializable, so it can be easily integrated into the sync protocol.

Option 1 is too lower-level that developers should not concern about, it should probably be removed from the API. And the rule-out schema proposal I posted above is way more counterintuitive to use.

bogobogo commented 2 years ago

Hey @sep2, Thanks for the detailed answer!

I will try to implement Option 2 at some point in the future but it might take some time since I have quite a lot on my plate right now.

I agree with making schemas serialisable, and I don't think it will be an issue. An explicit schema as I defined above could be very easily described in a json like object. One option is that the Map and Array constructs I used above can be simple function helpers that create this json.

In any case, I believe there is an opportunity here to create a solid, production grade solution.

neftaly commented 2 years ago

A third option might be to do data marshalling, ie embedding the type information in the immer data. For example DynamoDB has a special "DynamoDB JSON" which describes Maps, Lists, Strings, Numbers etc. {"a":"b","c":[1,2,3]} converts to

{
  "a": {
    "S": "b"
  },
  "c": {
    "L": [
      {
        "N": "1"
      },
      {
        "N": "2"
      },
      {
        "N": "3"
      }
    ]
  }
}

The pros would be that it's schemaless, cons that drilling down into an object is a bit more verbose (eg object.c.L[0].N vs object.c[0])

edit: This seems to make adding optional schemas even easier too. It should solve #4 out of the box. I will continue down this route, here is my WIP: https://codesandbox.io/s/immer-yjs-marshalled-t4bids

neftaly commented 2 years ago

For the second option, would it be best to use a subset of JSON schema to describe the data shape? It would be great to be able to use existing data valadation libs and typescript-to-json-schema tools.

data = {
  "plainStringField": "I'm a plain string",
  "sharedTextField": "I'm a CRDT string"
}

schema = {
  "type": "object",
  "properties": {
    "plainStringField": { "type": "string" },
    "sharedTextField": { "yType": "YText" }
  }
}

edit: This seems to work OK with some tweaking, I have a WIP test of schema parsing here: https://codesandbox.io/s/immer-yjs-schema-demo-oblgnh

neftaly commented 2 years ago

I have done a bit of work on a marshalled data fork and now both reading and writing work properly, note yarray/ytext/yxml aren't implemented yet. https://codesandbox.io/s/immer-yjs-marshalled-2way-rxsh7r

Since type information is now being stored in immer, it should be possible to automatically generate a schema (as long as type information is provided when data is first added). I am excited about this!

sep2 commented 2 years ago

This is so cool! Sorry, I'm too busy with other things to catch up on this issue. In fact, I encourage forking this library and modifying it as needed, since this is really a small codebase.

neftaly commented 1 year ago

~~Just an update, I am continuing development at https://github.com/probability-nz/y-immutable~~ I've "crossed the fence" and am now working exclusively on getting Automerge to feature parity with Yjs :)

sep2 / immer-yjs