How to implement validation / disaster recovery?

georeith commented 3 years ago

The problem I am facing

I am noticing I sometimes get corrupted documents where certain properties are missing, this is a hard error and results in total loss of work. To combat this I would like to validate my updates being received by clients and optionally do nothing (and tell the sending client to reset their state).

While I would like to fix the underlying issue and not have documents corrupted ever I also would like some resiliency to this as it is somewhat inevitable that bugs like this find their way in.

Given I have a method for validating the JSON representation of my document this is how the flow would work in YJS:

1. Turn GC off on document
1. Take snapshot of document
1. Apply update
1. Validate JSON representation of document
  - 1. If invalid, revert document to snapshot
1. Turn GC back on (sidenote: is there some way to kick off a GC check now as this looks like it may never GC as its always off when the update is applied?)

The solution I would like

I can't see any way of fitting this flow into @hocuspocus/server at the moment as there is no method of listening in to both before and after an update is applied or method of preventing an update being sent to other clients.

Something that does the above would be great but I am open to suggestions on alternatives. The issue I can see with snapshots is I think currently the only methods available are to create a new document out of that snapshot. Which would require replacing the in memory document and telling connected clients to reset their state 🤔

Alternatives I have considered

The above assumes validation on every change, which I don't think should be a problem with a precompiled validation function using something like https://github.com/ajv-validator/ajv. However an alternative flow is to do validation when persisting to long term storage and resetting the server and client state to the last valid one from long term storage when an invalid document is received.

This would look something like:

1. Debounced validation in onChange
  - 1. If validation fails, reset the document and force open clients to do the same.

Currently I am preventing corrupted documents being saved into long term storage but am lacking a way of force restarting the server and currently connected clients.

ViktorQvarfordt commented 2 years ago

I've been thinking about this as well, I was just about to open an issue but then I found this, so I add my notes here to continue the conversation.

The problem I am facing I want to validate changes. For example, maintain valid reference ids or constrain numbers to certain values.

The solution I would like Introduce onValidate hook that runs before onChange. If an error is thrown in that hook, the change is rejected and the client that sent the operation receives a revert operation.

This can be achieved by, before calling onValidate, creating a copy of the document, applying the operation and passing the new version of the copied document to onValidate. If the hook does not throw, then the operation is applied to the real document.

Alternatives I have considered Dealing with invalid state in the application code. For example, normalizing the data such that invalid references are removed or numbers are clamped into valid ranges.

Additional context The validation code used in onValidate should be used on the client before submitting an operation, to avoid that user temporarily being in an invalid state until hocuspocus sends back a revert operation (that is, if the operation yielded an invalid state). Still, the validation needs to be on the backend because one can never trust a client.

ViktorQvarfordt commented 2 years ago

The whole thing with sending a "revert operation" may be (unnecessarily) complex. It would suffice to simply terminate the client connection. The client should be informed of the reason for disconnect so the client can handle the error gracefully (perhaps clearing the local state and reconnecting). This approach guarantees that no invalid operations get transmitted to other clients.

Taking this line of reasoning further: Validating on each change may be very costly if changes come often and the document is large. Instead, we are considering validating just before we persist the state. This operations is reasonably debounced. If a validation happens here the entire ydoc is reverted to the last saved state. This approach allows invalid operations to be transmitted to clients for a short period of time (until the backend validation kicks in and resets the ydoc). This case should be thought of as an edge case that doesn't really need to be optimized further, since one can run the same validation code as the backend on every client 1) before sending data and 2) after receiving data. (Clients may run on different versions, so it is good practice to validate at every step.)

georeith commented 2 years ago

@ViktorQvarfordt agreed around the revert operation I have actually implemented some of this since and have reached the same conclusion that it is not necessary.

I have a custom Redis adapter that wraps y-redis and disables the automatic persistence when the document updates. Instead I build up a queue of updates using async#cargo(), each time it finishes a write it takes the current queue, validates the document, if it passes it merges the updates together and persists them, if it fails it does not allow it into Redis.

Long term storage just validates the document and stores a snapshot of it on a larger interval.

On the client I am using the same validation function but not on each change which is far too expensive for my application (have a target of 60FPS alongside a constant stream of changes) but I do validate before loading out of indexeddb (and if its invalid I discard it and load from the server) or if the application crashes (to detect whether the application crashed because of a corrupt document) in which case I reload the application with the state from the server (after a small delay to let the server drop its connection).

Currently I am not forcefully closing the connections, but that is a next step, if one client errors they all error and the connection closes automatically so am relying on that in the short term.

Problems I'm still facing:

Validation is too expensive to do on every change, at least for me my documents can grow big, and even a 10ms validation can create a bottleneck (multiple users working in realtime). This presents the issue that it isn't possible to validate changes before sending them out to other users without lagging realtime.
Even if your validation is not too expensive to do on every update, there isn't a way of doing it currently before sending it out to other users. As far as I can tell this would require keeping the document twice in memory, loading it into a staging document first, validating that and if it passes loading the update into the main document.

Dealing with invalid state in the application code. For example, normalizing the data such that invalid references are removed or numbers are clamped into valid ranges.

Just a note on this but we have found most of our corruptions to be coming from bugs in Yjs or layers on top and not to do with the data the client itself set. I think even if you implement this you're still susceptible these corruptions.

ViktorQvarfordt commented 2 years ago

Nice, thanks!

Quite worrying to hear about the bugs in Yjs and the layers on top. What have you had the most problems with?

With regards to "validation by normalization", the use case I'm planning for is to have yDoc → toJSON() "raw js object" → normalize() "normalized js object", for the application to consume. And client updates happen similarly: js object → clever recursive patching of the yDoc. In this way we have part of our Redux state entirely real-time synced with very little additional code. Planning to share this code when we've stabilized a little bit, if you're curious to see more of it.

georeith commented 2 years ago

@ViktorQvarfordt currently UndoManager is the biggest cause of issues for me: https://github.com/yjs/yjs/issues/317

But also getting some corruption issues from latest versions of HocusPocus (I'm going to open a ticket about them shortly but they are quite hard to debug).

Sounds interesting would be interested to see it (your state management).

We actually moved from Redux to Mobx (because it works better with mutability but is also a lot faster which matters a lot to our application) and have a Proxy layer on-top of YJS that lets us edit it like a regular JS object (but shares the same memory as the YJS document so no increase in size) and then Mobx bindings that detect reads from the Proxy layer and writes using YJS observe functions, I intend to share that Proxy layer and Mobx binding once we've released too but its working very well for us and is most importantly very fast.

It's actually more of a hybrid approach as we're still using our redux reducers (they just mutate the store instead of doing immutable updates) and store (because I haven't wanted to switch them all over yet) just not using react-redux or any subscriptions on redux itself.

mjamesderocher commented 2 years ago

@georeith I'd be very interested in the proxy layer when you release it!

georeith commented 2 years ago

@mjamesderocher will let you know when it comes out. FYI we aren't using Y.Text only Y.Map and Y.Array but I don't think it would be too hard to extend it to Y.Text if you are using that, it just has a less obvious representation in a JSON like object.

hanspagel commented 2 years ago

@georeith A lot of exciting stuff! Feel free to create the data corruption issue, even if you don’t find time to debug it in detail. Happy to look into that! Got a probably related issue in my inbox already, maybe it’ll help to have more parts of the puzzle.

YousefED commented 1 year ago

@mjamesderocher just stumbled on this thread. I released a similar proxy layer here: https://syncedstore.org/ (specifically in this package: https://github.com/YousefED/SyncedStore/tree/main/packages/yjs-reactive-bindings)

kongweiying2 commented 1 year ago

@YousefED I'm getting this issue as well. This has happened quite often where I'm editing an Extension by say, changing it from a Mark to a Node. From what I understand, every time the editor is loaded, it will actually parse the content and modify it based off the state of the current extensions. Unfortunately, this seems to corrupt the document and there's no way to recover from this.

Have you managed to find a solution to invalid content like this?

janthurau commented 1 year ago

Hi all!

Just wanted to see if this is still current and what would be an expected behavior / feature of Hocuspocus. How exactly do documents get "corrupted"? Is this always connected with Tiptap, or do the yjs docs itself break?

If the issues are related to Tiptap, I'd assume that this has to be solved on the user-side (so if Tiptap<>Hocuspocus, by Tiptap (as the issue is the same if you just store the tiptap json a regular database), see also https://github.com/ueberdosis/tiptap/issues/3437.

We could support on Hocuspocus side by adding a feature that requires a specific client version before allowing a connection, but this is also something that can be easily implemented already using an onConnect or onAuthenticate hook.

LMK what you think, if this should be solved on user-side, I'll close this.

kongweiying2 commented 1 year ago

@janthurau

I don't think this should be solved user-side, it should be a function of the library (Hocuspocus and Collab). So for example, the two main problem areas are:

Preventing the document from getting into a bad state by checking before invalid changes are merged. This happens when two editors with the same node extension type, but different implementations of that same type, sync with each other. What should happen is that changes should be reverted to the previous valid state.
If the document has invalid data for nodes (because maybe the document is malformed, or the current extension for that node is incompatible with the data in the doc which was based off an older extension), then gracefully catching these errors and giving an interface to take that data and produce a new valid node. Currently, TipTap seems to eat up the old node irreversibly such that even if you revert to an older version of the extension, the node still doesn't show. So this is functionally the same as data corruption. Until there's a version history extension to be able to revert to older document states, your old data is essentially unaccessible (even though technically it's still there in the form of the binary deltas).

e.g.

data_t1 extension_t1 Everything works fine data_t1 extension_t2 Extension to handle this node was updated, doesn't handle the data at t1 data_t2 extension_t2 Extension incorrectly parses data_t1 into data_t2 and data_t2 is unrevertable

Note that the node associated with data_t1 seems to be permanently inaccessible, even if we revert to the extension at t1. I originally thought that TipTap would just skip over the invalid data but it doesn't seem to do that. It seems like it attempts to parse the data but ends up corrupting it and excluding it from the document entirely.

I don't believe this is a y.js problem. Rather, the syncing is perfect, but the issue is that TipTap doesn't handle the evolution of existing nodes gracefully. When TipTap parses these nodes, it'll do something dodgy and then perfectly propagate the corrupted version of the file through Collab.

timoisik commented 1 year ago

@kongweiying2, thanks for your thoughts on this.

What you describe mainly concerns your very individual editor setup with your custom nodes. Hocuspocus does not know how your editor setup looks like and therefore cannot check if it is corrupt.

If, then hocuspocus will only offer a general solution, as described above, an onValidate hook, or as described by Jan, a version field.

But as Jan also described, this could be caught already in the onConnect hook. An individual editor schema has to be validated client-side. Especially since Hocuspocus is not a tiptap editor-only collaboration backend. :)

Open for more suggestions.

julianpoy commented 2 months ago

Would love to see an onValidate hook added for this. I'd find it useful to be able to at least terminate connections if my own custom validation logic within Hocuspocus determines that a state update is invalid before persisting/sharing that update to other clients.

ueberdosis / hocuspocus

How to implement validation / disaster recovery? #135