About the actual storage size of documents during collaborative editing

ueberdosis / hocuspocus

The CRDT Yjs WebSocket backend for conflict-free real-time collaboration in your app.

https://tiptap.dev/docs/hocuspocus/introduction

MIT License

About the actual storage size of documents during collaborative editing #805

Closed HengCC closed 4 months ago

HengCC commented 5 months ago

In our current collaborative setup, documents are stored in the database as binary strings. For reasons I don't understand, the binary document keeps growing as the number of collaborators increases, even when the document content has not actually changed, and the growth is becoming uncontrollable. For example, for the document in the screenshot below, the original content is about 5KB as text. After many people edited it at the same time, the content was unchanged, but the binary data had grown to a staggering 15MB. This is obviously disastrous: the document takes longer to load, and the size keeps increasing. Is there any way to avoid this meaningless growth?

HengCC commented 5 months ago

I store and read like this:

read:

if (result.content && result.content !== "") {
  return Promise.resolve(Buffer.from(result.content, 'binary'));
} else {
  return Promise.resolve(null);
}

store:

content = state.toString("binary")
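
As a quick sanity check on the store and read snippets above, the sketch below round-trips every possible byte value through `toString('binary')` and `Buffer.from(..., 'binary')`. In Node.js `'binary'` is an alias for `'latin1'`, so the encoding step itself is lossless and should not inflate the data. Whether the database column preserves such a string byte for byte is an assumption that depends on its type and character set. The helper names `encodeForStorage` and `decodeFromStorage` are hypothetical, and base64 is shown only as one text-safe alternative.

```javascript
// Hypothetical helpers mirroring the store/read snippets above.
function encodeForStorage(state) {
  // 'binary' is an alias for 'latin1': each byte becomes one character.
  return Buffer.from(state).toString('binary');
}

function decodeFromStorage(content) {
  if (content && content !== '') {
    return Buffer.from(content, 'binary');
  }
  return null;
}

// Every possible byte value, to check that none is lost or altered.
const allBytes = Buffer.from(Array.from({ length: 256 }, (_, i) => i));
const stored = encodeForStorage(allBytes);
const restored = decodeFromStorage(stored);

console.log(stored.length);             // number of characters in the stored string
console.log(restored.equals(allBytes)); // whether the round trip is lossless

// Text-safe alternative (assumption: the column is a text type).
const asBase64 = allBytes.toString('base64');
console.log(Buffer.from(asBase64, 'base64').equals(allBytes));
```

If the round trip holds in your environment as well, the growth is more likely to come from the Yjs document itself than from this encoding step.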

janthurau commented 4 months ago

Hey @HengCC, the data is stored in a binary Yjs format, which is highly efficient and really fast. Yjs has to track the history of all changes made by every user, which is why the document naturally gets bigger over time. Without looking at your Yjs document or knowing exactly how you're making changes, it's impossible to know what is causing your huge document, but this definitely should not happen.

Have you maybe turned off garbage collection (https://docs.yjs.dev/api/y.doc)?

HengCC commented 4 months ago

@janthurau Thanks for your reply. I just logged the gc configuration. It defaults to true and I haven't changed it. There seems to be no good way to find out what is growing. Are there any tools that can analyze Yjs documents, so we can see what is taking up so much space?

nperez0111 commented 4 months ago

@HengCC, you can load it into the new Yjs Playground.

HengCC commented 4 months ago

@nperez0111 Thanks. I analyzed the stored data with this tool. As mentioned above, the actual content is not large, but the Yjs doc contains a huge number of clients, and they account for a surprising amount of data. I'm looking for ways to get rid of them.

huanghantao commented 4 months ago

Hello, is there any good way to deal with these clients?

HengCC commented 3 months ago

@huanghantao I don't have a good way to deal with it right now; you could ask in the Yjs community. I am trying one possible solution, though, which only works if you can afford to discard the history of these clients. In my scenario, all I really need is the final copy of the document. The collaboration process itself doesn't matter, and I just need to make regular backup copies. My plan is to construct a new ydoc before persisting and then merge the current state of the existing ydoc into it. However, I'm not sure whether the client information can be discarded this way. You can try it too.

georeith commented 3 months ago

Just weighing in on how I deal with this. We store both the Yjs CRDT and a JSON snapshot of the data at the tip of that.

After a certain period of inactivity we archive the Yjs CRDT, expiring it. The next time it's requested, we create a fresh CRDT from the JSON snapshot of the data (with no history).

We store a generation number in the CRDT that gets bumped every time it's recreated. The client sends this generation number when connecting. If their generation does not match the copy on the server (or there is none on the server because it expired), the client is told to discard their local copy of the CRDT and resync with the server starting from an empty document.

The downside is that anyone who has unsynced offline changes prior to the point of expiry will lose those changes. They must discard them, they cannot be synced. We think this is a fair trade off and is why we expire only after a long enough period of inactivity.
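
The generation check described above could be sketched roughly as follows. This is only an illustration of the decision logic, not georeith's actual implementation: the function names `resolveSync` and `recreateFromSnapshot`, the record shape, and the `'sync'` / `'reset'` actions are all hypothetical, and no Yjs or Hocuspocus API is involved.

```javascript
// Hypothetical decision helper for the generation check.
// serverGeneration is null when the CRDT on the server has expired.
function resolveSync(clientGeneration, serverGeneration) {
  if (serverGeneration === null) {
    // The server copy expired and will be rebuilt from the JSON snapshot,
    // so any local CRDT state on the client can no longer be merged.
    return { action: 'reset', reason: 'expired' };
  }
  if (clientGeneration !== serverGeneration) {
    // The CRDT was recreated since this client last synced.
    return { action: 'reset', reason: 'generation-mismatch' };
  }
  return { action: 'sync' };
}

// Hypothetical: rebuild a record from an archived one, bumping the generation.
// In a real system the snapshot would seed a fresh CRDT with no history.
function recreateFromSnapshot(archived) {
  return { generation: archived.generation + 1, snapshot: archived.snapshot };
}

const archived = { generation: 3, snapshot: { title: 'Draft' } };
const fresh = recreateFromSnapshot(archived);

console.log(resolveSync(3, 3).action);                // generations match
console.log(resolveSync(3, fresh.generation).action); // client predates the rebuild
console.log(resolveSync(3, null).action);             // server copy expired
```

A client that receives `'reset'` drops its local copy, which is where unsynced offline changes are lost, as noted above.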