janmonschke / diffsync

Enables real-time collaborative editing of arbitrary JSON objects
MIT License

diffsync should be stateless #2

Open janmonschke opened 9 years ago

janmonschke commented 9 years ago

At the moment, diffsync stores all user documents that are necessary for the sync-cycle in memory. This has two implications:

  1. Applications with many peers consume a lot of memory and will eventually run out of it.
  2. Scaling diffsync to more than one node is pretty much impossible unless there is application logic in place that routes peers working on the same document onto the same node. Depending on the application's architecture, this is quite hard to achieve. Also, holding this state in memory violates the best practices of many hosting platforms, such as Heroku.

In my opinion, the users' documents should be kept outside of the diffsync node, e.g. in a Redis instance.

Luckily, diffsync already reads data internally via an asynchronous interface, so the code changes should be pretty minimal. Testing should not be too hard either.
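
As a rough illustration of why the change could stay small, here is a minimal sketch of a callback-based adapter. The class name `MapDataAdapter` is hypothetical, and the `getData`/`storeData` method names are an assumption about the dataAdapter interface described in the diffsync README (check it before relying on them). A Redis-backed version would presumably keep the same shape and swap the `Map` for network calls.

```javascript
// Hypothetical adapter (not diffsync's own code): documents live in a Map,
// but callers only ever see the callback-based interface.
class MapDataAdapter {
  constructor() {
    this.documents = new Map();
  }

  // Hands the document for `id` to the callback. Assumption: an unknown id
  // yields a fresh empty object rather than an error.
  getData(id, callback) {
    if (!this.documents.has(id)) {
      this.documents.set(id, {});
    }
    callback(null, this.documents.get(id));
  }

  // Saves `data` under `id`, then signals completion through the callback.
  storeData(id, data, callback) {
    this.documents.set(id, data);
    callback(null);
  }
}

const adapter = new MapDataAdapter();
adapter.storeData('todo-list', { items: ['write tests'] }, (storeErr) => {
  if (storeErr) throw storeErr;
  adapter.getData('todo-list', (getErr, doc) => {
    if (getErr) throw getErr;
    console.log(doc.items.length); // prints 1
  });
});
```

Because callers never touch the `Map` directly, the storage behind the two methods can change without touching the sync logic. Note that this sketch invokes its callbacks synchronously, whereas a real external store would invoke them later.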

The main implication for the user would be a longer sync time, which depends on the type of data store and on how the application's parts (node, intermediate data and permanent data) are distributed. I guess this overhead is acceptable in exchange for getting rid of the extra state.

What do you think?

seidtgeist commented 9 years ago

Is this about something other than https://github.com/janmonschke/diffsync#dataadapter?

My thoughts:

  1. diffsync should always ship with an in-memory implementation by default so people can play with it
  2. Could multiple diffsync servers let users work on the same document if there's a shared representation between servers? What then is the difference between clients and servers?

janmonschke commented 9 years ago

My thoughts about your thoughts: :thought_balloon: :cyclone: :thought_balloon:

  1. Yes, it's about more than that. It is about how the internal sync documents are handled. But yeah, the API is basically the same, with the addition of a removeData method that is used when a client disconnects and its shadow documents are no longer needed.
  2. Yes, the in-memory implementation would still be in there for exactly the reason that you mentioned. Much like how Express handles sessions.
  3. Indeed, multiple diffsync servers could handle the clients for the same document, and it is one step further towards allowing clients to behave exactly like servers.
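
To make the removeData idea from point 1 concrete, here is a minimal sketch of a store for per-client shadow documents. Everything except the name `removeData` is an assumption for illustration: the `ShadowStore` class, the `(roomId, clientId)` keying and the callback signatures are hypothetical and not taken from diffsync.

```javascript
// Hypothetical shadow-document store, keyed by room and client.
class ShadowStore {
  constructor() {
    this.shadows = new Map();
  }

  // JSON-encode the pair so ids containing separators cannot collide.
  key(roomId, clientId) {
    return JSON.stringify([roomId, clientId]);
  }

  storeData(roomId, clientId, shadow, callback) {
    this.shadows.set(this.key(roomId, clientId), shadow);
    callback(null);
  }

  getData(roomId, clientId, callback) {
    callback(null, this.shadows.get(this.key(roomId, clientId)));
  }

  // Drops the shadow of a disconnected client; reports whether one existed.
  removeData(roomId, clientId, callback) {
    const existed = this.shadows.delete(this.key(roomId, clientId));
    callback(null, existed);
  }
}

const shadowStore = new ShadowStore();
shadowStore.storeData('room-1', 'alice', { text: 'draft' }, () => {});
shadowStore.storeData('room-1', 'bob', { text: 'draft' }, () => {});

// Simulated disconnect of alice: only her shadow is discarded.
shadowStore.removeData('room-1', 'alice', (err, existed) => {
  if (err) throw err;
  console.log(existed, shadowStore.shadows.size); // prints true 1
});
```

With an external store, cleaning up on disconnect matters more than it does in memory, since stale shadow documents would otherwise outlive the node process.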

janmonschke commented 9 years ago

Ouff, just saw that I did not implement the fetching of client shadow documents asynchronously :smile: But I should have done it like that in the first place ;)

janmonschke commented 9 years ago

Oh no, running into a bigger problem here. Let's say that clients can reside on arbitrary nodes and those nodes take care of fetching the correct master document and the correct shadow documents for each client.

This leads to the problem that for each sync request, each node has to make up to four DB requests:

  1. Get the master document
  2. Get the shadow documents
  3. Write the shadow documents
  4. Write the new master document

These four requests could be reduced to two if the shadow documents were embedded inside their master documents (which would be easy with schema-free databases).
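
The request counts can be checked with a small sketch. It assumes a hypothetical synchronous `CountingStore` in place of a real database, and `applyEdit` is a placeholder for the actual diff/patch step; neither comes from diffsync.

```javascript
// Hypothetical store that counts every read and write as one DB request.
class CountingStore {
  constructor() {
    this.records = new Map();
    this.requests = 0;
  }
  read(key) {
    this.requests += 1;
    return this.records.get(key);
  }
  write(key, value) {
    this.requests += 1;
    this.records.set(key, value);
  }
}

// Shadows stored next to the master: four requests per sync cycle.
function syncWithSeparateShadows(store, docId, clientId, apply) {
  const master = store.read(`master:${docId}`);
  const shadows = store.read(`shadows:${docId}`);
  const result = apply(master, shadows[clientId]);
  store.write(`shadows:${docId}`, { ...shadows, [clientId]: result.shadow });
  store.write(`master:${docId}`, result.master);
}

// Shadows embedded in the master record: one read and one write.
function syncWithEmbeddedShadows(store, docId, clientId, apply) {
  const record = store.read(`doc:${docId}`);
  const result = apply(record.master, record.shadows[clientId]);
  store.write(`doc:${docId}`, {
    master: result.master,
    shadows: { ...record.shadows, [clientId]: result.shadow },
  });
}

// Placeholder for the real diff/patch step.
const applyEdit = (master, shadow) => ({
  master: { ...master, title: 'Hello' },
  shadow: { ...shadow, title: 'Hello' },
});

const separate = new CountingStore();
separate.records.set('master:42', { title: '' });
separate.records.set('shadows:42', { alice: { title: '' } });
syncWithSeparateShadows(separate, '42', 'alice', applyEdit);

const embedded = new CountingStore();
embedded.records.set('doc:42', { master: { title: '' }, shadows: { alice: { title: '' } } });
syncWithEmbeddedShadows(embedded, '42', 'alice', applyEdit);

console.log(separate.requests, embedded.requests); // prints 4 2
```

The trade-off of embedding is that every sync cycle reads and rewrites the shadows of all clients on that document, not just the one that is syncing.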

But the biggest problem is that the database would have to lock the document from step 1 through step 4 and could only release it afterwards. Otherwise, other nodes could write to the same document in the meantime, which would result in dirty reads and lost data when writing.
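
The data-loss case can be reproduced in a few lines. This sketch uses a plain `Map` as a stand-in for the shared database and two hypothetical nodes that both read the document before either one writes it back.

```javascript
// A Map stands in for the shared database (assumption for illustration).
const database = new Map([['doc', { editsFromA: 0, editsFromB: 0 }]]);

// Step 1 on both nodes: each reads the same version of the master document.
const seenByNodeA = { ...database.get('doc') };
const seenByNodeB = { ...database.get('doc') };

// Node A applies its client's edit and writes the document back.
seenByNodeA.editsFromA = 1;
database.set('doc', seenByNodeA);

// Node B does the same, but its copy predates node A's write.
seenByNodeB.editsFromB = 1;
database.set('doc', seenByNodeB);

// Node A's edit has been overwritten.
console.log(database.get('doc')); // prints { editsFromA: 0, editsFromB: 1 }
```

Without a lock or some form of version check, the last writer silently wins.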

Are my assumptions right? Or am I overlooking a very simple way to scale this to more than one node without a load balancer that gathers clients working on the same document onto the same node?

@episodeyang How did you handle this problem?

winton commented 9 years ago

It seems like Redis would be a good fit for storing the master and shadow, maybe paired with node-redlock.
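
To show the locking pattern without requiring a running Redis, here is a minimal in-process sketch. It does not use the node-redlock API: `KeyedLock` is a hypothetical stand-in that only serializes work within one process, and a multi-node setup would need a distributed lock in its place.

```javascript
// Hypothetical per-key lock: tasks for the same key run one after another.
class KeyedLock {
  constructor() {
    this.tails = new Map();
  }

  // Runs `task` once every earlier task for `key` has settled.
  run(key, task) {
    const previous = this.tails.get(key) || Promise.resolve();
    const current = previous.then(() => task());
    // Swallow rejections on the stored tail so one failure does not block the queue.
    this.tails.set(key, current.catch(() => {}));
    return current;
  }
}

const store = new Map([['doc', { edits: [] }]]);
const lock = new KeyedLock();
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// The whole read-modify-write cycle happens while the lock is held.
function syncCycle(nodeName, workMs) {
  return lock.run('doc', async () => {
    const master = store.get('doc');
    await pause(workMs); // simulated diff/patch work
    store.set('doc', { edits: [...master.edits, nodeName] });
  });
}

// node-b is faster, but it waits for node-a, so neither edit is lost.
const finished = Promise.all([syncCycle('node-a', 20), syncCycle('node-b', 5)]);
finished.then(() => console.log(store.get('doc').edits)); // prints [ 'node-a', 'node-b' ]
```

The cost is the one raised earlier in the thread: every sync cycle on a document waits for the previous one to finish, and with a distributed lock that wait includes network round trips.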

Another alternative is to have the servers share the objects directly with each other. It seemed that Neil Fraser was more intrigued by this idea in his talk.

@janmonschke I'm interested in working on this problem so please let me know your thoughts on those two options.

janmonschke commented 9 years ago

@winton thanks for your input :)

I definitely prefer option #2, which would still not make it stateless, but I also think this is the way Neil Fraser was advocating. Happy to follow your work on that!