Open janmonschke opened 9 years ago
Is this about something other than https://github.com/janmonschke/diffsync#dataadapter?
My thoughts:
My thoughts about your thoughts: :thought_balloon: :cyclone: :thought_balloon:
More than that. It is about how the internal sync documents are handled. But yeah, the API is basically the same, with the addition of a `removeData` method that is used for scenarios when a client disconnects and the shadow documents are not needed anymore.

Ouff, just saw that I did not implement the fetching of client shadow documents asynchronously :smile: But I should have done it like that in the first place ;)
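To make the adapter idea concrete, here is a minimal sketch of what a shadow-document adapter with a `removeData` method could look like. The class name, the key format and the exact callback signatures are assumptions for illustration (modeled on the callback-style `getData`/`storeData` of the DataAdapter in the README), not the actual diffsync internals.

```javascript
// Hypothetical adapter for client shadow documents; not diffsync's actual code.
class InMemoryShadowAdapter {
  constructor() {
    // Assumed key format: `${room}:${clientId}`
    this.shadows = new Map();
  }

  // Callback-style, so a database-backed adapter can expose the same shape.
  getData(id, callback) {
    callback(null, this.shadows.get(id));
  }

  storeData(id, data, callback) {
    this.shadows.set(id, data);
    callback(null);
  }

  // Called when a client disconnects and its shadow is no longer needed.
  removeData(id, callback) {
    this.shadows.delete(id);
    callback(null);
  }
}
```

Because every method takes a callback, swapping the `Map` for a real data store should not change the calling code, only the latency.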
Oh no, running into a bigger problem here. Let's say that clients can reside on arbitrary nodes and those nodes take care of fetching the correct master document and the correct shadow documents for each client.
This leads to the problem that for each sync request, each node has to make up to four DB requests:
These requests can be reduced to two requests if shadow documents were embedded inside their master documents (which would be easy for the case of schema-free databases).
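A rough sketch of the difference, assuming the four requests are "read master, read shadow, write master, write shadow". `CountingStore`, the key names and the document shape are all made up for illustration; the store only counts round trips and stands in for a schema-free database.

```javascript
const clone = (value) => JSON.parse(JSON.stringify(value));

// Hypothetical store that counts round trips.
class CountingStore {
  constructor() { this.docs = new Map(); this.requests = 0; }
  read(key) { this.requests += 1; return clone(this.docs.get(key)); }
  write(key, value) { this.requests += 1; this.docs.set(key, clone(value)); }
}

// Separate layout: master and shadow live under different keys (4 requests).
function syncSeparate(store, docId, clientId, apply) {
  const master = store.read(`master:${docId}`);
  const shadow = store.read(`shadow:${docId}:${clientId}`);
  apply(master, shadow);
  store.write(`master:${docId}`, master);
  store.write(`shadow:${docId}:${clientId}`, shadow);
}

// Embedded layout: shadows are stored inside the master document (2 requests).
function syncEmbedded(store, docId, clientId, apply) {
  const doc = store.read(`doc:${docId}`);
  apply(doc.master, doc.shadows[clientId]);
  store.write(`doc:${docId}`, doc);
}
```

The embedded layout halves the round trips, but the whole document (including every client's shadow) travels on each request, so it probably only pays off for small documents with few clients.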
But the biggest problem would be that the database has to lock the document from step 1 to step 4 and can only release it afterwards. Otherwise, other nodes could attempt to write to the same document in the meantime, which would result in dirty reads and loss of data when writing.
Am I right with my assumptions? Or am I overlooking a very simple way to scale this to more than one node without having a load balancer in place that gathers clients working on the same document on the same node?
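The data-loss scenario can be reproduced in a few lines. This is only a simulation: the `Map` stands in for the shared database and the two "nodes" are just two interleaved read-modify-write sequences; none of the names come from diffsync.

```javascript
function demonstrateLostUpdate() {
  // Shared "database" holding one serialized document.
  const db = new Map([['doc', JSON.stringify({ items: [] })]]);
  const read = () => JSON.parse(db.get('doc'));
  const write = (doc) => db.set('doc', JSON.stringify(doc));

  // Both nodes read before either one writes.
  const seenByA = read();
  const seenByB = read();

  seenByA.items.push('edit from A');
  write(seenByA);

  seenByB.items.push('edit from B');
  write(seenByB); // overwrites A's write because B started from a stale copy

  return read();
}
```

After the second write, the edit from node A is gone, which is the case a lock (or some form of version check) around the whole cycle would have to prevent.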
@episodeyang How did you handle this problem?
It seems like Redis would be a good fit for storing the master and shadow, maybe paired with node-redlock.
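For reference, the basic locking idea looks roughly like the sketch below. It is an in-memory stand-in that mimics the semantics of a Redis lock key (`SET resource token NX PX ttl`, and release only if the token still matches). `LockTable` and its methods are hypothetical names and this is not the node-redlock API.

```javascript
// In-memory stand-in for lock keys in Redis; hypothetical, not the node-redlock API.
class LockTable {
  constructor(now = () => Date.now()) {
    this.now = now; // injectable clock, handy for testing expiry
    this.locks = new Map(); // resource -> { token, expiresAt }
  }

  // Mirrors the idea of `SET resource token NX PX ttlMs`:
  // succeed only if no unexpired lock exists for the resource.
  tryAcquire(resource, token, ttlMs) {
    const held = this.locks.get(resource);
    if (held && held.expiresAt > this.now()) return false;
    this.locks.set(resource, { token, expiresAt: this.now() + ttlMs });
    return true;
  }

  // Release only if the caller still owns the lock, so a holder whose lock
  // expired cannot free a lock that another node has acquired since.
  release(resource, token) {
    const held = this.locks.get(resource);
    if (!held || held.token !== token) return false;
    this.locks.delete(resource);
    return true;
  }
}
```

A node would call `tryAcquire` before reading the master and shadow documents and `release` after writing them back. The TTL keeps a crashed node from blocking the document forever, at the cost of having to pick a TTL longer than the slowest sync cycle.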
Another alternative is to have the servers share the objects directly with each other. It seemed that Neil Fraser was more intrigued by this idea in his talk.
@janmonschke I'm interested in working on this problem so please let me know your thoughts on those two options.
@winton thx for your input :)
I definitely prefer option #2, which would still not make it stateless, but I also think this is the way Neil Fraser was advocating. Happy to follow your work on that!
At the moment, diffsync stores all user documents that are necessary for the sync-cycle in memory. This has two implications:
In my opinion, the users' documents should be kept outside of the diffsync node e.g. in a redis instance.
Luckily, diffsync internally already reads data via an asynchronous interface so that the code changes should actually be pretty minimal. Regarding testing it should also not be too hard.
The main implication for the user would be an elevated sync time which depends on the type of data store and the distribution of the application's parts (node, intermediate data and permanent data). I guess it is okay to have this overhead in favour of getting rid of this extra state.
What do you think?