redgeoff / spiegel

Scalable replication and change listening for CouchDB
MIT License

Performance considerations for a large number of users with multiple replications each #138

Closed: tudordumitriu closed this 5 years ago

tudordumitriu commented 6 years ago

Hi. We have the following setup: one DB per user; every user DB replicates certain docs to the central DB, and doc sharing is done from the central DB to the user DBs using other replicator documents (basically each user selects what he wants to share with a list of friends). Our Spiegel docs per user look like this: central_to_org.couchdb.user:a (with filters), user_to_central_org.couchdb.user:a (with other filters), and user_to_commands_org.couchdb.user:a. So that's 3 active replications per user, and we couldn't have done it without Spiegel!!

I have enabled detailed logs and it seems to be working great, but a major performance question arises. We only have 10 users in our dev DBs, yet I see replications being carried out (maybe these are just checks, but each check is still a CouchDB call) for almost every corresponding rule, even when there is nothing to replicate for that rule. For 100,000 users this is a bit worrying, since Spiegel's main purpose is scalability.

I haven't actually inspected the calls with Fiddler or another network sniffer, but from the logs it looks like quite a lot of checks, and even a GET is still a REST API hit and it counts, right?

redgeoff commented 6 years ago

Hey @tudordumitriu! It is a little hard to know exactly how your setup is working without knowing more details like the exact contents of your replicator docs. If you can provide them then that should help.

From what you have described it sounds like each time the user's DB changes Spiegel will detect this change via the _global_changes feed (a single continuous stream from CouchDB for each UpdateListener) and then issue a replication from the user's DB to the central DB. And then, if there is a change to the central DB it will be detected via the _global_changes feed and a replication will be scheduled from the central DB to the user's DB. So, if you are seeing more replications than this then it sounds like something is not as expected.
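
To make that flow concrete, here is a rough illustration (not Spiegel's actual code) of tailing the _global_changes feed with plain Node; the host and credentials are placeholders:

```js
// Rough illustration only: watch the _global_changes feed to see which DBs
// have changed. Host and credentials are placeholders.
const http = require('http')
const readline = require('readline')

http.get(
  'http://sa:secret@localhost:5984/_global_changes/_changes?feed=continuous&since=now&heartbeat=10000',
  res => {
    const rl = readline.createInterface({ input: res })
    rl.on('line', line => {
      if (!line.trim()) return // the continuous feed emits blank heartbeat lines
      const change = JSON.parse(line)
      // In _global_changes the doc id encodes the event and the DB name,
      // e.g. "updated:userdb-61"
      console.log('change:', change.id)
    })
  }
)
```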

Another thing to mention is that Spiegel maintains its own DB, and that DB can receive various API calls as it works its magic. Spiegel uses batch processing to reduce the number of calls, but this is just part of how it maintains a global state across a distributed system. (Conceptually, some if not all of this state could be moved to something more efficient like Redis, but there is currently no support for this.)

One thing that could help is to isolate as many variables as possible, e.g. start with a single user and a single replicator doc and monitor the requests. Then, add the next replicator doc and repeat the process.

tudordumitriu commented 6 years ago

Hey, thanks again for your quick answer. Here are a few more details:

  1. Here is a screenshot of the logs I am looking at (again, maybe it's just me overreacting): https://www.screencast.com/t/nik5HfzA9lvX
  2. Here are the replicator docs for a specific user:

```json
{
  "_id": "central_to_org.couchdb.user:a",
  "_rev": "4501-72cb5269de497882c20ff124966d1ef7",
  "$doctype": "spiegelEntity",
  "type": "replicator",
  "source": "http://sa@localhost:5984/central",
  "target": "http://sa@localhost:5984/userdb-61",
  "filter": "centralReplication/userFilter",
  "query_params": {
    "sharedTo": "org.couchdb.user:a",
    "friends": ["org.couchdb.user:t"]
  },
  "dirty": false,
  "updated_at": "2018-07-04T03:27:24.448Z",
  "locked_at": null
}
```

```json
{
  "_id": "user_to_central_org.couchdb.user:a",
  "_rev": "3005-c94fe0b27e9aebe07ee9ddb243f3aa48",
  "type": "replicator",
  "source": "http://sa@localhost:5984/userdb-61",
  "target": "http://sa@localhost:5984/central",
  "filter": "userReplication/centralFilter",
  "dirty": false,
  "locked_at": null,
  "updated_at": "2018-07-04T03:27:22.400Z"
}
```

```json
{
  "_id": "user_to_commands_org.couchdb.user:a",
  "_rev": "3014-6d291d232fb9a041ff586d3fd31861cc",
  "type": "replicator",
  "source": "http://sa@localhost:5984/userdb-61",
  "target": "http://sa@localhost:5984/commands",
  "filter": "userReplication/commandsFilter",
  "dirty": false,
  "locked_at": null,
  "updated_at": "2018-07-04T03:27:21.464Z"
}
```

  3. A document looks like:

```json
{
  "_id": "t_ebb86ea71da34e06bcc6fed7b903c430_org.couchdb.user:a",
  "type": "taskEntity",
  "createdBy": "org.couchdb.user:a",
  "updatedBy": "org.couchdb.user:a",
  "listId": "l_todo_org.couchdb.user:a",
  "name": "ap6",
  "tags": ["vacation", "work"],
  "status": "Todo",
  "timezoneDiff": 0,
  "writeUsers": ["org.couchdb.user:a"],
  "createdAt": "2018-07-04T05:53:42.270+03:00",
  "updatedAt": "2018-07-04T05:59:56.864+03:00"
}
```

  4. Central replication design doc:

```json
{
  "_id": "_design/centralReplication",
  "_rev": "9-047ccd149d9c58b8363dfd62dac244d7",
  "filters": {
    "userFilter": "function(doc, req){ \r\n if( (req.query.sharedTo && ((doc.writeUsers && doc.writeUsers.indexOf(req.query.sharedTo) >= 0) || (doc.readUsers && doc.readUsers.indexOf(req.query.sharedTo) >= 0))) && ((req.query.sharedTo == doc.createdBy) || (req.query.friends && req.query.friends.indexOf(doc.createdBy) >= 0)) ) return true; else return false; \r\n}"
  }
}
```
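
For readability, here is the same userFilter with the string escaping removed (behaviour unchanged):

```js
function (doc, req) {
  if (
    (req.query.sharedTo &&
      ((doc.writeUsers && doc.writeUsers.indexOf(req.query.sharedTo) >= 0) ||
        (doc.readUsers && doc.readUsers.indexOf(req.query.sharedTo) >= 0))) &&
    (req.query.sharedTo == doc.createdBy ||
      (req.query.friends && req.query.friends.indexOf(doc.createdBy) >= 0))
  ) {
    return true
  } else {
    return false
  }
}
```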

redgeoff commented 6 years ago

I believe the issue is that whenever a DB changes, Spiegel performs all the replications defined by the replicator docs that use that DB as their source, regardless of whether the change actually matches a particular filter/view. In your case this means that when the central DB changes, replications occur for all the user DBs.

I think an alternate design could instead replicate between user DBs, which would mean that data is only replicated when a user’s data changes. You would still end up triggering all the replications from one user to other users though whenever the source user’s DB changes.
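
For illustration, a user-to-user replicator doc under that design could look much like the docs above; the target DB name (userdb-62) and the friendFilter are hypothetical:

```json
{
  "_id": "user_a_to_user_t",
  "type": "replicator",
  "source": "http://sa@localhost:5984/userdb-61",
  "target": "http://sa@localhost:5984/userdb-62",
  "filter": "userReplication/friendFilter",
  "query_params": { "sharedTo": "org.couchdb.user:t" }
}
```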

redgeoff commented 6 years ago

I believe that Spiegel could also be modified to consider filters/views before replicating. This would work by taking the filter/view function from the design doc and applying it to the change to see whether the change matches. There may be a performance trade-off with this though, as the processing would become more involved.
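
A rough sketch of that idea (this is not an existing Spiegel feature; the changeMatches helper and the example values are made up for illustration, with the filter source copied from the design doc above):

```js
// Sketch: evaluate a design doc's filter against a changed doc before
// scheduling the replication, so non-matching changes can be skipped.

// Filter source as it would be read from _design/centralReplication (see above)
const filterSrc =
  'function(doc, req){ if( (req.query.sharedTo && ((doc.writeUsers && doc.writeUsers.indexOf(req.query.sharedTo) >= 0) || (doc.readUsers && doc.readUsers.indexOf(req.query.sharedTo) >= 0))) && ((req.query.sharedTo == doc.createdBy) || (req.query.friends && req.query.friends.indexOf(doc.createdBy) >= 0)) ) return true; else return false; }'

// Turn the stored string into a callable function
// eslint-disable-next-line no-new-func
const filterFn = new Function('return ' + filterSrc)()

// Does this change matter to this replicator doc?
function changeMatches (changedDoc, replicator) {
  // Mimic the req object that CouchDB passes to filter functions
  const req = { query: replicator.query_params || {} }
  return filterFn(changedDoc, req)
}

// Example using the docs from this thread
const replicator = {
  filter: 'centralReplication/userFilter',
  query_params: { sharedTo: 'org.couchdb.user:a', friends: ['org.couchdb.user:t'] }
}
const changedDoc = { createdBy: 'org.couchdb.user:a', writeUsers: ['org.couchdb.user:a'] }

console.log(changeMatches(changedDoc, replicator)) // true => worth replicating
```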

redgeoff commented 6 years ago

And just thinking off the top of my head, another possible enhancement to Spiegel is to define “dynamic” replicator docs that would take the values in the changed doc and then determine which replication to perform. So, you could define a single replicator doc (or really on_change doc) for the central DB that would dynamically perform all the replications to the appropriate user DBs. This is essentially another usage of the enhancement discussed at https://github.com/redgeoff/spiegel/issues/52.

One way of doing this without making any changes to Spiegel is to define an _onchange doc that calls a user-defined API to perform the replications. In other words, you would create an API endpoint that would perform the replications from the central DB and offload the dynamic replication to your API.
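
A minimal sketch of such an on_change doc, roughly following the format described in the Spiegel README (double-check the exact field names and the $change / $db_name placeholders against the README for your Spiegel version); the /replicate endpoint is a hypothetical API you would implement yourself:

```json
{
  "_id": "replicate_central_changes",
  "type": "on_change",
  "db_name": "central",
  "url": "http://apiuser@api.example.com/replicate",
  "params": {
    "change": "$change",
    "db_name": "$db_name"
  },
  "method": "POST"
}
```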

tudordumitriu commented 6 years ago

Thanks @redgeoff, really appreciate it. We will definitely try the on_change / dynamic replication API solution, just not right now, since we need to focus on features, but we will surely keep you posted in a couple of months when we implement it.

tudordumitriu commented 6 years ago

Actually, a couple of follow-up questions related to the on_change solution:

  1. In on_change we would receive the doc, and we would have to load the corresponding rules from Spiegel (we still need to load the allowed-friends rules, but we could do it in a performance-friendly way, loading only the corresponding rules; that's the catch), and then we would carry out exactly the replications we want, right?
  2. Would it be failsafe, coming from Spiegel's changes service? Or do we have to make sure that in our API

redgeoff commented 6 years ago

  1. I'm not sure I understand your question. I suspect you'd want to use the $changes construct, which allows you to send the entire change doc to your API. Then, your API should be able to determine which replications are needed and perform them (see the sketch after this list).

  2. The changes are almost failsafe. CouchDB does not guarantee that the _global_changes feed will receive all changes, as I believe there are things like load that can affect this; however, in my experience it is extremely rare that a change is missed--I'm not sure I've ever seen it. The actual doc changes (and their details) are retrieved using the _changes feed on a DB, which is completely failsafe. Of course, an alternative is to create a custom API endpoint that performs the replications for you, and I think this is also a good choice provided you don't need to support an offline-first design.
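
To make the API idea concrete, here is a rough sketch only, not a drop-in implementation. Assumptions: Express, Node 18+ (for global fetch), the change arriving in the request body under "change" (as configured in the on_change doc sketched above), and a placeholder user-to-DB mapping:

```js
const express = require('express')

// Placeholders: real credentials/host and the user-to-DB mapping are app-specific
const COUCH = 'http://sa:secret@localhost:5984'
const userDbs = {
  'org.couchdb.user:a': 'userdb-61', // from this thread
  'org.couchdb.user:t': 'userdb-62' // hypothetical
}

const app = express()
app.use(express.json())

app.post('/replicate', async (req, res) => {
  const change = req.body.change
  const doc = change.doc || change // depending on how the change is delivered

  // Only replicate to the users this doc actually concerns
  const users = new Set(
    [doc.createdBy].concat(doc.writeUsers || []).concat(doc.readUsers || [])
  )

  for (const user of users) {
    if (!userDbs[user]) continue // unknown user in this placeholder mapping

    // One-shot replication from the central DB to just this user's DB, reusing
    // the existing filter (add the friends param as in the replicator docs
    // above if your filter needs it)
    await fetch(COUCH + '/_replicate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        source: COUCH + '/central',
        target: COUCH + '/' + userDbs[user],
        filter: 'centralReplication/userFilter',
        query_params: { sharedTo: user }
      })
    })
  }

  res.sendStatus(200)
})

app.listen(3000)
```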

tudordumitriu commented 6 years ago

  1. Exactly, my bad for expressing it poorly; indeed, that's where the optimization lies.
  2. Unfortunately we do need to support offline-first - we started with that... well... first :)

Thanks for now, we'll see how things work out.