share / sharedb-mongo

MongoDB database adapter for ShareDB

Applying ops under load #123

Open yaroslavputria opened 2 years ago

yaroslavputria commented 2 years ago

Hi Guys!

About the project: it's a Quill-based collaborative rich text editor. On the back-end side, there is a load balancer over 5 instances of ShareDB connected through Redis. sharedb-mongo is the DB adapter.

I'm doing a kind of load testing: I'm running end-to-end tests which simulate users editing - each user produces changes at some frequency. I'm limited in resources, so I'm able to run about 50 chromedriver instances (so 50 users). As I have 5 instances of ShareDB, I expect about 5 users per document working concurrently; these tests simulate 50 users typing within 10 documents. The consumed CPU and RAM look good. When I noticed that under such a load synchronizing users takes time, I added logs to trace the time spent applying an op. I see that applying an op mostly takes milliseconds, BUT sometimes it takes even more than 10 seconds - that's the case when the "commit" is unsuccessful and ShareDB has to re-fetch the snapshot/ops (to re-transform the op). Most of this time is spent accessing Mongo.

Is there anything that I can improve? Or is it a limit? I use only one mongo collection - will using multiple collections improve the experience?

Applying op timing:

+++ getSnapshot before 0
### collection is get 0
### beforeSnapshotLookup middleware 0
### snapshot is fetched 0.001
+++ getSnapshot after 0.001
~~~ apply 0.001
~~~ commit 0.001
~~~ afterWrite 0.004
+++ getSnapshot before 0
### collection is get 0
### beforeSnapshotLookup middleware 0
### snapshot is fetched 0.001
+++ getSnapshot after 0.001
~~~ apply 0.001
~~~ commit 0.001
~~~ afterWrite 0.003

+++ getSnapshot before 0
### collection is get 0
### beforeSnapshotLookup middleware 0
### snapshot is fetched 0.851
+++ getSnapshot after 0.851
+++ getOpsToSnapshot before 0.851
### OP collection is get 0.851
### ops are fetched 1.774
+++ getOpsToSnapshot after 1.774
+++ ops transformed and will be applied 1.774
~~~ apply 1.774
~~~ commit 1.774
+++ unsuccessful commit -> do retry
+++ getSnapshot before 2.15
### collection is get 2.15
### beforeSnapshotLookup middleware 2.15
### snapshot is fetched 2.152
+++ getSnapshot after 2.152
+++ getOpsToSnapshot before 2.152
### OP collection is get 2.152
### ops are fetched 2.154
+++ getOpsToSnapshot after 2.154
+++ ops transformed and will be applied 2.154
~~~ apply 2.154
~~~ commit 2.154
~~~ afterWrite 2.241
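For reference, timings like the above can be collected with ShareDB's middleware hooks ('submit', 'afterWrite', etc.). The sketch below is illustrative rather than the exact instrumentation used for these logs; it assumes the same request object flows through the middleware actions for a given submit, and the URL and threshold are placeholders:

```js
const ShareDB = require('sharedb');
const ShareDbMongo = require('sharedb-mongo');

// Hypothetical backend wiring; swap in your own Mongo URL and options.
const backend = new ShareDB({db: ShareDbMongo('mongodb://localhost:27017/notes')});

// Stamp a start time when an op is submitted...
backend.use('submit', function(request, next) {
  request.customSubmitStart = Date.now();
  next();
});

// ...and report how long the whole apply/commit/write cycle took.
backend.use('afterWrite', function(request, next) {
  const elapsedMs = Date.now() - request.customSubmitStart;
  if (elapsedMs > 1000) {
    console.warn('slow op on', request.collection + '/' + request.id, elapsedMs, 'ms');
  }
  next();
});
```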
alecgibson commented 2 years ago

Thanks for the work you've done so far! Do you happen to have the MongoDB logs? They should contain the queries being made, as well as how long those queries are taking; that would really help narrow down whether the issue is a slow query or something in sharedb-mongo (or both!).
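As an aside, if the server-side mongod logs are hard to reach, per-query timings can also be captured from the Node side with the MongoDB driver's command monitoring. This is only a sketch under the assumption that sharedb-mongo is constructed with its callback form so you control the MongoClient yourself; the URL is a placeholder and event field names can vary slightly between driver versions:

```js
const {MongoClient} = require('mongodb');
const ShareDbMongo = require('sharedb-mongo');

const db = ShareDbMongo({
  mongo: function(callback) {
    MongoClient.connect(
      'mongodb://localhost:27017/notes',  // placeholder URL
      {monitorCommands: true},            // emit command monitoring events
      function(err, client) {
        if (err) return callback(err);
        // Log each command's name and how long it took, in milliseconds
        client.on('commandSucceeded', function(event) {
          console.log(event.commandName, event.duration, 'ms');
        });
        callback(null, client);
      }
    );
  }
});
```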

ericyhwang commented 2 years ago

Yeah, thanks for providing the detailed timing logs, those are illustrative!

On the first attempt, the two long delays are:

### beforeSnapshotLookup middleware 0
### snapshot is fetched 0.851
...
### OP collection is get 0.851
### ops are fetched 1.774

Which are indeed both in sharedb-mongo + Mongo.

As Alec mentions, the slowness could be coming from the Mongo queries themselves, from sharedb-mongo, or from both.

It would be interesting to take a performance profile of, say, 2 Node processes handling requests from 20 clients, to see if the bottleneck is Node process CPU, time spent waiting on Mongo, or something else entirely.

With just 2 Node processes, it should be feasible to run them both with --inspect, attach Chrome Dev Tools to both, and run the Performance profiler on both while running the load test. Or use your other favorite profiling tool.

yaroslavputria commented 2 years ago

@alecgibson @ericyhwang thanks for the hints - I will collect more info.

yaroslavputria commented 2 years ago

I haven't done profiling yet - it requires more effort to hook into our Kubernetes infrastructure... But I'll do it.

Meanwhile, I set the Mongo log level to debug. I also added a log wrapped in nextTick before fetching ops/the snapshot, to check whether Node.js is overloaded. In the case when fetching the ops/snapshot takes too much time, my nextTick log is called just after a few Mongo logs (so Node.js should not be overloaded), but there are a lot of calls (sometimes hundreds) to Mongo before the ops/snapshot are actually fetched.

Adapting the example above:

### beforeSnapshotLookup middleware 0
// here, mongo logs a lot of queries
### snapshot is fetched 0.851 // looks like this fetch is queued

Attaching new logs with my comments (separated with //): quick.log slow.log
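(For anyone reproducing this: the nextTick check isn't shown in the thread, but it was presumably something along these lines - a queued callback whose delay gives a rough sense of whether the Node event loop is backed up. Names here are made up for illustration.)

```js
// Illustrative helper: queue a callback and log how long it waits before running.
// A small delay suggests Node itself isn't the bottleneck; a large one would
// point at an overloaded event loop rather than at Mongo.
function logQueueDelay(label) {
  const queuedAt = process.hrtime.bigint();
  process.nextTick(function() {
    const waitedMs = Number(process.hrtime.bigint() - queuedAt) / 1e6;
    console.log(label, 'queued callback ran after', waitedMs.toFixed(2), 'ms');
  });
}

// e.g. logQueueDelay('before snapshot fetch'); right before calling into sharedb-mongo
```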

ericyhwang commented 2 years ago

Based on that, it currently seems like the slowness is due to Mongo time and not Node time, so it's probably best to look further at Mongo for now instead of trying to get Node profiling working in Kubernetes.

The first slow part of the slow log has a bunch of o_NOTES_EDITOR_SHAREDB queries. One thing to check - Is there an appropriate index { d: 1, v: 1 } on o_NOTES_EDITOR_SHAREDB? If not, those would take a long time in Mongo and could slow down other queries.

Later on, there are also multiple queries against the NOTES_EDITOR_SHAREDB snapshot collection, which should be fast since they're id lookups. But if there are enough happening at once, that could slow things down. Or it could be residual processing on the op queries.

yaroslavputria commented 2 years ago

Hi @ericyhwang,

One thing to check - Is there an appropriate index { d: 1, v: 1 } on o_NOTES_EDITOR_SHAREDB?

Thank you for trying to guide me, but I didn't get this - how to check this?

But if there are enough happening at once, that could slow things down. Or it could be residual processing on the op queries.

This is also unclear to me. What is "enough" in this context? I'm actually doing this "load testing" to understand what I need in order to allow up to 500 users to work concurrently (up to 10 users per document). At some point (~5 users per doc), scaling ShareDB stopped improving the experience.

alecgibson commented 2 years ago

how to check this?

https://docs.mongodb.com/manual/reference/method/db.collection.getIndexes/
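For example, in the mongo shell (collection names taken from the logs above; createIndex is only needed if the index turns out to be missing):

```js
// List the indexes on the ops collection
db.o_NOTES_EDITOR_SHAREDB.getIndexes();

// Create the {d: 1, v: 1} index if it isn't there
db.o_NOTES_EDITOR_SHAREDB.createIndex({d: 1, v: 1}, {background: true});
```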

yaroslavputria commented 2 years ago

🤦, thanks @alecgibson.

Documents collection indexes:

[
   {
      "v":2,
      "key":{
         "_id":1
      },
      "name":"_id_",
      "ns":"yyp_ca_notes_editor.NOTES_EDITOR_SHAREDB"
   }
]

Ops collection indexes:

[
   {
      "v":2,
      "key":{
         "_id":1
      },
      "name":"_id_",
      "ns":"yyp_ca_notes_editor.o_NOTES_EDITOR_SHAREDB"
   },
   {
      "v":2,
      "key":{
         "d":1,
         "v":1
      },
      "name":"d_1_v_1",
      "ns":"yyp_ca_notes_editor.o_NOTES_EDITOR_SHAREDB",
      "background":true
   },
   {
      "v":2,
      "key":{
         "src":1,
         "seq":1,
         "v":1
      },
      "name":"src_1_seq_1_v_1",
      "ns":"yyp_ca_notes_editor.o_NOTES_EDITOR_SHAREDB",
      "background":true
   }
]
alecgibson commented 2 years ago

Out of curiosity, have you got getOpsWithoutStrictLinking: true?
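For context, this is an option on the sharedb-mongo constructor; a minimal sketch of where it goes (the URL is a placeholder):

```js
const ShareDbMongo = require('sharedb-mongo');

// Skip strict op-link validation when fetching ops, which can make getOps cheaper
const db = ShareDbMongo('mongodb://localhost:27017/notes', {
  getOpsWithoutStrictLinking: true
});
```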

alecgibson commented 2 years ago

Also do you have access to the MongoDB logs from the mongod process itself? By default I think this lives at /var/log/mongodb/mongodb.log and it might contain a bit more detailed information, such as how long each query is taking.
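If the log file isn't convenient, similar timing information can be coaxed out of Mongo itself with the profiler; a hedged mongo-shell sketch (the threshold here is arbitrary):

```js
// Profile operations slower than 100 ms (slow ops are also written to mongod's log)
db.setProfilingLevel(1, 100);

// Inspect the slowest recent operations captured by the profiler
db.system.profile.find().sort({millis: -1}).limit(5);
```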

yaroslavputria commented 2 years ago

Out of curiosity, have you got getOpsWithoutStrictLinking: true?

Yes, before I created this issue I had tried getOpsWithoutStrictLinking.

MongoDB logs from the mongod process

I'll try to collect them.

yaroslavputria commented 2 years ago

I added more logs to figure out what that queue of queries is that delays getSnapshot during a slow op - now I see that it's caused by broadcasting presence data. That bunch of identical queries is a set of getSnapshotOpLink requests (getSnapshotOpLink is part of transformPresenceToLatestVersion). So clients produce this load by sharing presence data - I will look more into this.

See the attached log file (it has more stacktrace information). slow.log

yaroslavputria commented 2 years ago

Without presence it's faster than light.
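(For anyone following along: presence is an opt-in flag on the ShareDB backend, so "without presence" here just means leaving it off. Roughly, with illustrative wiring:)

```js
const ShareDB = require('sharedb');
const ShareDbMongo = require('sharedb-mongo');

const backend = new ShareDB({
  db: ShareDbMongo('mongodb://localhost:27017/notes'),  // placeholder URL
  presence: true  // enables presence broadcasting - the source of the getOps load above
});
```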

alecgibson commented 2 years ago

@yaroslavputria thanks for all the digging you've done on this!

Here are some of my initial thoughts:

We definitely don't want Presence bogging down "real" ShareDB work. In theory, Presence itself is OT, so it should be able to work at a similar performance level.

I think the issue here is a combination of how frequently clients broadcast presence and the fact that the server goes back to the database to transform each presence update.

Let's compare with SubmitRequest.submit(). In SubmitRequest.submit(), we fetch the snapshot, apply the op, and attempt the commit; we only go back to the database for ops (to transform against) if the commit fails because someone else committed first.

This means that in the "happy" case (clients don't submit concurrent ops), we never have to fetch any ops, and we reuse the snapshot that we already had.

Contrast this with Agent._broadcastPresence: even in the happy case, we call getOps() every time there's a Presence update.

There's possibly some trickery we can do here to re-optimise for the "happy" case, because we transform Presence both server-side and client-side.

If you turned off server-side transformations, this would massively reduce the calls to the Database, because we'd only ever call it if the client needs to catch up some presence. This would come at the cost of some potential lag introduced into the Presence system where ops collide with Presence.

I've opened https://github.com/share/sharedb/pull/538 to continue discussion down that thread.

@yaroslavputria would you be able to please re-run your performance test using that branch?

yaroslavputria commented 2 years ago

Hi @alecgibson, yes, I'm going to re-run my tests using that branch. Thanks!