yaroslavputria opened 2 years ago
Thanks for the work you've done so far! Do you happen to have the MongoDB logs? They should contain the queries being made, as well as how long those queries are taking; it'd really help to narrow down if the issue is a slow query, or with sharedb-mongo (or both!).
Yeah, thanks for providing the detailed timing logs, those are illustrative!
On the first attempt, the two long delays are:
```
### beforeSnapshotLookup middleware 0
### snapshot is fetched 0.851
...
### OP collection is get 0.851
### ops are fetched 1.774
```
Which are indeed both in sharedb-mongo + Mongo.
As Alec mentions, the slowness could be coming from:
It would be interesting to take a performance profile of, say, 2 Node processes handling requests from 20 clients, to see if the bottleneck is Node process CPU, time spent waiting on Mongo, or something else entirely.
With just 2 Node processes, it should be feasible to run them both with `--inspect`, attach Chrome DevTools to both, and run the Performance profiler on both while running the load test. Or use your other favorite profiling tool.
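For reference, wiring that up might look like the following (a sketch assuming a hypothetical `server.js` entrypoint; the ports are arbitrary):

```shell
# Start each Node process with the inspector listening on its own port
node --inspect=9229 server.js &
node --inspect=9230 server.js &

# In Chrome, open chrome://inspect, add localhost:9229 and localhost:9230
# under "Configure...", then attach DevTools to each target and record a
# Performance profile while the load test runs.
```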
@alecgibson @ericyhwang thanks for the hints - I will collect more info.
I haven't done profiling yet - it requires more effort to hook into our Kubernetes infrastructure... But I'll do it.
Meanwhile, I set the mongo log level to debug. I also added a log wrapped in `nextTick` before fetching ops/snapshot, to check whether Node.js is overloaded. In the cases where fetching ops/snapshot takes too much time, my `nextTick` log is called after just a few mongo logs (so Node.js should not be overloaded), but there are a lot of calls (sometimes hundreds) to mongo before the ops/snapshot is actually fetched.
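A sketch of that probe (the helper name is made up; the idea is that if the `nextTick` callback fires almost immediately while the fetch itself takes seconds, the event loop is idle and the time is spent waiting on Mongo):

```javascript
// Illustrative sketch: wrap a log in process.nextTick() before a slow call.
// If the nextTick callback fires promptly, the Node event loop is not the
// bottleneck; the remaining time is spent waiting on the database.
function timeWithNextTickProbe(label, asyncWork, done) {
  const start = process.hrtime.bigint();
  process.nextTick(function() {
    const lagMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(label + ': nextTick fired after ' + lagMs.toFixed(3) + 'ms');
  });
  asyncWork(function() {
    const totalMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(label + ': completed after ' + totalMs.toFixed(3) + 'ms');
    done(totalMs);
  });
}

// Example: a "query" that takes ~50ms of wall time but no CPU.
// A tiny nextTick lag with a large total time points at I/O wait, not CPU.
timeWithNextTickProbe('fetch ops', function(cb) {
  setTimeout(cb, 50);
}, function(totalMs) {});
```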
Adapting the example above:
```
### beforeSnapshotLookup middleware 0
// here, mongo logs a lot of queries
### snapshot is fetched 0.851 // looks like this fetch is queued
```
Attaching new logs with my comments (separated with //): quick.log slow.log
Based on that, it currently seems like the slowness is due to Mongo time and not Node time, so it's probably best to look further at Mongo for now instead of trying to get Node profiling working in Kubernetes.
The first slow part of the slow log has a bunch of `o_NOTES_EDITOR_SHAREDB` queries. One thing to check: is there an appropriate index `{ d: 1, v: 1 }` on `o_NOTES_EDITOR_SHAREDB`? If not, those would take a long time in Mongo and could slow down other queries.
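For reference, a missing ops index could be created from the mongo shell like this (a sketch; sharedb-mongo normally creates this index itself on startup unless index creation is disabled):

```js
// Create the ops index ShareDB's queries rely on (d: doc id, v: version)
db.o_NOTES_EDITOR_SHAREDB.createIndex({ d: 1, v: 1 }, { background: true });
```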
Later on, there are also multiple queries against the `NOTES_EDITOR_SHAREDB` snapshot collection, which should be fast since they're id lookups. But if there are enough happening at once, that could slow things down. Or it could be residual processing on the op queries.
Hi @ericyhwang,
> One thing to check - Is there an appropriate index `{ d: 1, v: 1 }` on `o_NOTES_EDITOR_SHAREDB`?
Thank you for trying to guide me, but I didn't get this - how to check this?
> But if there are enough happening at once, that could slow things down. Or it could be residual processing on the op queries.
This is also unclear to me. What is "enough" in this context? I'm actually doing this load testing to understand what I need in order to allow up to 500 users to work concurrently (up to 10 users per document). At some point (~5 users per doc), scaling sharedb stopped improving the experience.
> how to check this?
https://docs.mongodb.com/manual/reference/method/db.collection.getIndexes/
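Concretely, that check would look like this from the mongo shell (collection names taken from the logs above):

```js
// List the indexes on the snapshot and ops collections
db.NOTES_EDITOR_SHAREDB.getIndexes();
db.o_NOTES_EDITOR_SHAREDB.getIndexes();
```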
🤦, thanks @alecgibson.
Documents collection indexes:
```json
[
  {
    "v": 2,
    "key": { "_id": 1 },
    "name": "_id_",
    "ns": "yyp_ca_notes_editor.NOTES_EDITOR_SHAREDB"
  }
]
```
Ops collection indexes:
```json
[
  {
    "v": 2,
    "key": { "_id": 1 },
    "name": "_id_",
    "ns": "yyp_ca_notes_editor.o_NOTES_EDITOR_SHAREDB"
  },
  {
    "v": 2,
    "key": { "d": 1, "v": 1 },
    "name": "d_1_v_1",
    "ns": "yyp_ca_notes_editor.o_NOTES_EDITOR_SHAREDB",
    "background": true
  },
  {
    "v": 2,
    "key": { "src": 1, "seq": 1, "v": 1 },
    "name": "src_1_seq_1_v_1",
    "ns": "yyp_ca_notes_editor.o_NOTES_EDITOR_SHAREDB",
    "background": true
  }
]
```
Out of curiosity, have you got `getOpsWithoutStrictLinking: true`?
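(For context, that option is passed when constructing the sharedb-mongo adapter; a minimal sketch, with a placeholder connection string:)

```js
const ShareDbMongo = require('sharedb-mongo');

// getOpsWithoutStrictLinking relaxes strict op-link verification when
// fetching ops, which can reduce the work done per getOps call.
const db = new ShareDbMongo('mongodb://localhost:27017/yyp_ca_notes_editor', {
  getOpsWithoutStrictLinking: true
});
```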
Also, do you have access to the MongoDB logs from the `mongod` process itself? By default I think this lives at `/var/log/mongodb/mongodb.log`, and it might contain a bit more detailed information, such as how long each query is taking.
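If the mongod log doesn't already include per-query timings, the profiler threshold can be lowered from the mongo shell; a sketch (the 50ms threshold is arbitrary):

```js
// Level 1 profiles operations slower than slowms; slow operations are also
// written to the mongod log, with execution times.
db.setProfilingLevel(1, { slowms: 50 });
```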
> Out of curiosity, have you got `getOpsWithoutStrictLinking: true`?
Yes, before I created this issue I had tried `getOpsWithoutStrictLinking`.
> MongoDB logs from the `mongod` process
I'll try to collect them.
I added more logs to figure out what the query queue is that delays `getSnapshot` during a slow op. Now I see that it's caused by broadcasting presence data: that bunch of identical queries is a set of `getSnapshotOpLink` requests (`getSnapshotOpLink` is part of `transformPresenceToLatestVersion`). So clients produce this load by sharing presence data - I will look into this more.
See the attached log file (it has more stacktrace information). slow.log
Without presence it's faster than light.
@yaroslavputria thanks for all the digging you've done on this!
Here are some of my initial thoughts:
We definitely don't want Presence bogging down "real" ShareDB work. In theory, Presence itself is OT, so should be able to work at a similar performance level.
I think the issue here is a combination of:

- how often we call `db.getOps()` (it's called on every Presence broadcast)
- possibly the performance of `db.getOps()` itself?

Let's compare with `SubmitRequest.submit()`. In `SubmitRequest.submit()`, we only fetch ops when the submitted op's version is behind the snapshot we already fetched, in order to transform the op.
This means that in the "happy" case (clients don't submit concurrent ops), we never have to fetch any ops, and we reuse the snapshot that we already had.
Contrast this with `Agent._broadcastPresence`: even in the happy case, we call `getOps()` every time there's a Presence update.
There's possibly some trickery we can do here to re-optimise for the "happy" case, because we transform Presence both server-side and client-side.
If you turned off server-side transformations, this would massively reduce the calls to the Database, because we'd only ever call it if the client needs to catch up some presence. This would come at the cost of some potential lag introduced into the Presence system where ops collide with Presence.
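To illustrate the difference (a hypothetical sketch, not ShareDB's actual code; `broadcastPresence` and `fakeDb` are made-up names): skipping `getOps()` when the presence message is already at the latest known version removes the database round-trip in the happy case.

```javascript
// A stand-in database that counts how often ops are fetched.
const fakeDb = {
  getOpsCalls: 0,
  getOps: function(collection, id, from, to, callback) {
    this.getOpsCalls++;
    callback(null, []); // pretend there are no ops to catch up on
  }
};

function broadcastPresence(db, presence, latestVersion, callback) {
  if (presence.v === latestVersion) {
    // Happy case: presence is already at the latest version, so no
    // transformation is needed and we can skip the database entirely.
    return callback(null, presence);
  }
  // Unhappy case: fetch the missing ops and transform presence against them.
  db.getOps('NOTES_EDITOR_SHAREDB', presence.d, presence.v, latestVersion,
    function(err, ops) {
      if (err) return callback(err);
      // ...transform presence against ops here (omitted)...
      callback(null, Object.assign({}, presence, { v: latestVersion }));
    });
}
```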
I've opened https://github.com/share/sharedb/pull/538 to continue discussion down that thread.
@yaroslavputria would you be able to please re-run your performance test using that branch?
Hi @alecgibson, Yes, I'm going to re-run my tests using that branch. Thanks!
Hi Guys!
About the project: it's a Quill collaborative rich text editor. On the back-end side, there is a load balancer over 5 instances of sharedb connected through redis. Sharedb-mongo is the db adapter.
I'm doing a kind of load testing: I'm running end-to-end tests which simulate users editing - each user produces changes with some frequency. I'm limited in resources, so I'm able to run about 50 chromedriver instances (so 50 users). As I have 5 instances of sharedb, I expect 5 users per document working concurrently. So these tests simulate 50 users typing within 10 documents. The consumed CPU and RAM look good. When I noticed that, under such a load, synchronizing users takes time, I added logs to trace the time of applying an op. I see that applying an op mostly takes milliseconds, BUT sometimes it takes even more than 10 seconds - this is the case when the "commit" is unsuccessful and sharedb has to re-fetch the snapshot/ops (to re-transform the op). Most of this time is spent on accessing mongo.
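That expensive retry path can be sketched roughly like this (hypothetical names and a trivial stand-in `transform`; ShareDB's real logic lives in its `SubmitRequest`):

```javascript
// Placeholder OT transform: a real implementation rewrites op against otherOp.
function transform(op, otherOp) {
  return op;
}

// Rough sketch of why a failed commit is expensive: each retry costs extra
// round-trips to the database for ops, plus another commit attempt.
function submitWithRetries(db, op, callback) {
  db.commit(op, function(err, succeeded) {
    if (err) return callback(err);
    if (succeeded) return callback(null, op);
    // Commit lost the race: another client committed first. Fetch the ops
    // we missed, transform our op against them, and try again.
    db.getOps(op.v, function(err, ops) {
      if (err) return callback(err);
      ops.forEach(function(other) {
        op = transform(op, other);
        op.v++;
      });
      submitWithRetries(db, op, callback);
    });
  });
}
```

Under contention (many users per document), the unhappy path runs often, which matches the multi-second submits seen in the logs.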
Is there anything that I can improve? Or is it a limit? I use only one mongo collection - would using multiple collections improve the experience?
Applying op timing: