Closed: johnml1135 closed this issue 1 year ago
I restarted Mongo and got this. Why?
Here are the "really slow" messages for operations that are not running after I restarted Mongo:
{
"log": "{\"t\":{\"$date\":\"2023-08-11T20:06:54.292+00:00\"},\"s\":\"I\", \"c\":\"COMMAND\", \"id\":51803, \"ctx\":\"conn16\",\"msg\":\"Slow query\",\"attr\":{\"type\":\"command\",\"ns\":\"machine_jobs.hangfire.jobGraph\",\"command\":{\"findAndModify\":\"hangfire.jobGraph\",\"query\":{\"$and\":[{\"Queue\":\"smt_transfer\"},{\"_t\":\"JobDto\"},{\"FetchedAt\":null}]},\"update\":{\"$set\":{\"FetchedAt\":{\"$date\":\"2023-08-11T20:06:51.404Z\"}}},\"new\":true,\"txnNumber\":1,\"$db\":\"machine_jobs\",\"lsid\":{\"id\":{\"$uuid\":\"0b474ddb-e39b-40f6-a47c-bb5b7ca4b4dd\"}},\"$clusterTime\":{\"clusterTime\":{\"$timestamp\":{\"t\":1691784357,\"i\":5}},\"signature\":{\"hash\":{\"$binary\":{\"base64\":\"AAAAAAAAAAAAAAAAAAAAAAAAAAA=\",\"subType\":\"0\"}},\"keyId\":0}}},\"planSummary\":\"IXSCAN { Queue: -1 }\",\"keysExamined\":0,\"docsExamined\":0,\"fromMultiPlanner\":true,\"nMatched\":0,\"nModified\":0,\"nUpserted\":0,\"numYields\":2,\"queryHash\":\"88F449FE\",\"planCacheKey\":\"04E29608\",\"reslen\":217,\"locks\":{\"ParallelBatchWriterMode\":{\"acquireCount\":{\"r\":4}},\"FeatureCompatibilityVersion\":{\"acquireCount\":{\"r\":5,\"w\":3}},\"ReplicationStateTransition\":{\"acquireCount\":{\"w\":7}},\"Global\":{\"acquireCount\":{\"r\":5,\"w\":3}},\"Database\":{\"acquireCount\":{\"w\":3}},\"Collection\":{\"acquireCount\":{\"w\":3}},\"Mutex\":{\"acquireCount\":{\"r\":3}}},\"flowControl\":{\"acquireCount\":3,\"timeAcquiringMicros\":4},\"readConcern\":{\"provenance\":\"implicitDefault\"},\"writeConcern\":{\"w\":\"majority\",\"wtimeout\":0,\"provenance\":\"implicitDefault\"},\"storage\":{\"data\":{\"bytesRead\":3194,\"timeReadingMicros\":6},\"timeWaitingMicros\":{\"handleLock\":2,\"schemaLock\":229689}},\"remote\":\"10.42.5.87:33498\",\"protocol\":\"op_msg\",\"durationMillis\":256}}\n",
"stream": "stdout",
"time": "2023-08-11T20:06:54.292425547Z",
"t": {
"$date": "2023-08-11T20:06:54.292+00:00"
},
"s": "I",
"c": "COMMAND",
"id": 51803,
"ctx": "conn16",
"msg": "Slow query",
"attr": {
"type": "command",
"ns": "machine_jobs.hangfire.jobGraph",
"command": {
"findAndModify": "hangfire.jobGraph",
"query": {
"$and": [
{
"Queue": "smt_transfer"
},
{
"_t": "JobDto"
},
{
"FetchedAt": null
}
]
},
"update": {
"$set": {
"FetchedAt": {
"$date": "2023-08-11T20:06:51.404Z"
}
}
},
"new": true,
"txnNumber": 1,
"$db": "machine_jobs",
"lsid": {
"id": {
"$uuid": "0b474ddb-e39b-40f6-a47c-bb5b7ca4b4dd"
}
},
"$clusterTime": {
"clusterTime": {
"$timestamp": {
"t": 1691784357,
"i": 5
}
},
"signature": {
"hash": {
"$binary": {
"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
"subType": "0"
}
},
"keyId": 0
}
}
},
"planSummary": "IXSCAN { Queue: -1 }",
"keysExamined": 0,
"docsExamined": 0,
"fromMultiPlanner": true,
"nMatched": 0,
"nModified": 0,
"nUpserted": 0,
"numYields": 2,
"queryHash": "88F449FE",
"planCacheKey": "04E29608",
"reslen": 217,
"locks": {
"ParallelBatchWriterMode": {
"acquireCount": {
"r": 4
}
},
"FeatureCompatibilityVersion": {
"acquireCount": {
"r": 5,
"w": 3
}
},
"ReplicationStateTransition": {
"acquireCount": {
"w": 7
}
},
"Global": {
"acquireCount": {
"r": 5,
"w": 3
}
},
"Database": {
"acquireCount": {
"w": 3
}
},
"Collection": {
"acquireCount": {
"w": 3
}
},
"Mutex": {
"acquireCount": {
"r": 3
}
}
},
"flowControl": {
"acquireCount": 3,
"timeAcquiringMicros": 4
},
"readConcern": {
"provenance": "implicitDefault"
},
"writeConcern": {
"w": "majority",
"wtimeout": 0,
"provenance": "implicitDefault"
},
"storage": {
"data": {
"bytesRead": 3194,
"timeReadingMicros": 6
},
"timeWaitingMicros": {
"handleLock": 2,
"schemaLock": 229689
}
},
"remote": "10.42.5.87:33498",
"protocol": "op_msg",
"durationMillis": 256
}
}
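Side note for anyone grepping these logs: the container runtime wraps each mongod line in its own JSON envelope, so the actual Mongo entry is double-encoded inside the "log" field (as in the entry above). A small sketch of unwrapping it, with the sample trimmed to a few fields:

```python
import json

# Simulate one wrapped line the way the container runtime emits it: the mongod
# JSON log entry is serialized into the "log" string of an outer JSON object.
mongo_entry = {
    "t": {"$date": "2023-08-11T20:06:54.292+00:00"},
    "msg": "Slow query",
    "attr": {"planSummary": "IXSCAN { Queue: -1 }", "durationMillis": 256},
}
container_line = json.dumps({"log": json.dumps(mongo_entry) + "\n", "stream": "stdout"})

outer = json.loads(container_line)   # container envelope
inner = json.loads(outer["log"])     # the actual mongod log entry, parsed a second time
print(inner["attr"]["planSummary"], inner["attr"]["durationMillis"])
# IXSCAN { Queue: -1 } 256
```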
I think it's related to https://github.com/HangfireIO/Hangfire/issues/1962.
Correction: 80+% of the logs are about aggregating the translation engines...
{
"type": "command",
"ns": "machine.translation_engines",
"command": {
"aggregate": "translation_engines",
"pipeline": [
{
"$changeStream": {
"fullDocument": "updateLookup",
"startAtOperationTime": {
"$timestamp": {
"t": 1691586149,
"i": 2
}
}
}
},
{
"$match": {
"documentKey._id": {
"$oid": "64d3523e513b403666940280"
},
"$or": [
{
"operationType": "delete"
},
{
"fullDocument.revision": {
"$gt": 7
}
}
]
}
}
],
"cursor": {},
"$db": "machine",
"lsid": {
"id": {
"$uuid": "0b71cca3-bf6c-44cc-b155-1eec87b3efec"
}
},
"$clusterTime": {
"clusterTime": {
"$timestamp": {
"t": 1691772760,
"i": 4
}
},
"signature": {
"hash": {
"$binary": {
"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
"subType": "0"
}
},
"keyId": 0
}
}
},
"planSummary": "COLLSCAN",
"cursorid": 5014928019613299000,
"keysExamined": 0,
"docsExamined": 165822,
"numYields": 226,
"nreturned": 3,
"queryHash": "427756DB",
"queryFramework": "classic",
"reslen": 310,
"locks": {
"ParallelBatchWriterMode": {
"acquireCount": {
"r": 1
}
},
"FeatureCompatibilityVersion": {
"acquireCount": {
"r": 232
}
},
"ReplicationStateTransition": {
"acquireCount": {
"w": 1
}
},
"Global": {
"acquireCount": {
"r": 232
}
},
"Database": {
"acquireCount": {
"r": 1
}
},
"Collection": {
"acquireCount": {
"r": 1
}
},
"Mutex": {
"acquireCount": {
"r": 6
}
}
},
"readConcern": {
"level": "majority"
},
"writeConcern": {
"w": "majority",
"wtimeout": 0,
"provenance": "implicitDefault"
},
"storage": {
"data": {
"bytesRead": 18609311,
"timeReadingMicros": 95082
}
},
"remote": "10.42.4.30:34558",
"protocol": "op_msg",
"durationMillis": 18404
}
Why are we deleting so many objects? Why is each operation taking so long (the entry above took 18.4 s)?
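To put numbers on the change-stream entry above: the collection scan examined 165,822 documents to return 3, which goes a long way toward explaining the 18.4 s duration. The figures below are copied straight from the log:

```python
# Figures from the aggregate/$changeStream slow-query entry above.
docs_examined = 165_822
n_returned = 3
duration_ms = 18_404

print(docs_examined / n_returned)  # 55274.0 documents examined per document returned
print(docs_examined / (duration_ms / 1000))  # rough scan throughput, docs/second
```

Over fifty thousand documents examined per document returned is the signature of a scan with almost no selectivity.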
Attached: mongo slow log.txt. This is a better file: it shows the deletes, the same id across many of the deletes, and how they repeat!
For what it's worth, when a project is deleted in Scripture Forge, it triggers the deletion of every associated resource in Serval. I don't know whether that might be related. It doesn't seem like that should overwhelm MongoDB.
The deletes are repeating; I think the cause is in the Serval logic. The same delete just keeps repeating over and over again...
Not assigned to me, but just peeking. I notice the query plan is COLLSCAN, meaning it's doing a full collection scan, which is super slow. Could we create an index? The logic may be broken somewhere, and there isn't a huge number of documents, but an index might at least speed up these queries.
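One caveat on the index idea: a $changeStream aggregation reads the oplog rather than the collection, so a collection index may not help that particular COLLSCAN. But as a toy model of why COLLSCAN vs IXSCAN matters for ordinary queries (pure Python, purely illustrative):

```python
# A list scan stands in for COLLSCAN; a dict lookup stands in for an IXSCAN
# backed by an index on _id. Same answer, very different number of documents examined.
docs = [{"_id": i, "revision": i % 10} for i in range(200_000)]

def collscan(target_id):
    # Examines every document, like planSummary: COLLSCAN.
    return [d for d in docs if d["_id"] == target_id]

by_id = {d["_id"]: d for d in docs}  # the "index"

def ixscan(target_id):
    # Examines ~1 entry, like planSummary: IXSCAN.
    doc = by_id.get(target_id)
    return [doc] if doc is not None else []

assert collscan(123_456) == ixscan(123_456)
```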
As for the logic, it looks like something is subscribing to deletes, or to changes to revision, on a document? I wonder where that is in the code, haha. If you need any help debugging, I love dbs :).
The subscription itself is right here: https://github.com/sillsdev/serval/blob/main/src/SIL.DataAccess/MongoSubscription.cs (lines 37-42). It is called from https://github.com/sillsdev/serval/blob/main/src/Serval.Translation/Services/BuildService.cs (line 43), which I'm guessing is called mostly by GetBuild and GetCurrentBuild in the translation engines controller. Both of those have a timeout. Even so, if something on the SF side keeps polling GetBuild or, more likely, GetCurrentBuild, that could be the problem. Assuming that's true, using webhooks might be the solution.
Our UI (still in development) polls for the current build because that's currently the only option. As of now it's a 15s poll, which shouldn't be much of a strain at all. Obviously not an ideal solution, but I don't think it should be failing.
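If each GetCurrentBuild call ends up opening a fresh change-stream aggregation server-side, even a modest 15 s poll adds up across clients. A back-of-the-envelope sketch (the watcher count is hypothetical):

```python
poll_interval_s = 15   # the UI's current poll interval
watchers = 10          # hypothetical number of clients polling concurrently

# One change-stream aggregation per poll, per watcher.
calls_per_hour = (3600 // poll_interval_s) * watchers
print(calls_per_hour)  # 2400 change-stream aggregations per hour
```

At ~18 s per aggregation (as in the slow log), that volume alone could keep the server busy, which is why the "subscription never ends" bug discussed below compounds so badly.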
Some more observations:
Now for thoughts on what could be wrong and how we may fix it:
Might changing the pool size help? https://stackoverflow.com/questions/48411897/severe-performance-drop-with-mongodb-change-streams
Interesting about PoolSize; here is an article on it, and it appears to already be set to 100 by default: https://www.mongodb.com/docs/drivers/csharp/v2.19/faq/#how-does-connection-pooling-work-in-the-.net-c--driver-
Plan part A:
Plan part B:
Looking at the logs, I believe the culprit is the SubscribeForCancellation class. There are a number of issues with the class. The main issue is that it will continue to subscribe to the translation_engines collection indefinitely, even after a job is finished.
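One way to guarantee a cancellation subscription cannot outlive its job is to tie the stream's lifetime to the job's scope. A minimal Python sketch of the idea (all names hypothetical; the real code is C# in SIL.DataAccess):

```python
import contextlib

class ChangeStream:
    """Stand-in for a change-stream cursor on translation_engines (hypothetical)."""
    open_count = 0

    def __init__(self):
        ChangeStream.open_count += 1
        self.closed = False

    def close(self):
        if not self.closed:
            self.closed = True
            ChangeStream.open_count -= 1

@contextlib.contextmanager
def subscribe_for_cancellation(job_id):
    # Open the stream on entry and always close it on exit, so nothing keeps
    # watching the collection after the job finishes (normally or with an error).
    stream = ChangeStream()
    try:
        yield stream
    finally:
        stream.close()

with subscribe_for_cancellation("job-1") as stream:
    pass  # run the job; poll `stream` for a cancellation event

print(ChangeStream.open_count)  # 0: no subscription outlives its job
```

In C# the equivalent discipline would be implementing IDisposable/IAsyncDisposable on the subscription and always using it inside a using block.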
The first fix is implemented (Thanks Damien for the find).
Here is an implementation plan:
rate (container_cpu_usage_seconds_total{image!="", namespace="serval", container!="POD"}[3m])
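For reference, that PromQL expression gives per-container CPU usage in cores: rate() over a monotonic counter is roughly (last sample - first sample) / window (Prometheus also extrapolates slightly at the window edges). A sketch with made-up samples:

```python
# (timestamp_s, container_cpu_usage_seconds_total) samples over a 3-minute window
samples = [(0, 120.0), (60, 150.0), (120, 210.0), (180, 300.0)]

(t0, v0), (t1, v1) = samples[0], samples[-1]
cores_busy = (v1 - v0) / (t1 - t0)
print(cores_busy)  # 1.0 -> one full core busy on average, i.e. "100% loaded" on one core
```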
Mongo has been at 100% load for the past 2 days. What gives?