sillsdev / serval

A REST API for natural language processing services
MIT License

Mongo Pinging 100% #84

Closed johnml1135 closed 1 year ago

johnml1135 commented 1 year ago

rate(container_cpu_usage_seconds_total{image!="", namespace="serval", container!="POD"}[3m])

[image]

Mongo has been at 100% load for the past 2 days - what gives?

johnml1135 commented 1 year ago

I restarted Mongo and got this: [image]

why?

johnml1135 commented 1 year ago

Here are the "really slow" messages that are no longer occurring after I restarted Mongo:

{
  "log": "{\"t\":{\"$date\":\"2023-08-11T20:06:54.292+00:00\"},\"s\":\"I\",  \"c\":\"COMMAND\",  \"id\":51803,   \"ctx\":\"conn16\",\"msg\":\"Slow query\",\"attr\":{\"type\":\"command\",\"ns\":\"machine_jobs.hangfire.jobGraph\",\"command\":{\"findAndModify\":\"hangfire.jobGraph\",\"query\":{\"$and\":[{\"Queue\":\"smt_transfer\"},{\"_t\":\"JobDto\"},{\"FetchedAt\":null}]},\"update\":{\"$set\":{\"FetchedAt\":{\"$date\":\"2023-08-11T20:06:51.404Z\"}}},\"new\":true,\"txnNumber\":1,\"$db\":\"machine_jobs\",\"lsid\":{\"id\":{\"$uuid\":\"0b474ddb-e39b-40f6-a47c-bb5b7ca4b4dd\"}},\"$clusterTime\":{\"clusterTime\":{\"$timestamp\":{\"t\":1691784357,\"i\":5}},\"signature\":{\"hash\":{\"$binary\":{\"base64\":\"AAAAAAAAAAAAAAAAAAAAAAAAAAA=\",\"subType\":\"0\"}},\"keyId\":0}}},\"planSummary\":\"IXSCAN { Queue: -1 }\",\"keysExamined\":0,\"docsExamined\":0,\"fromMultiPlanner\":true,\"nMatched\":0,\"nModified\":0,\"nUpserted\":0,\"numYields\":2,\"queryHash\":\"88F449FE\",\"planCacheKey\":\"04E29608\",\"reslen\":217,\"locks\":{\"ParallelBatchWriterMode\":{\"acquireCount\":{\"r\":4}},\"FeatureCompatibilityVersion\":{\"acquireCount\":{\"r\":5,\"w\":3}},\"ReplicationStateTransition\":{\"acquireCount\":{\"w\":7}},\"Global\":{\"acquireCount\":{\"r\":5,\"w\":3}},\"Database\":{\"acquireCount\":{\"w\":3}},\"Collection\":{\"acquireCount\":{\"w\":3}},\"Mutex\":{\"acquireCount\":{\"r\":3}}},\"flowControl\":{\"acquireCount\":3,\"timeAcquiringMicros\":4},\"readConcern\":{\"provenance\":\"implicitDefault\"},\"writeConcern\":{\"w\":\"majority\",\"wtimeout\":0,\"provenance\":\"implicitDefault\"},\"storage\":{\"data\":{\"bytesRead\":3194,\"timeReadingMicros\":6},\"timeWaitingMicros\":{\"handleLock\":2,\"schemaLock\":229689}},\"remote\":\"10.42.5.87:33498\",\"protocol\":\"op_msg\",\"durationMillis\":256}}\n",
  "stream": "stdout",
  "time": "2023-08-11T20:06:54.292425547Z",
  "t": {
    "$date": "2023-08-11T20:06:54.292+00:00"
  },
  "s": "I",
  "c": "COMMAND",
  "id": 51803,
  "ctx": "conn16",
  "msg": "Slow query",
  "attr": {
    "type": "command",
    "ns": "machine_jobs.hangfire.jobGraph",
    "command": {
      "findAndModify": "hangfire.jobGraph",
      "query": {
        "$and": [
          {
            "Queue": "smt_transfer"
          },
          {
            "_t": "JobDto"
          },
          {
            "FetchedAt": null
          }
        ]
      },
      "update": {
        "$set": {
          "FetchedAt": {
            "$date": "2023-08-11T20:06:51.404Z"
          }
        }
      },
      "new": true,
      "txnNumber": 1,
      "$db": "machine_jobs",
      "lsid": {
        "id": {
          "$uuid": "0b474ddb-e39b-40f6-a47c-bb5b7ca4b4dd"
        }
      },
      "$clusterTime": {
        "clusterTime": {
          "$timestamp": {
            "t": 1691784357,
            "i": 5
          }
        },
        "signature": {
          "hash": {
            "$binary": {
              "base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
              "subType": "0"
            }
          },
          "keyId": 0
        }
      }
    },
    "planSummary": "IXSCAN { Queue: -1 }",
    "keysExamined": 0,
    "docsExamined": 0,
    "fromMultiPlanner": true,
    "nMatched": 0,
    "nModified": 0,
    "nUpserted": 0,
    "numYields": 2,
    "queryHash": "88F449FE",
    "planCacheKey": "04E29608",
    "reslen": 217,
    "locks": {
      "ParallelBatchWriterMode": {
        "acquireCount": {
          "r": 4
        }
      },
      "FeatureCompatibilityVersion": {
        "acquireCount": {
          "r": 5,
          "w": 3
        }
      },
      "ReplicationStateTransition": {
        "acquireCount": {
          "w": 7
        }
      },
      "Global": {
        "acquireCount": {
          "r": 5,
          "w": 3
        }
      },
      "Database": {
        "acquireCount": {
          "w": 3
        }
      },
      "Collection": {
        "acquireCount": {
          "w": 3
        }
      },
      "Mutex": {
        "acquireCount": {
          "r": 3
        }
      }
    },
    "flowControl": {
      "acquireCount": 3,
      "timeAcquiringMicros": 4
    },
    "readConcern": {
      "provenance": "implicitDefault"
    },
    "writeConcern": {
      "w": "majority",
      "wtimeout": 0,
      "provenance": "implicitDefault"
    },
    "storage": {
      "data": {
        "bytesRead": 3194,
        "timeReadingMicros": 6
      },
      "timeWaitingMicros": {
        "handleLock": 2,
        "schemaLock": 229689
      }
    },
    "remote": "10.42.5.87:33498",
    "protocol": "op_msg",
    "durationMillis": 256
  }
}
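
For reference, the slow query above is Hangfire's job-fetch operation: Hangfire's Mongo storage claims a queued job by atomically stamping FetchedAt with findAndModify. A minimal sketch of an equivalent call with the MongoDB C# driver, using the database, collection, and field names from the log (the connection string and top-level-statement form are assumptions, not Serval code):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var jobGraph = client.GetDatabase("machine_jobs")
    .GetCollection<BsonDocument>("hangfire.jobGraph");

// Mirror of the $and filter in the slow-query log: a JobDto queued on
// "smt_transfer" that no worker has fetched yet.
var filter = Builders<BsonDocument>.Filter.And(
    Builders<BsonDocument>.Filter.Eq("Queue", "smt_transfer"),
    Builders<BsonDocument>.Filter.Eq("_t", "JobDto"),
    Builders<BsonDocument>.Filter.Eq("FetchedAt", BsonNull.Value));

// Atomically claim the job by stamping FetchedAt and return the updated document.
var update = Builders<BsonDocument>.Update.Set("FetchedAt", DateTime.UtcNow);
var claimed = await jobGraph.FindOneAndUpdateAsync(
    filter,
    update,
    new FindOneAndUpdateOptions<BsonDocument> { ReturnDocument = ReturnDocument.After });

Console.WriteLine(claimed == null ? "no queued job" : claimed.ToJson());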
johnml1135 commented 1 year ago

I think it's related to https://github.com/HangfireIO/Hangfire/issues/1962.

johnml1135 commented 1 year ago

Correction - 80+% of the logs are change stream aggregations on the translation_engines collection...

{
    "type": "command",
    "ns": "machine.translation_engines",
    "command": {
        "aggregate": "translation_engines",
        "pipeline": [
            {
                "$changeStream": {
                    "fullDocument": "updateLookup",
                    "startAtOperationTime": {
                        "$timestamp": {
                            "t": 1691586149,
                            "i": 2
                        }
                    }
                }
            },
            {
                "$match": {
                    "documentKey._id": {
                        "$oid": "64d3523e513b403666940280"
                    },
                    "$or": [
                        {
                            "operationType": "delete"
                        },
                        {
                            "fullDocument.revision": {
                                "$gt": 7
                            }
                        }
                    ]
                }
            }
        ],
        "cursor": {},
        "$db": "machine",
        "lsid": {
            "id": {
                "$uuid": "0b71cca3-bf6c-44cc-b155-1eec87b3efec"
            }
        },
        "$clusterTime": {
            "clusterTime": {
                "$timestamp": {
                    "t": 1691772760,
                    "i": 4
                }
            },
            "signature": {
                "hash": {
                    "$binary": {
                        "base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
                        "subType": "0"
                    }
                },
                "keyId": 0
            }
        }
    },
    "planSummary": "COLLSCAN",
    "cursorid": 5014928019613299000,
    "keysExamined": 0,
    "docsExamined": 165822,
    "numYields": 226,
    "nreturned": 3,
    "queryHash": "427756DB",
    "queryFramework": "classic",
    "reslen": 310,
    "locks": {
        "ParallelBatchWriterMode": {
            "acquireCount": {
                "r": 1
            }
        },
        "FeatureCompatibilityVersion": {
            "acquireCount": {
                "r": 232
            }
        },
        "ReplicationStateTransition": {
            "acquireCount": {
                "w": 1
            }
        },
        "Global": {
            "acquireCount": {
                "r": 232
            }
        },
        "Database": {
            "acquireCount": {
                "r": 1
            }
        },
        "Collection": {
            "acquireCount": {
                "r": 1
            }
        },
        "Mutex": {
            "acquireCount": {
                "r": 6
            }
        }
    },
    "readConcern": {
        "level": "majority"
    },
    "writeConcern": {
        "w": "majority",
        "wtimeout": 0,
        "provenance": "implicitDefault"
    },
    "storage": {
        "data": {
            "bytesRead": 18609311,
            "timeReadingMicros": 95082
        }
    },
    "remote": "10.42.4.30:34558",
    "protocol": "op_msg",
    "durationMillis": 18404
}
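
For context, the slow aggregation above is a MongoDB change stream on the translation_engines collection: it waits for one specific engine document to be deleted or for its revision to advance past the last value the caller saw. A minimal sketch of how such a stream is opened with the MongoDB C# driver (the connection string and the BsonDocument pipeline are illustrative; this is not the actual Serval code):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

var engines = new MongoClient("mongodb://localhost:27017")
    .GetDatabase("machine")
    .GetCollection<BsonDocument>("translation_engines");

// $match stage from the log: fire when this engine is deleted or its
// revision moves past 7 (the last revision the subscriber had seen).
var pipeline = new EmptyPipelineDefinition<ChangeStreamDocument<BsonDocument>>()
    .Match(new BsonDocument
    {
        { "documentKey._id", new ObjectId("64d3523e513b403666940280") },
        { "$or", new BsonArray
            {
                new BsonDocument("operationType", "delete"),
                new BsonDocument("fullDocument.revision", new BsonDocument("$gt", 7))
            }
        }
    });

var options = new ChangeStreamOptions { FullDocument = ChangeStreamFullDocumentOption.UpdateLookup };

// Every open cursor keeps a server-side aggregation like the one logged above alive.
using var cursor = await engines.WatchAsync(pipeline, options);
while (await cursor.MoveNextAsync())
{
    foreach (var change in cursor.Current)
        Console.WriteLine(change.OperationType);
}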
johnml1135 commented 1 year ago

Why are we deleting so many objects? Why is each operation taking 30 seconds?

johnml1135 commented 1 year ago

mongo slow log.txt - This is a better file: it shows the deletes, the same id appearing across many deletes, and how they keep repeating.

Nateowami commented 1 year ago

For what it's worth, when a project is deleted in Scripture Forge, it triggers the deletion of every associated resource in Serval. I don't know whether that might be related. It doesn't seem like that should overwhelm MongoDB.

johnml1135 commented 1 year ago

The deletes are repeating - I think the issue is in the Serval logic. It just keeps repeating over and over again...

Enkidu93 commented 1 year ago

Not assigned to me, but just peeking. I notice the query plan is COLLSCAN, meaning it's doing a full collection scan, which is super slow. Could we create an index? The logic may be broken somewhere, and there aren't a huge number of documents, but an index might at least speed up these queries.

As for the logic, it looks like something is subscribing to deletes or changes to revision on a document??? I wonder where that is in the code haha. If you need any help debugging, I love dbs :).

Enkidu93 commented 1 year ago

The subscription itself is right here: https://github.com/sillsdev/serval/blob/main/src/SIL.DataAccess/MongoSubscription.cs lines 37-42, which is getting called here: https://github.com/sillsdev/serval/blob/main/src/Serval.Translation/Services/BuildService.cs line 43, which I'm guessing is getting called mostly by GetBuild and GetCurrentBuild in the translation engines controller. But both of those have a timeout. Even so, if there is something on the SF side that kept polling GetBuild or, more likely, GetCurrentBuild, that could be the problem. Assuming that's true, using webhooks might be the solution.
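
To illustrate the pattern being described (purely hypothetical names; this is not the actual MongoSubscription/BuildService code): each GetBuild/GetCurrentBuild call opens a change subscription, waits up to a timeout for the build to change, and should release the subscription when it is done, so a client polling GetCurrentBuild effectively opens a fresh change stream on every poll.

using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical long-poll shape, for illustration only.
public interface IBuildSubscription : IAsyncDisposable
{
    // Returns true if the watched build changed before the timeout elapsed.
    Task<bool> WaitForChangeAsync(TimeSpan timeout, CancellationToken cancellationToken);
}

public static class LongPollExample
{
    public static async Task<bool> WaitForBuildUpdateAsync(
        Func<Task<IBuildSubscription>> subscribeAsync, CancellationToken ct)
    {
        // If the subscription is not disposed here, the underlying change stream
        // cursor (and the server-side aggregation behind it) stays open.
        await using IBuildSubscription subscription = await subscribeAsync();

        // Timeout value is arbitrary for the sketch.
        return await subscription.WaitForChangeAsync(TimeSpan.FromSeconds(40), ct);
    }
}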

Nateowami commented 1 year ago

Our UI (still in development) polls for the current build because that's currently the only option. As of now it's a 15s poll, which shouldn't be much of a strain at all. Obviously not an ideal solution, but I don't think it should be failing.

johnml1135 commented 1 year ago

Some more observations:

Now for thoughts on what could be wrong and how we may fix it:

Enkidu93 commented 1 year ago

Might changing the pool size help? https://stackoverflow.com/questions/48411897/severe-performance-drop-with-mongodb-change-streams

johnml1135 commented 1 year ago

Interesting about PoolSize - here is an article on it - it appears to already be set to 100: https://www.mongodb.com/docs/drivers/csharp/v2.19/faq/#how-does-connection-pooling-work-in-the-.net-c--driver-
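
If the pool size did need to change, it can be set either in the connection string or on MongoClientSettings in the C# driver; the default maxPoolSize is 100, as the FAQ above notes. The connection string and the value 200 below are only illustrative:

using MongoDB.Driver;

// Option 1: connection string (value is illustrative).
var clientFromUrl = new MongoClient("mongodb://localhost:27017/?maxPoolSize=200");

// Option 2: MongoClientSettings.
var settings = MongoClientSettings.FromConnectionString("mongodb://localhost:27017");
settings.MaxConnectionPoolSize = 200;
var clientFromSettings = new MongoClient(settings);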

johnml1135 commented 1 year ago

Plan part A:

Plan part B:

ddaspit commented 1 year ago

Looking at the logs, I believe the culprit is the SubscribeForCancellation class. There are a number of issues with the class. The main issue is that it will continue to subscribe to the translation_engines collection indefinitely even after a job is finished.
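
As an illustration of the kind of fix this implies (hypothetical names; SubscribeForCancellation itself is not reproduced here): the change stream that watches for cancellation should be torn down as soon as the job completes, for example by tying the watch loop to a token that is cancelled when the job finishes, rather than leaving the subscription open indefinitely.

using System;
using System.Threading;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class CancellationWatcher
{
    // jobFinished should be cancelled by the caller when the job completes.
    public static async Task WatchForCancellationAsync(
        IMongoCollection<BsonDocument> engines,
        ObjectId engineId,
        CancellationToken jobFinished,
        Action onCancelled)
    {
        var pipeline = new EmptyPipelineDefinition<ChangeStreamDocument<BsonDocument>>()
            .Match(new BsonDocument("documentKey._id", engineId));

        using var cursor = await engines.WatchAsync(pipeline, cancellationToken: jobFinished);
        try
        {
            // The loop (and the server-side aggregation behind it) ends as soon
            // as jobFinished is signalled, instead of staying subscribed forever.
            while (await cursor.MoveNextAsync(jobFinished))
            {
                foreach (var change in cursor.Current)
                {
                    if (change.OperationType == ChangeStreamOperationType.Delete)
                        onCancelled();
                }
            }
        }
        catch (OperationCanceledException)
        {
            // Job finished; stop watching.
        }
    }
}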

johnml1135 commented 1 year ago

The first fix is implemented (thanks, Damien, for the find).

johnml1135 commented 1 year ago

Here is an implementation plan: