sillsdev / serval

A REST API for natural language processing services
MIT License

Mongo Pinging 100% #84

Closed johnml1135 closed 1 year ago

johnml1135 commented 1 year ago

rate(container_cpu_usage_seconds_total{image!="", namespace="serval", container!="POD"}[3m])

[image]

Mongo has been at 100% load for the past 2 days - what gives?

johnml1135 commented 1 year ago

I restarted Mongo and got this: [image]

why?

johnml1135 commented 1 year ago

Here are the "really slow" messages that are no longer occurring after I restarted Mongo:

{
  "log": "{\"t\":{\"$date\":\"2023-08-11T20:06:54.292+00:00\"},\"s\":\"I\",  \"c\":\"COMMAND\",  \"id\":51803,   \"ctx\":\"conn16\",\"msg\":\"Slow query\",\"attr\":{\"type\":\"command\",\"ns\":\"machine_jobs.hangfire.jobGraph\",\"command\":{\"findAndModify\":\"hangfire.jobGraph\",\"query\":{\"$and\":[{\"Queue\":\"smt_transfer\"},{\"_t\":\"JobDto\"},{\"FetchedAt\":null}]},\"update\":{\"$set\":{\"FetchedAt\":{\"$date\":\"2023-08-11T20:06:51.404Z\"}}},\"new\":true,\"txnNumber\":1,\"$db\":\"machine_jobs\",\"lsid\":{\"id\":{\"$uuid\":\"0b474ddb-e39b-40f6-a47c-bb5b7ca4b4dd\"}},\"$clusterTime\":{\"clusterTime\":{\"$timestamp\":{\"t\":1691784357,\"i\":5}},\"signature\":{\"hash\":{\"$binary\":{\"base64\":\"AAAAAAAAAAAAAAAAAAAAAAAAAAA=\",\"subType\":\"0\"}},\"keyId\":0}}},\"planSummary\":\"IXSCAN { Queue: -1 }\",\"keysExamined\":0,\"docsExamined\":0,\"fromMultiPlanner\":true,\"nMatched\":0,\"nModified\":0,\"nUpserted\":0,\"numYields\":2,\"queryHash\":\"88F449FE\",\"planCacheKey\":\"04E29608\",\"reslen\":217,\"locks\":{\"ParallelBatchWriterMode\":{\"acquireCount\":{\"r\":4}},\"FeatureCompatibilityVersion\":{\"acquireCount\":{\"r\":5,\"w\":3}},\"ReplicationStateTransition\":{\"acquireCount\":{\"w\":7}},\"Global\":{\"acquireCount\":{\"r\":5,\"w\":3}},\"Database\":{\"acquireCount\":{\"w\":3}},\"Collection\":{\"acquireCount\":{\"w\":3}},\"Mutex\":{\"acquireCount\":{\"r\":3}}},\"flowControl\":{\"acquireCount\":3,\"timeAcquiringMicros\":4},\"readConcern\":{\"provenance\":\"implicitDefault\"},\"writeConcern\":{\"w\":\"majority\",\"wtimeout\":0,\"provenance\":\"implicitDefault\"},\"storage\":{\"data\":{\"bytesRead\":3194,\"timeReadingMicros\":6},\"timeWaitingMicros\":{\"handleLock\":2,\"schemaLock\":229689}},\"remote\":\"10.42.5.87:33498\",\"protocol\":\"op_msg\",\"durationMillis\":256}}\n",
  "stream": "stdout",
  "time": "2023-08-11T20:06:54.292425547Z",
  "t": {
    "$date": "2023-08-11T20:06:54.292+00:00"
  },
  "s": "I",
  "c": "COMMAND",
  "id": 51803,
  "ctx": "conn16",
  "msg": "Slow query",
  "attr": {
    "type": "command",
    "ns": "machine_jobs.hangfire.jobGraph",
    "command": {
      "findAndModify": "hangfire.jobGraph",
      "query": {
        "$and": [
          {
            "Queue": "smt_transfer"
          },
          {
            "_t": "JobDto"
          },
          {
            "FetchedAt": null
          }
        ]
      },
      "update": {
        "$set": {
          "FetchedAt": {
            "$date": "2023-08-11T20:06:51.404Z"
          }
        }
      },
      "new": true,
      "txnNumber": 1,
      "$db": "machine_jobs",
      "lsid": {
        "id": {
          "$uuid": "0b474ddb-e39b-40f6-a47c-bb5b7ca4b4dd"
        }
      },
      "$clusterTime": {
        "clusterTime": {
          "$timestamp": {
            "t": 1691784357,
            "i": 5
          }
        },
        "signature": {
          "hash": {
            "$binary": {
              "base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
              "subType": "0"
            }
          },
          "keyId": 0
        }
      }
    },
    "planSummary": "IXSCAN { Queue: -1 }",
    "keysExamined": 0,
    "docsExamined": 0,
    "fromMultiPlanner": true,
    "nMatched": 0,
    "nModified": 0,
    "nUpserted": 0,
    "numYields": 2,
    "queryHash": "88F449FE",
    "planCacheKey": "04E29608",
    "reslen": 217,
    "locks": {
      "ParallelBatchWriterMode": {
        "acquireCount": {
          "r": 4
        }
      },
      "FeatureCompatibilityVersion": {
        "acquireCount": {
          "r": 5,
          "w": 3
        }
      },
      "ReplicationStateTransition": {
        "acquireCount": {
          "w": 7
        }
      },
      "Global": {
        "acquireCount": {
          "r": 5,
          "w": 3
        }
      },
      "Database": {
        "acquireCount": {
          "w": 3
        }
      },
      "Collection": {
        "acquireCount": {
          "w": 3
        }
      },
      "Mutex": {
        "acquireCount": {
          "r": 3
        }
      }
    },
    "flowControl": {
      "acquireCount": 3,
      "timeAcquiringMicros": 4
    },
    "readConcern": {
      "provenance": "implicitDefault"
    },
    "writeConcern": {
      "w": "majority",
      "wtimeout": 0,
      "provenance": "implicitDefault"
    },
    "storage": {
      "data": {
        "bytesRead": 3194,
        "timeReadingMicros": 6
      },
      "timeWaitingMicros": {
        "handleLock": 2,
        "schemaLock": 229689
      }
    },
    "remote": "10.42.5.87:33498",
    "protocol": "op_msg",
    "durationMillis": 256
  }
}
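
For reference, the slow query above is Hangfire's job-fetch operation: Hangfire's Mongo storage claims a queued job by atomically stamping FetchedAt with findAndModify. A minimal sketch of an equivalent call with the MongoDB C# driver, using the database, collection, and field names from the log (the connection string and top-level-statement form are assumptions, not Serval code):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var jobGraph = client.GetDatabase("machine_jobs")
    .GetCollection<BsonDocument>("hangfire.jobGraph");

// Mirror of the $and filter in the slow-query log: a JobDto queued on
// "smt_transfer" that no worker has fetched yet.
var filter = Builders<BsonDocument>.Filter.And(
    Builders<BsonDocument>.Filter.Eq("Queue", "smt_transfer"),
    Builders<BsonDocument>.Filter.Eq("_t", "JobDto"),
    Builders<BsonDocument>.Filter.Eq("FetchedAt", BsonNull.Value));

// Atomically claim the job by stamping FetchedAt and return the updated document.
var update = Builders<BsonDocument>.Update.Set("FetchedAt", DateTime.UtcNow);
var claimed = await jobGraph.FindOneAndUpdateAsync(
    filter,
    update,
    new FindOneAndUpdateOptions<BsonDocument> { ReturnDocument = ReturnDocument.After });

Console.WriteLine(claimed == null ? "no queued job" : claimed.ToJson());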
johnml1135 commented 1 year ago

I think it's related to https://github.com/HangfireIO/Hangfire/issues/1962.

johnml1135 commented 1 year ago

Correction - 80+% of the logs are change stream aggregations on the translation_engines collection...

{
    "type": "command",
    "ns": "machine.translation_engines",
    "command": {
        "aggregate": "translation_engines",
        "pipeline": [
            {
                "$changeStream": {
                    "fullDocument": "updateLookup",
                    "startAtOperationTime": {
                        "$timestamp": {
                            "t": 1691586149,
                            "i": 2
                        }
                    }
                }
            },
            {
                "$match": {
                    "documentKey._id": {
                        "$oid": "64d3523e513b403666940280"
                    },
                    "$or": [
                        {
                            "operationType": "delete"
                        },
                        {
                            "fullDocument.revision": {
                                "$gt": 7
                            }
                        }
                    ]
                }
            }
        ],
        "cursor": {},
        "$db": "machine",
        "lsid": {
            "id": {
                "$uuid": "0b71cca3-bf6c-44cc-b155-1eec87b3efec"
            }
        },
        "$clusterTime": {
            "clusterTime": {
                "$timestamp": {
                    "t": 1691772760,
                    "i": 4
                }
            },
            "signature": {
                "hash": {
                    "$binary": {
                        "base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
                        "subType": "0"
                    }
                },
                "keyId": 0
            }
        }
    },
    "planSummary": "COLLSCAN",
    "cursorid": 5014928019613299000,
    "keysExamined": 0,
    "docsExamined": 165822,
    "numYields": 226,
    "nreturned": 3,
    "queryHash": "427756DB",
    "queryFramework": "classic",
    "reslen": 310,
    "locks": {
        "ParallelBatchWriterMode": {
            "acquireCount": {
                "r": 1
            }
        },
        "FeatureCompatibilityVersion": {
            "acquireCount": {
                "r": 232
            }
        },
        "ReplicationStateTransition": {
            "acquireCount": {
                "w": 1
            }
        },
        "Global": {
            "acquireCount": {
                "r": 232
            }
        },
        "Database": {
            "acquireCount": {
                "r": 1
            }
        },
        "Collection": {
            "acquireCount": {
                "r": 1
            }
        },
        "Mutex": {
            "acquireCount": {
                "r": 6
            }
        }
    },
    "readConcern": {
        "level": "majority"
    },
    "writeConcern": {
        "w": "majority",
        "wtimeout": 0,
        "provenance": "implicitDefault"
    },
    "storage": {
        "data": {
            "bytesRead": 18609311,
            "timeReadingMicros": 95082
        }
    },
    "remote": "10.42.4.30:34558",
    "protocol": "op_msg",
    "durationMillis": 18404
}
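
For context, the slow aggregation above is a MongoDB change stream on the translation_engines collection: it waits for one specific engine document to be deleted or for its revision to advance past the last value the caller saw. A minimal sketch of how such a stream is opened with the MongoDB C# driver (the connection string and the BsonDocument pipeline are illustrative; this is not the actual Serval code):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

var engines = new MongoClient("mongodb://localhost:27017")
    .GetDatabase("machine")
    .GetCollection<BsonDocument>("translation_engines");

// $match stage from the log: fire when this engine is deleted or its
// revision moves past 7 (the last revision the subscriber had seen).
var pipeline = new EmptyPipelineDefinition<ChangeStreamDocument<BsonDocument>>()
    .Match(new BsonDocument
    {
        { "documentKey._id", new ObjectId("64d3523e513b403666940280") },
        { "$or", new BsonArray
            {
                new BsonDocument("operationType", "delete"),
                new BsonDocument("fullDocument.revision", new BsonDocument("$gt", 7))
            }
        }
    });

var options = new ChangeStreamOptions { FullDocument = ChangeStreamFullDocumentOption.UpdateLookup };

// Every open cursor keeps a server-side aggregation like the one logged above alive.
using var cursor = await engines.WatchAsync(pipeline, options);
while (await cursor.MoveNextAsync())
{
    foreach (var change in cursor.Current)
        Console.WriteLine(change.OperationType);
}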
johnml1135 commented 1 year ago

Why are we deleting so many objects? Why is each operation taking 30 seconds?

johnml1135 commented 1 year ago

mongo slow log.txt - This is a better file: it shows the deletes, the same id appearing across many deletes, and how they keep repeating.

Nateowami commented 1 year ago

For what it's worth, when a project is deleted in Scripture Forge, it triggers the deletion of every associated resource in Serval. I don't know whether that might be related. It doesn't seem like that should overwhelm MongoDB.

johnml1135 commented 1 year ago

The deletes are repeating - I think the issue is in the Serval logic. It just keeps repeating over and over again...

Enkidu93 commented 1 year ago

Not assigned to me, but just peeking. I notice the query plan is COLLSCAN, meaning it's doing a full collection scan, which is super slow. Could we create an index? The logic may be broken somewhere, and there aren't a huge number of documents, but an index might at least speed up these queries.

As for the logic, it looks like something is subscribing to deletes or changes to revision on a document??? I wonder where that is in the code haha. If you need any help debugging, I love dbs :).

Enkidu93 commented 1 year ago

The subscription itself is right here: https://github.com/sillsdev/serval/blob/main/src/SIL.DataAccess/MongoSubscription.cs lines 37-42, which is getting called here: https://github.com/sillsdev/serval/blob/main/src/Serval.Translation/Services/BuildService.cs line 43, which I'm guessing is getting called mostly by GetBuild and GetCurrentBuild in the translation engines controller. But both of those have a timeout. Even so, if there is something on the SF side that kept polling GetBuild or, more likely, GetCurrentBuild, that could be the problem. Assuming that's true, using webhooks might be the solution.
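
To illustrate the pattern being described (purely hypothetical names; this is not the actual MongoSubscription/BuildService code): each GetBuild/GetCurrentBuild call opens a change subscription, waits up to a timeout for the build to change, and should release the subscription when it is done, so a client polling GetCurrentBuild effectively opens a fresh change stream on every poll.

using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical long-poll shape, for illustration only.
public interface IBuildSubscription : IAsyncDisposable
{
    // Returns true if the watched build changed before the timeout elapsed.
    Task<bool> WaitForChangeAsync(TimeSpan timeout, CancellationToken cancellationToken);
}

public static class LongPollExample
{
    public static async Task<bool> WaitForBuildUpdateAsync(
        Func<Task<IBuildSubscription>> subscribeAsync, CancellationToken ct)
    {
        // If the subscription is not disposed here, the underlying change stream
        // cursor (and the server-side aggregation behind it) stays open.
        await using IBuildSubscription subscription = await subscribeAsync();

        // Timeout value is arbitrary for the sketch.
        return await subscription.WaitForChangeAsync(TimeSpan.FromSeconds(40), ct);
    }
}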

Nateowami commented 1 year ago

Our UI (still in development) polls for the current build because that's currently the only option. As of now it's a 15s poll, which shouldn't be much of a strain at all. Obviously not an ideal solution, but I don't think it should be failing.

johnml1135 commented 1 year ago

Some more observations:

Now for thoughts on what could be wrong and how we may fix it:

Enkidu93 commented 1 year ago

Might changing the pool size help? https://stackoverflow.com/questions/48411897/severe-performance-drop-with-mongodb-change-streams

johnml1135 commented 1 year ago

Interesting about PoolSize - here is an article on it - it appears to already be set to 100: https://www.mongodb.com/docs/drivers/csharp/v2.19/faq/#how-does-connection-pooling-work-in-the-.net-c--driver-
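
If the pool size did need to change, it can be set either in the connection string or on MongoClientSettings in the C# driver; the default maxPoolSize is 100, as the FAQ above notes. The connection string and the value 200 below are only illustrative:

using MongoDB.Driver;

// Option 1: connection string (value is illustrative).
var clientFromUrl = new MongoClient("mongodb://localhost:27017/?maxPoolSize=200");

// Option 2: MongoClientSettings.
var settings = MongoClientSettings.FromConnectionString("mongodb://localhost:27017");
settings.MaxConnectionPoolSize = 200;
var clientFromSettings = new MongoClient(settings);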

johnml1135 commented 1 year ago

Plan part A:

Plan part B:

ddaspit commented 1 year ago

Looking at the logs, I believe the culprit is the SubscribeForCancellation class. There are a number of issues with the class. The main issue is that it will continue to subscribe to the translation_engines collection indefinitely even after a job is finished.
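
As an illustration of the kind of fix this implies (hypothetical names; SubscribeForCancellation itself is not reproduced here): the change stream that watches for cancellation should be torn down as soon as the job completes, for example by tying the watch loop to a token that is cancelled when the job finishes, rather than leaving the subscription open indefinitely.

using System;
using System.Threading;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class CancellationWatcher
{
    // jobFinished should be cancelled by the caller when the job completes.
    public static async Task WatchForCancellationAsync(
        IMongoCollection<BsonDocument> engines,
        ObjectId engineId,
        CancellationToken jobFinished,
        Action onCancelled)
    {
        var pipeline = new EmptyPipelineDefinition<ChangeStreamDocument<BsonDocument>>()
            .Match(new BsonDocument("documentKey._id", engineId));

        using var cursor = await engines.WatchAsync(pipeline, cancellationToken: jobFinished);
        try
        {
            // The loop (and the server-side aggregation behind it) ends as soon
            // as jobFinished is signalled, instead of staying subscribed forever.
            while (await cursor.MoveNextAsync(jobFinished))
            {
                foreach (var change in cursor.Current)
                {
                    if (change.OperationType == ChangeStreamOperationType.Delete)
                        onCancelled();
                }
            }
        }
        catch (OperationCanceledException)
        {
            // Job finished; stop watching.
        }
    }
}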

johnml1135 commented 1 year ago

The first fix is implemented (thanks, Damien, for the find).

johnml1135 commented 1 year ago

Here is an implementation plan: