project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

System monitoring to alert when ML takes longer than typical to respond to requests (from 630) #13

Open gigamorph opened 8 months ago

gigamorph commented 8 months ago

@guoxinji,

We would like to know when an ML cluster's response time exceeds an established threshold. I think that could be a good way to detect cases where something has gone wrong: the ML processes are still running but have entered an unhealthy state. That condition may not be enough to trip the AWS auto scaling check, and tripping that check is not necessarily what we want anyway, as there could be legitimate, sustained load (e.g., QA's performance test). If we were to monitor query response time, the monitored query could be a relatively simple search that typically returns quickly, regardless of cache status. If it typically returned within a second, we could set the threshold to a value we believe or know it exceeds once the cluster is in an unhealthy state, and require it to fail n times before sending the alert.
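A minimal sketch of that canary-query idea, assuming a MarkLogic REST API app server is reachable over HTTP with digest credentials. The host, port, query, threshold, failure count, and probe interval below are all placeholders to be tuned, not values from this ticket.

```python
# Hypothetical canary-query probe: time a simple search and alert only after
# n consecutive slow or failed responses. All constants are assumptions.
import time
import requests
from requests.auth import HTTPDigestAuth

ML_HOST = "localhost"          # assumption: probe runs near the cluster
ML_PORT = 8000                 # assumption: REST API app server port
CANARY_QUERY = "mona lisa"     # assumption: a simple search that is normally fast
THRESHOLD_SECONDS = 3.0        # e.g., 2-3x the typical warm response time
CONSECUTIVE_FAILURES = 3       # require n slow responses before alerting
PROBE_INTERVAL_SECONDS = 60

def probe_once(auth):
    """Run the canary search once and return its elapsed wall-clock time."""
    start = time.monotonic()
    resp = requests.get(
        f"http://{ML_HOST}:{ML_PORT}/v1/search",
        params={"q": CANARY_QUERY, "pageLength": 1},
        auth=auth,
        timeout=THRESHOLD_SECONDS * 10,  # hard cap so the probe itself never hangs
    )
    resp.raise_for_status()
    return time.monotonic() - start

def main():
    auth = HTTPDigestAuth("monitor-user", "monitor-password")  # placeholder credentials
    slow_count = 0
    while True:
        try:
            elapsed = probe_once(auth)
            slow_count = slow_count + 1 if elapsed > THRESHOLD_SECONDS else 0
        except requests.RequestException:
            slow_count += 1  # treat errors and timeouts as failures too
        if slow_count >= CONSECUTIVE_FAILURES:
            print("ALERT: canary query slow or failing; cluster may be unhealthy")
            slow_count = 0  # or hand off to the existing alerting channel
        time.sleep(PROBE_INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```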

But this was just the first idea that popped into my head. Perhaps there are existing tests that could be tweaked or extended instead? We could also review https://git.yale.edu/lux-its/marklogic/blob/main/docs/lux-backend-system-monitoring.md#distilled-advise.

On 24 Oct 22, we believe SBX entered an unhealthy state, and stayed in it until the ML processes were restarted the following morning. Just before that restart, Rob stated "Even simple queries aren't loading, either in query console or via the front end." I scanned the data available from ML's monitoring app for the previous 24 hours. Memory utilization stood out: [screenshot: memory utilization chart from ML's monitoring app]

We elected not to dive into the logs.

But we are interested in how often this may be happening, especially while we're using ML 11 EA builds.

cc: @jac237, @prowns, @rs2668

xinjianguo commented 6 months ago

If a query takes longer than usual but eventually returns, monitoring the RequestLog should be enough.

If a query never returns, the cluster must be in an unhealthy state; CPU usage, memory usage, or an access ping should trigger an alert.
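For the first case, a rough sketch of scanning a request log for slow-but-completed requests is below. It assumes request logging is enabled on the app server and that each entry is a JSON line with an elapsed-time field; the file path and the "elapsedTime" field name are placeholders to verify against the actual log format before relying on this.

```python
# Hypothetical RequestLog scan for completed-but-slow requests.
import json

REQUEST_LOG = "/var/opt/MarkLogic/Logs/8000_RequestLog.txt"  # placeholder path
SLOW_SECONDS = 5.0                                           # placeholder threshold

def slow_requests(path, threshold):
    """Yield (elapsed, entry) for logged requests slower than the threshold."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip any non-JSON lines
            elapsed = float(entry.get("elapsedTime", 0))  # field name is an assumption
            if elapsed > threshold:
                yield elapsed, entry

if __name__ == "__main__":
    for elapsed, entry in slow_requests(REQUEST_LOG, SLOW_SECONDS):
        print(f"{elapsed:.1f}s  {entry.get('url', '?')}")
```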

brent-hartwig commented 6 months ago

I believe there are two more scenarios:

  1. The original intent of this ticket was to give us an early heads-up that something may be going awry:
    • The monitoring tests for long-running requests alert on completed requests that take longer than 99 seconds.
    • Our app servers are configured to time out after 59 seconds.
    • Search requests are configured to time out after 20 seconds.
    • Many, if not most, search requests are sub-second.
    • If a search request that is known to return within x (milli)seconds cold or y (milli)seconds warm starts to take 2x or 3x as long, that could be when we want to start investigating. This is the scenario I submitted this ticket for, and I believe it can alert us to a variety of possible problems.
  2. We have witnessed "zombie" requests that can run for hours or days without consuming a noticeable amount of system resources. Even so, if such a request is holding a thread, we want it released. These have loosely been associated with V8 engine crashes, and for those we have the broader monitoring test for "Hung" in ErrorLog.txt. In the future, if we happen upon zombie requests that weren't picked up by any of our monitoring tests, we can consider a test that asks the app servers for requests that have run longer than we deem acceptable (see the sketch after this list).
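One possible shape for that future test: ask the Management API (port 8002 by default) for currently running requests and flag any older than an acceptable limit. The `/manage/v2/requests` endpoint exists, but the JSON shape and field names below ("request-default-list", "list-item", "seconds") are assumptions to confirm against a real response before using this.

```python
# Hypothetical long-running-request check against the Management API.
import requests
from requests.auth import HTTPDigestAuth

MANAGE_URL = "http://localhost:8002/manage/v2/requests"  # placeholder host
MAX_ACCEPTABLE_SECONDS = 600                             # placeholder threshold

def long_running_requests(auth, max_seconds):
    """Yield current requests whose elapsed time exceeds max_seconds."""
    resp = requests.get(MANAGE_URL, params={"format": "json"}, auth=auth, timeout=30)
    resp.raise_for_status()
    body = resp.json()
    # Defensive navigation; adjust keys once the real payload is known.
    items = (
        body.get("request-default-list", {})
            .get("list-items", {})
            .get("list-item", [])
    )
    for item in items:
        elapsed = float(item.get("seconds", 0))  # field name is an assumption
        if elapsed > max_seconds:
            yield item

if __name__ == "__main__":
    auth = HTTPDigestAuth("monitor-user", "monitor-password")  # placeholder credentials
    for req in long_running_requests(auth, MAX_ACCEPTABLE_SECONDS):
        print(req)
```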
roamye commented 2 months ago

UAT 8/26:

@xinjianguo & @brent-hartwig - do you two have time to meet to sort out what else needs to be done?

There needs to be a clear decision on what we are moving forward with. Moving this back to forming until you two have had that discussion.

brent-hartwig commented 2 months ago

@roamye, I think the first question is whether we want to pursue this. I believe there is value, but we've been fine without it for over six months. @jffcamp and @prowns should probably weigh in.

roamye commented 2 months ago

@jffcamp / @prowns - per Brent's comment above, is this something we would like to pursue?

If we do not want to, should we propose closing this?

roamye commented 1 month ago

Added this to UAT 9/12 to decide whether to pursue it.

UAT 9/12: Moving to future (for now). If this happens again, we can move this back to forming and figure out the performance test impact/cost.

@brent-hartwig