gigamorph opened this issue 8 months ago
If a query takes longer than expected but eventually returns, monitoring the RequestLog should be enough.
If a query never returns, the cluster must be in an unhealthy state; CPU usage, memory usage, or an access ping should trigger an alert.
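As a rough illustration of the first scenario, here is a minimal sketch that scans a RequestLog-style JSON-lines file for slow-but-completed requests. The log path and the `elapsedTime` field name are assumptions to verify against the app server's actual request log output.

```python
import json
from pathlib import Path

# Assumptions: the request log is JSON-lines and each entry carries an
# elapsed-time field; verify both against the actual RequestLog format.
REQUEST_LOG = Path("/var/opt/MarkLogic/Logs/8003_RequestLog.txt")  # hypothetical path
SLOW_SECONDS = 5.0

def slow_requests(log_path: Path, threshold: float):
    """Yield log entries whose recorded elapsed time exceeds the threshold."""
    with log_path.open() as handle:
        for line in handle:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partial or non-JSON lines
            if not isinstance(entry, dict):
                continue
            if float(entry.get("elapsedTime", 0)) > threshold:
                yield entry

if __name__ == "__main__":
    for entry in slow_requests(REQUEST_LOG, SLOW_SECONDS):
        print(entry.get("time"), entry.get("url"), entry.get("elapsedTime"))
```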
I believe there are two more scenarios:
UAT 8/26:
@xinjianguo & @brent-hartwig - do you two have time to meet and sort out what else needs to be done?
There needs to be a clear decision on what we are moving forward with. Moving this back to forming until you two have discussed it.
@roamye, I think the first question is whether we want to pursue this. I believe there is value, but we've been fine without it for over six months. @jffcamp and @prowns should probably weigh in.
@jffcamp / @prowns - from Brent's comment above, is this something we would like to pursue?
If we do not want to, should we propose closing it?
Added this to UAT 9/12 to determine whether to pursue it.
UAT 9/12: Moving to future (for now). If this happens again, we can move it back to forming and figure out the performance test impact/cost.
@brent-hartwig
@guoxinji,
We would like to know when an ML cluster's query response time exceeds an established threshold. I think that could be a good way to detect when something may have gone wrong whereby the ML processes are still running but have potentially entered an unhealthy state. That state may not be enough to trip the AWS auto-scaling trigger, and scaling is not necessarily what we want anyway, as there could be legitimate, sustained load (e.g., QA's performance test).

If we were to monitor query response time, it could be a relatively simple search that typically returns quickly, regardless of cache status. If it typically returned within a second, we could set the threshold to a value we believe or know the query exceeds once the cluster is in an unhealthy state, and require it to fail n times before sending the alert.
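To make that concrete, here is a minimal sketch of the "n consecutive slow responses" idea, not tied to our infrastructure. The probe URL, threshold, check interval, and alert hook are all placeholders to be replaced with whatever endpoint and notification channel we actually use.

```python
import time
import requests  # assumes the probe hits an HTTP endpoint exposed by the app server

HEALTH_QUERY_URL = "https://example-ml-host:8003/health-probe"  # hypothetical endpoint
THRESHOLD_SECONDS = 5.0        # well above the ~1s typical response time
MAX_CONSECUTIVE_FAILURES = 3   # "fail n times before sending the alert"
CHECK_INTERVAL_SECONDS = 60

def send_alert(message: str) -> None:
    # Placeholder: wire this to SNS, PagerDuty, email, etc.
    print(f"ALERT: {message}")

def probe_once() -> bool:
    """Return True if the health query came back under the threshold."""
    start = time.monotonic()
    try:
        response = requests.get(HEALTH_QUERY_URL, timeout=THRESHOLD_SECONDS)
        response.raise_for_status()
    except requests.RequestException:
        return False  # a timeout or HTTP error counts as a failure
    return (time.monotonic() - start) <= THRESHOLD_SECONDS

def monitor() -> None:
    consecutive_failures = 0
    while True:
        if probe_once():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                send_alert(
                    f"Health query exceeded {THRESHOLD_SECONDS}s "
                    f"{consecutive_failures} times in a row"
                )
                consecutive_failures = 0  # avoid re-alerting every interval
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    monitor()
```

Requiring several consecutive failures keeps a single cache miss or transient blip from paging anyone, while a genuinely unhealthy cluster would keep breaching the threshold.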
But this was just the first idea that popped into my head. Perhaps there are existing tests that could be tweaked or extended instead? We could also review https://git.yale.edu/lux-its/marklogic/blob/main/docs/lux-backend-system-monitoring.md#distilled-advise.
On 24 Oct 22, we believe SBX entered an unhealthy state, and stayed in it until the ML processes were restarted the following morning. Just before that restart, Rob stated "Even simple queries aren't loading, either in query console or via the front end." I scanned the data available from ML's monitoring app for the previous 24 hours. Memory utilization stood out:
We elected not to dive into the logs.
But we are interested in how often this may be happening, especially while we're using ML 11 EA builds.
cc: @jac237, @prowns, @rs2668