twitter / scoot

Scoot is a distributed task runner, supporting both a proprietary API and Bazel's Remote Execution.
Apache License 2.0

Add ability to mark worker unhealthy in case of persistent high memory & add monitoring metrics #591

Closed harshita-chaudhary closed 2 years ago

harshita-chaudhary commented 2 years ago

Problem

Workers can get into a bad state in which memory consumption stays above the soft cap threshold (for example, because of persistent processes or memory leaks from previous runs). Tasks scheduled on such a worker fail with the memory cap exceeded exception as soon as they start running. Fixing these workers requires manual intervention: the worker has to be killed so that it is reassigned to a fresh, uncorrupted host.

Solution

Add the ability to mark a worker's status as unhealthy if memory consumption is observed to be above the threshold as soon as a task starts, and allow the scheduler to query whether a worker is unhealthy. The scheduler marks unhealthy workers as suspended to prevent further tasks from being scheduled on the corrupt host. Unhealthy, suspended worker nodes are also marked lost by the scheduler, since we don't yet have a way to recover an unhealthy node. The worker health check API (/health) can be used to identify unhealthy workers and set up an automatic restart mechanism. Also, record metrics for better monitoring and tracking of memory-related errors.
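
To make the flow above concrete, here is a minimal sketch in Go. It is not the code from this PR: the names WorkerHealth, CheckAtTaskStart, StatusCode, ClusterState, ApplyWorkerStatus and Schedulable are all hypothetical. The 200/503 response codes, the JSON body, and the choice to keep the unhealthy flag sticky are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"net/http"
	"sort"
	"sync"
)

// WorkerHealth is a hypothetical worker-side health flag.
type WorkerHealth struct {
	mu        sync.Mutex
	unhealthy bool
}

// CheckAtTaskStart compares memory usage sampled at task start against the
// soft cap. If usage is already above the cap, the worker is marked unhealthy.
// The flag is sticky (an assumption), mirroring the note that there is no
// recovery path yet. It returns true if the worker may run the task.
func (h *WorkerHealth) CheckAtTaskStart(usedBytes, softCapBytes uint64) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if usedBytes > softCapBytes {
		h.unhealthy = true
	}
	return !h.unhealthy
}

// Unhealthy is what the scheduler would query (via the worker status API).
func (h *WorkerHealth) Unhealthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.unhealthy
}

// StatusCode maps the health flag to an HTTP status (assumed: 200 or 503).
func (h *WorkerHealth) StatusCode() int {
	if h.Unhealthy() {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// ServeHTTP lets WorkerHealth be mounted as the /health handler, so an
// external supervisor can poll it and restart the worker on failure.
func (h *WorkerHealth) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	code := h.StatusCode()
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(map[string]bool{"unhealthy": code != http.StatusOK})
}

// nodeState is a hypothetical scheduler-side view of one worker node.
type nodeState struct {
	suspended bool
	lost      bool
}

// ClusterState is a hypothetical, simplified stand-in for the scheduler's
// bookkeeping of worker nodes.
type ClusterState struct {
	nodes map[string]*nodeState
}

// NewClusterState registers the given node IDs as healthy.
func NewClusterState(ids ...string) *ClusterState {
	c := &ClusterState{nodes: make(map[string]*nodeState)}
	for _, id := range ids {
		c.nodes[id] = &nodeState{}
	}
	return c
}

// ApplyWorkerStatus suspends an unhealthy node and also marks it lost,
// because an unhealthy node cannot currently be recovered.
func (c *ClusterState) ApplyWorkerStatus(id string, unhealthy bool) {
	n, ok := c.nodes[id]
	if !ok || !unhealthy {
		return
	}
	n.suspended = true
	n.lost = true
}

// Schedulable returns the sorted IDs of nodes that may still receive tasks.
func (c *ClusterState) Schedulable() []string {
	var ids []string
	for id, n := range c.nodes {
		if !n.suspended && !n.lost {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}
```

In a real worker, the handler could be mounted with http.Handle("/health", health) from the standard library, and the value passed to CheckAtTaskStart would come from whatever memory sampling the execer already performs.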

codecov-commenter commented 2 years ago

Codecov Report

Merging #591 (ea98861) into master (42e814f) will decrease coverage by 0.18%. The diff coverage is 50.90%.

:exclamation: Current head ea98861 differs from pull request most recent head 72b9910. Consider uploading reports for the commit 72b9910 to get more accurate results

@@            Coverage Diff             @@
##           master     #591      +/-   ##
==========================================
- Coverage   58.65%   58.46%   -0.19%     
==========================================
  Files         102      102              
  Lines        6651     6715      +64     
==========================================
+ Hits         3901     3926      +25     
- Misses       2422     2461      +39     
  Partials      328      328              
Impacted Files                           Coverage Δ
runner/status.go                         28.57% <0.00%> (-0.47%) :arrow_down:
scheduler/server/stateful_scheduler.go   63.95% <0.00%> (ø)
worker/starter/server.go                 0.00% <0.00%> (ø)
worker/starter/start_server.go           0.00% <0.00%> (ø)
runner/execer/os/execer.go               42.77% <47.36%> (-8.84%) :arrow_down:
runner/runners/queue.go                  76.62% <64.00%> (-0.65%) :arrow_down:
runner/runners/invoke.go                 76.76% <100.00%> (+1.68%) :arrow_up:
scheduler/server/cluster_state.go        82.77% <100.00%> (+0.42%) :arrow_up:
worker/domain/api.go                     80.99% <100.00%> (+0.15%) :arrow_up:
... and 3 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 42e814f...72b9910.