Closed by harshita-chaudhary 2 years ago
Merging #591 (ea98861) into master (42e814f) will decrease coverage by 0.18%. The diff coverage is 50.90%.

:exclamation: Current head ea98861 differs from pull request most recent head 72b9910. Consider uploading reports for the commit 72b9910 to get more accurate results.
```diff
@@            Coverage Diff             @@
##           master     #591      +/-   ##
==========================================
- Coverage   58.65%   58.46%   -0.19%
==========================================
  Files         102      102
  Lines        6651     6715      +64
==========================================
+ Hits         3901     3926      +25
- Misses       2422     2461      +39
  Partials      328      328
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| runner/status.go | 28.57% <0.00%> (-0.47%) | :arrow_down: |
| scheduler/server/stateful_scheduler.go | 63.95% <0.00%> (ø) | |
| worker/starter/server.go | 0.00% <0.00%> (ø) | |
| worker/starter/start_server.go | 0.00% <0.00%> (ø) | |
| runner/execer/os/execer.go | 42.77% <47.36%> (-8.84%) | :arrow_down: |
| runner/runners/queue.go | 76.62% <64.00%> (-0.65%) | :arrow_down: |
| runner/runners/invoke.go | 76.76% <100.00%> (+1.68%) | :arrow_up: |
| scheduler/server/cluster_state.go | 82.77% <100.00%> (+0.42%) | :arrow_up: |
| worker/domain/api.go | 80.99% <100.00%> (+0.15%) | :arrow_up: |
| ... and 3 more | | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update 42e814f...72b9910.
Problem
Workers can get into bad states where they continuously have high memory consumption exceeding the soft cap threshold (e.g., due to persistent processes or memory leaks from previous runs). This causes tasks to fail with a memory-cap-exceeded exception as soon as they start running. Fixing these workers requires manual intervention to kill the worker and have it reassigned to a fresh, uncorrupted host.
Solution
Add the ability to mark a worker's status as unhealthy if memory consumption is observed to exceed the threshold as soon as a task starts, and allow the scheduler to query whether the worker is unhealthy. The scheduler marks unhealthy workers as suspended to prevent further tasks from being scheduled on the corrupt host. The unhealthy suspended worker nodes are also marked lost by the scheduler, since we don't yet have a way to recover an unhealthy node. The worker health check API (/health) can be used to identify unhealthy workers and set up an automatic restart mechanism. Also, record metrics for better monitoring and tracking of memory-related errors.