Closed Mark-Simulacrum closed 2 years ago
@bors r+
:pushpin: Commit 5f0f6ed0b2239166bb7c2c0132324a4e0751ba81 has been approved by Mark-Simulacrum
:hourglass: Testing commit 5f0f6ed0b2239166bb7c2c0132324a4e0751ba81 with merge 0b67e2c45b2c5dc7a611e29333d06a2a6845e096...
:broken_heart: Test failed - checks-actions
@bors r+
:pushpin: Commit 5442613437150d582ac62ef5df3f79bc8a0eaccc has been approved by Mark-Simulacrum
:hourglass: Testing commit 5442613437150d582ac62ef5df3f79bc8a0eaccc with merge 1bdaa7c5019b836f44bb78dbbd00846faac9df44...
:sunny: Test successful - checks-actions Approved by: Mark-Simulacrum Pushing 1bdaa7c5019b836f44bb78dbbd00846faac9df44 to master...
The worker's local task graph should be cleaned up by removing nodes after they succeed or fail, including the children of those nodes. We had a subtle edge case where if marking a child task as failed returned Err(...) -- for example due to the server being unavailable -- we would fail to actually remove the parent from the graph, instead early exiting (indeed, killing an entire worker thread).
This commit instead logs, but otherwise ignores, errors on marking tasks as failed on the server: if we don't succeed, we may be forced to rerun the task later as the server doesn't know of the failure. This failure should be transient though, so it should be OK if we end up rerunning the task as a result. I suspect that we don't 100% handle this situation ideally today -- for example, if a task always results in some failure on the server, we will loop indefinitely trying to run it. But I'm not sure why that would happen, and it seems like something that the server is better suited than worker nodes to handle (especially since worker nodes are ideally transient state-wise).
The server (likely after the recent mutex removal) will currently return 'database is busy' errors on some requests, which is a separate bug from the one being fixed here, but exacerbates the situations in which this bug arises.