Ensure failed nodes are deleted from worker graph

Mark-Simulacrum commented 2 years ago

The worker's local task graph should be cleaned up by removing nodes, along with their children, once they succeed or fail. We had a subtle edge case: if marking a child task as failed returned Err(...), for example because the server was unavailable, we would exit early instead of removing the parent from the graph (and in fact kill the entire worker thread).

This commit instead logs, but otherwise ignores, errors from marking tasks as failed on the server. If the request doesn't succeed, the server never learns of the failure, so we may be forced to rerun the task later. That failure should be transient though, so it should be OK if we end up rerunning the task as a result. I suspect that we don't handle this situation ideally today: for example, if a task always results in some failure on the server, we will loop indefinitely trying to run it. But I'm not sure why that would happen, and it seems like something the server is better suited to handle than worker nodes (especially since worker nodes are ideally transient state-wise).
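To illustrate the shape of the change, here is a minimal sketch. It is not the actual worker code: `TaskGraph`, `add`, `fail_subtree`, and the `report` callback are hypothetical names standing in for the real graph and the server request.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the worker's local task graph.
#[derive(Default)]
struct TaskGraph {
    /// Parent id -> ids of its direct children.
    children: HashMap<u32, Vec<u32>>,
    /// Ids of all nodes still present in the graph.
    nodes: HashSet<u32>,
}

impl TaskGraph {
    /// Inserts a node, optionally registering it as a child of `parent`.
    fn add(&mut self, id: u32, parent: Option<u32>) {
        self.nodes.insert(id);
        if let Some(p) = parent {
            self.children.entry(p).or_default().push(id);
        }
    }

    /// Marks `id` and all of its descendants as failed, then removes them.
    ///
    /// `report` is a hypothetical callback that tells the server about a
    /// failed task. It may return an error, e.g. when the server is
    /// unavailable.
    fn fail_subtree<F>(&mut self, id: u32, report: &mut F)
    where
        F: FnMut(u32) -> Result<(), String>,
    {
        // Handle the children first so the whole subtree is cleaned up.
        let kids = self.children.remove(&id).unwrap_or_default();
        for child in kids {
            self.fail_subtree(child, &mut *report);
        }

        // Log and continue. Propagating the error here (e.g. with `?`)
        // would return early and leave the remaining nodes in the graph.
        if let Err(e) = report(id) {
            eprintln!("failed to mark task {} as failed: {}", id, e);
        }

        // Always runs, whether or not the server accepted the report.
        self.nodes.remove(&id);
    }
}
```

In this sketch, calling `fail_subtree` on a parent with a `report` callback that always returns an error still removes the parent and every descendant from `nodes`; unrelated nodes are left untouched.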

The server (likely after the recent mutex removal) currently returns 'database is busy' errors on some requests. That is a separate bug from the one being fixed here, but it makes this bug more likely to trigger.

Mark-Simulacrum commented 2 years ago

@bors r+