Closed: Mark-Simulacrum closed this 2 years ago
Locally, the failure seems to be due to OOM (compilation uses ~54 GB on stable); it seems to be fixed on nightly, though not on beta.
I think I know why there's not more information: I was looking at the crater server, but I should've been looking at agent logs. I'll try to track down the cause of the failure and actually fix it.
@bors r=pietroalbini
:pushpin: Commit 2b49049eb99c91ff601dcf4414b45c3473dabcd8 has been approved by pietroalbini
:hourglass: Testing commit 2b49049eb99c91ff601dcf4414b45c3473dabcd8 with merge 7f007881ae6556c3b7823ae05773e105002dc558...
@bors r=pietroalbini
:pushpin: Commit 0abcb8c79d49422075c1832debc9e4bc26608cb4 has been approved by pietroalbini
:hourglass: Testing commit 0abcb8c79d49422075c1832debc9e4bc26608cb4 with merge 5ffcaa41f039ecd1eb91df783648c757135c5191...
:broken_heart: Test failed - checks-actions
@bors r+
:pushpin: Commit 139d0942f730cb11b5daecf6250fccab0ce3e7e9 has been approved by Mark-Simulacrum
:hourglass: Testing commit 139d0942f730cb11b5daecf6250fccab0ce3e7e9 with merge d7b9fb2ebaec06658c1a6ad0ee8eb2cdb463220d...
:broken_heart: Test failed - checks-actions
@bors r+
:pushpin: Commit 08180834ef6e3cb72d4ba4018a3a46133523d63d has been approved by Mark-Simulacrum
:hourglass: Testing commit 08180834ef6e3cb72d4ba4018a3a46133523d63d with merge 64bc28fb41f554df62b9cea39379eb525f5b5d2d...
:sunny: Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing 64bc28fb41f554df62b9cea39379eb525f5b5d2d to master...
Currently, if an individual agent reports an error during execution (e.g., docker is not running, or one of its worker threads exited with an error), the job is marked as failed in its entirety. This is particularly problematic because we currently have a transient agent (crater-gcp-1) that is sometimes down due to being a spot instance, which means it can be hard to complete a Crater run if the crater-gcp-1 instance keeps killing jobs midway through.
It should be noted that, in theory, these errors shouldn't happen in the first place. In practice, "docker is not running" appears to be the primary cause of failure, and it is relatively hard to investigate: logs for the relevant time period appear to be absent. This PR restructures the code that detects Docker's absence so that, instead of failing, it spins until Docker is up (sketched below). Reorganizing the crater code to handle worker failure well, likely by reassigning jobs to a live worker, looks considerably more difficult, though it is probably the better long-term solution.
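For illustration, here is a minimal sketch of the spin-until-up approach. The `docker_running` probe (shelling out to `docker info`) and the 30-second retry interval are assumptions for the example, not the actual detection code or timing in crater:

```rust
use std::process::Command;
use std::thread;
use std::time::Duration;

/// Hypothetical probe: returns true if the Docker daemon responds.
/// Shells out to `docker info` purely for illustration; crater's real
/// detection code may check differently.
fn docker_running() -> bool {
    Command::new("docker")
        .arg("info")
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

/// Instead of failing the whole job when Docker is down, spin with a
/// delay between probes until the daemon is reachable again.
fn wait_for_docker() {
    while !docker_running() {
        eprintln!("docker does not appear to be running; retrying in 30s...");
        thread::sleep(Duration::from_secs(30));
    }
}
```

The upside of spinning is that a spot-instance agent's job survives a Docker restart instead of failing outright; the trade-off is that a permanently broken agent will wait indefinitely rather than surfacing an error.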