Closed Mark-Simulacrum closed 2 years ago
The code change looks good to me, r=me depending on what you want to do with the message below.
In the future, we will likely also want some tracking/reporting of these failures such that hard errors across all agents lead to alerting the infra team, but this may be best done through Prometheus metrics rather than GitHub "end-user" visible comments (which are likely not the best way to reach infra team members regardless).
To store the error messages (and avoid them being drowned in so many other log messages) we could use Sentry, as it has nice features like deduplication.
In the meantime, could you add a simple counter metric to count how many of those failures we detect (so that we can setup some prometheus alerts)?
Sentry feels like a pretty large lift. I think eliminating the "we received a report" log message is likely to have better effect -- those messages feel /generally/ pretty useless to me, comparatively, and would cut down a lot on the log noise. I can do that in a separate PR.
I'll add the counter metric -- maybe indexed by worker -- and then approve this.
@bors r=pietroalbini
@Mark-Simulacrum: :key: Insufficient privileges: Not in reviewers
(FWIW, the expectation is that agents don't fail -- some of my previous PRs went towards trying to make that more true. So I expect the amount of times this happens to be low; the intent of this PR is that when it does happen we don't make a big noise about it and can largely keep running if there's not a super high rate that means we have to investigate).
@bors r=pietroalbini
:pushpin: Commit d0b7ab9f31b67af5d2572244f40f649d540ba8b4 has been approved by pietroalbini
:hourglass: Testing commit d0b7ab9f31b67af5d2572244f40f649d540ba8b4 with merge 2e0a518e03edffabb515576ccd8348c6f34d8849...
:sunny: Test successful - checks-actions Approved by: pietroalbini Pushing 2e0a518e03edffabb515576ccd8348c6f34d8849 to master...
Previously, any agent reporting an error would cause an experiment to come to a hard stop. This is noisy (leads to a user-visible report) and, at least lately, not actually representative of a terminal failure. Instead, we take the approach of re-queueing any crates assigned to that agent and logging an error on the server host specifying which agent failed.
As the error message is now not typically posted publicly (and no longer needs to be short for being in a GitHub comment), we will also have more opportunity to add extra metadata over time to that message (in future PRs) enabling diagnosing and fixing these errors.
In the future, we will likely also want some tracking/reporting of these failures such that hard errors across all agents lead to alerting the infra team, but this may be best done through Prometheus metrics rather than GitHub "end-user" visible comments (which are likely not the best way to reach infra team members regardless).
r? @pietroalbini