mozilla-services / updatebot

Automation for updating third party libraries for Firefox
Mozilla Public License 2.0

Relinquishing a job with an unexpected create status creates an inconsistency in relinquished versions #368

Closed tomrittervg closed 1 month ago

tomrittervg commented 3 months ago

52da7211c9f812520cfb1f59b9fd8793df74cd46 which fixed #346 contains a bug:

  1. Make a job, everything's fine. It's the most recent job so it's not relinquished.
  2. Make a job, something goes wrong, it winds up in unexpected_created_status
  3. Run a job, notice the busted job with unexpected_created_status - relinquish it.

Now the most recent job is the relinquished one, and the only non-relinquished job is an older one. That breaks our invariant.

I think the solution is to extract the code that handles the most recent job and call it both where it runs now and when handling the weird job.
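For illustration, the invariant described above can be sketched as a small check. This is a minimal sketch and not updatebot's actual code: the `Job` dataclass, the `invariant_violation` helper, the use of the highest `id` as "most recent", and the choice to allow zero non-relinquished jobs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    # Hypothetical, simplified stand-in for a job row.
    id: int
    library: str
    relinquished: bool = False


def invariant_violation(jobs: List[Job]) -> Optional[str]:
    """Return a description of the violation, or None if the invariant holds.

    Assumed invariant: for one library, the only job that may be
    non-relinquished is the most recent one (here: the highest id).
    """
    if not jobs:
        return None
    most_recent = max(jobs, key=lambda j: j.id)
    active = [j for j in jobs if not j.relinquished]
    # Assumption: having no active job at all is acceptable.
    if not active or active == [most_recent]:
        return None
    active_ids = [j.id for j in active]
    return (
        f"Most recent job is {most_recent.id}, "
        f"but non-relinquished jobs are {active_ids}"
    )
```

Replaying steps 1 to 3 with this sketch: two jobs for the same library, where only the newer one has `relinquished=True`, make `invariant_violation` return a message instead of `None`.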

tomrittervg commented 3 months ago

The other important task is to figure out why we didn't have a test for this.

sentry-io[bot] commented 3 months ago

Sentry Issue: UPDATEBOT-PROD-1H

mozfreddyb commented 2 months ago

Looks like this appeared again? This time with the following details:

Most Recent Job is 445, we have 1 non-relinquished jobs ([<Job id: 404 library: perfetto>]), and they don't match.

mozfreddyb commented 1 month ago

@maltejur You helped look into #369, but this started occurring again. Do you know if this needs to be fixed in the DB, or if we can fix it ourselves with a patch?

maltejur commented 1 month ago

This only seems to be a single failure with the library perfetto. And from looking into the logs, I think this particular failure still is fallout from https://github.com/mozilla-services/updatebot/issues/367.

What happened is that updatebot created a new job (445) because perfetto was updated to v45.0. But directly after creating the new job, and before it could relinquish the old (done but not relinquished) job (404), https://github.com/mozilla-services/updatebot/issues/367 happened (you can see that at the end of the log for that job). This failure also meant the new job (445) was then marked as relinquished, while the old job (404) was still not relinquished. This only became a problem recently, when perfetto had another update to v46.0. Now, every time updatebot wants to start a new job to update perfetto, it finds the database in an inconsistent state where the most recent job is relinquished, but the previous one isn't.

I am not sure what code changes we could make to prevent this from happening in the future if there is a failure between creating a new job and relinquishing the old one. But for this instance, I think we should just mark the old job (404) as relinquished in the database.
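As one possible direction only, here is a sketch of making "create the new job" and "relinquish the old one" a single database transaction, so a failure in between leaves neither change behind. It uses an in-memory `sqlite3` database as a stand-in. The table layout, `make_demo_db`, `create_job_and_relinquish_previous`, and the `fail_midway` switch are hypothetical, and it assumes nothing external (Bugzilla, try pushes) has to happen between the two writes, which may well not hold for updatebot. Note that the rollback here discards the new job, which differs from what happened in production, where job 445 was kept and marked as relinquished.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    # Hypothetical, simplified schema; not updatebot's real one.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " library TEXT NOT NULL,"
        " version TEXT NOT NULL,"
        " relinquished INTEGER NOT NULL DEFAULT 0)"
    )
    # One finished but not yet relinquished job, like job 404 above.
    conn.execute(
        "INSERT INTO jobs (id, library, version, relinquished)"
        " VALUES (404, 'perfetto', 'old', 0)"
    )
    conn.commit()
    return conn


def create_job_and_relinquish_previous(conn, library, version, fail_midway=False):
    # "with conn" commits if the block succeeds and rolls back on exception.
    with conn:
        cur = conn.execute(
            "INSERT INTO jobs (library, version, relinquished) VALUES (?, ?, 0)",
            (library, version),
        )
        new_id = cur.lastrowid
        if fail_midway:
            # Simulates a crash between the two writes.
            raise RuntimeError("simulated failure after creating the job")
        conn.execute(
            "UPDATE jobs SET relinquished = 1"
            " WHERE library = ? AND id <> ? AND relinquished = 0",
            (library, new_id),
        )
    return new_id
```

With `fail_midway=True` the insert is rolled back and job 404 remains the only, non-relinquished row. A successful call leaves 404 relinquished and the new job active.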

@mozfreddyb do you also have database access, or do we need to wait for Tom to get back for that?

maltejur commented 1 month ago

For reference, we ran the following just now:

UPDATE updatebot.jobs
    SET relinquished=1
    WHERE id=404;

I'll wait until the next updatebot run and then hopefully close this issue if updatebot correctly creates a new job for perfetto v46.0.

maltejur commented 1 month ago

A new job seems to have been created for perfetto in https://bugzilla.mozilla.org/show_bug.cgi?id=1907314, so I'll close this bug. There does seem to be an unrelated issue with perfetto in that job though, which I am going to open a follow-up issue about.

EDIT: The new issue just seems to be a broken patch, which will need to be fixed by the perf team and not us.