mozilla-services / updatebot

Automation for updating third party libraries for Firefox
Mozilla Public License 2.0
8 stars 5 forks source link

Better Error Handling when git history stuff falls down or we have an unexpected create status #346

Closed tomrittervg closed 4 months ago

tomrittervg commented 6 months ago
CompletedProcess(args=['git', 'diff', '--name-status', 'c41afe41f0fb13b80ca79e987a21ea411e1545a7^', 'c41afe41f0fb13b80ca79e987a21ea411e1545a7'], returncode=128, stdout=b'', stderr=b"fatal: ambiguous argument 'c41afe41f0fb13b80ca79e987a21ea411e1545a7^': unknown revision or path not in the working tree.\nUse '--' to separate paths from revisions, like this:\n'git <command> [<revision>...] -- [<file>...]'\n")

This led to raise Exception("Handled a Job with unexpected CREATED status in _process_existing_job") in the next job and then subsequently a bunch of assert len(non_relinquished_jobs) <= 1, "We got more than one non-relinquished job: %s" % non_relinquished_jobs

tomrittervg commented 6 months ago

So c41afe41f0fb13b80ca79e987a21ea411e1545a7 is very old commit in the opus repo from 2018.

Some logs:

check_for_update:
 Arguments: (<components.scmprovider.SCMProvider object at 0x7f439c8717d0>, {{ opus  Core  Audio/Video: Playback  c85499757c148fede8604cffa12454206b6138ba  https://gitlab.xiph.org/xiph/opus  False  regular  kinetik@flim.org  kinetik  None  None  None  media/libopus/moz.yaml [{vendoring True None [] [] every linux None None None }] }}, {vendoring True None [] [] every linux None None None }, 'f20575dd8632d371a51478ff55996d4ecc35ff1e', <Job id: 300 library: opus>) {}
Running ['git', 'merge-base', 'c85499757c148fede8604cffa12454206b6138ba', 'f20575dd8632d371a51478ff55996d4ecc35ff1e']
> 9fc8fc4cf432640f284113ba502ee027268b0d9f
git log --pretty=%H|%ai|%ci --no-merges 9fc8fc4cf432640f284113ba502ee027268b0d9f..f20575dd8632d371a51478ff55996d4ecc35ff1e
  1. new_version = f20575dd8632d371a51478ff55996d4ecc35ff1e
  2. most_recent_job = 300, which has sha=c85499757c148fede8604cffa12454206b6138ba
tomrittervg commented 6 months ago

Something seems screwy with the opus repo.

  1. If you git checkout c85499757c148fede8604cffa12454206b6138ba and then git log, and compare it to git checkout f20575dd8632d371a51478ff55996d4ecc35ff1e you'll find that f20575dd8632d371a51478ff55996d4ecc35ff1e has a lot of commits (going back to c41afe41f0fb13b80ca79e987a21ea411e1545a7 in 2018) that c85499757c148fede8604cffa12454206b6138ba does not have. That seems to be the reason why we were getting that very old commit showing up in some capacity...
  2. c85499757c148fede8604cffa12454206b6138ba is not in the log for git checkout f20575dd8632d371a51478ff55996d4ecc35ff1e.
  3. origin/master == c85499757c148fede8604cffa12454206b6138ba
  4. origin/main == 2582ca925988cd53edf0c4d6690dba2c66fde8e8 which is a few commits past f20575dd8632d371a51478ff55996d4ecc35ff1e which has the branch name 'opus-ng'

It appears like development was switching from the master branch to the main branch, and they do not have a very cohesive history together. main has some of the commits from master, but far from all of them.

So it seems like what happened is first we asked git for the first common ancestor of in-tree-version and new-version, call it merge-base. Then we asked git for all the commits between merge-base and new-version by saying git log --pretty=%H|%ai|%ci --no-merges 9fc8fc4cf432640f284113ba502ee027268b0d9f..f20575dd8632d371a51478ff55996d4ecc35ff1e. But that's not what that command means.

A special notation "<commit1>..<commit2>" can be used as a short-hand for "^<commit1> <commit2>". For example, either of the
       following may be used interchangeably:

           $ git log origin..HEAD
           $ git log HEAD ^origin

aka "list all the commits which are reachable from HEAD, but not from origin". That is not the same thing as "all the commits between origin and HEAD.

I'll not that continuing a close read of git log --help:

Another special notation is "<commit1>...<commit2>" which is useful for merges. The resulting set of commits is the symmetric
       difference between the two operands. The following two commands are equivalent:

           $ git log A B --not $(git merge-base --all A B)
           $ git log A...B

but AFAICT this distinction isn't important because they give the same result.

list all the commits which are reachable from f20575dd8632d371a51478ff55996d4ecc35ff1e, but not from 9fc8fc4cf432640f284113ba502ee027268b0d9f

c41afe41f0fb13b80ca79e987a21ea411e1545a7 is in the history (reachable) from f20575dd8632d371a51478ff55996d4ecc35ff1e; but not from our starting revision c85499757c148fede8604cffa12454206b6138ba.

The command that failed (git diff --name-status c41afe41f0fb13b80ca79e987a21ea411e1545a7^ c41afe41f0fb13b80ca79e987a21ea411e1545a7) said "Show me the diff between the parent of c41afe41f0fb13b80ca79e987a21ea411e1545a7 and c41afe41f0fb13b80ca79e987a21ea411e1545a7. This failed because c41afe41f0fb13b80ca79e987a21ea411e1545a7 has no parent. It's the initial commit of its branch (which was eventually merged into main; but never to master.

And because c41afe41f0fb13b80ca79e987a21ea411e1545a7 was never in master (but c85499757c148fede8604cffa12454206b6138ba was) it showed up in the 'give me all the commits reachable from this commit in main but not from this commit in master.

tomrittervg commented 6 months ago

I'm not sure what Updatebot could have done better here. In theory we could add a bunch of weird-case-handling to the git stuff, but that'd be horrible to test and god willing won't be needed again....

I would say a better plan would just to have better error handle and reporting, so we can something to the effect of "You should update to Revision X, but I couldn't do it, sorry."

tomrittervg commented 5 months ago

We had a similar situation happen with the 126 nightly bump which coincided with a huge increase in m-c from merging in mozilla-mobile. A cron job actually took more than an hour to run, resulting in a job timeout in the middle of processing a libdav1d job.

This resulted in an unexpected CREATED status for a job that never got fully processed. https://bugzilla.mozilla.org/show_bug.cgi?id=1885884 got filed, but never (a) had its bug id entered into the database (the job died write after filing the bug) and therefore (b) got processed further (attach patch, sent to try, etc).

Normally when Updatebot finds a bug with UNEXPECTED_CREATE_STATUS it will comment on the bug - but we didn't have the bug id in the database. Either way, updatebot sets the outcome to UNEXPECTED_CREATED_STATUS and status to DONE.

But I think a missing component is that we don't relinquish anything. And sure enough, after this (single) exception about the unexpected created status, we start getting exceptions: We got more than one non-relinquished job: [<Job id: 370 library: dav1d>, <Job id: 352.... (370 is the unexpected created status, and 352 is the job before that.)

Normally we relinquish the prior bug here right after we've filed a bug for the new update, and added the bug id to the database. (So we were really close to doing this for the dav1d situation above, but changing when we relinquish isn't the correct fix.)

We should also relinquish when we process an unexpected create status. We just have to make sure we do it in a way that doesn't violate any of our sanity assertions.

  1. Relinquishing the newer, unexpected_create job would violate the assertion that the one non-relinquished job is the most recent job.
  2. Relinquishing both jobs would violate the assert that if there is any prior job, we should have one non-relinquished job.

The only choice is to relinquish the older job, as we normally would, and leave the unexpected-create job as non-relinquished.

Or is it? It doesn't make sense to me that we would want to leave the unexpected job as non-relinquished because... it's messed up. I don't want Updatebot trying to go 'fix' things. I want it to leave that messed up bug alone and do nothing. So I think a better fix is to relax the second assertion (you're allowed to have prior jobs and only relinquished jobs) and then relinquish both prior jobs.

cc @mozfreddyb @maltejur

tomrittervg commented 5 months ago

We've got another issue with Opus, and jobs 346 and 360 - both of which are JOBOUTCOME.UNEXPECTED_CREATED_STATUS. I think I didn't debug this one for the entirity of the 125 nightly cycle, resulting in us compounding the issue and now it's extra messy.

VqDsgjlPSOOUES0FEZuBXg is the task where we created 360. It eventually failed like this:

[task 2024-02-29T00:16:27.216Z] [Info]    CommandProvider: Running ['git', 'diff', '--name-status', 'c41afe41f0fb13b80ca79e987a21ea411e1545a7^', 'c41afe41f0fb13b80ca79e987a21ea411e1545a7']
[task 2024-02-29T00:16:27.218Z] [Info]    CommandProvider: Return: 128 Runtime (s): 0
[task 2024-02-29T00:16:27.218Z] [Debug]   CommandProvider: -------
[task 2024-02-29T00:16:27.218Z] [Debug]   CommandProvider: stdout:
[task 2024-02-29T00:16:27.218Z] [Debug]   CommandProvider: 
[task 2024-02-29T00:16:27.218Z] [Debug]   CommandProvider: -------
[task 2024-02-29T00:16:27.218Z] [Debug]   CommandProvider: stderr:
[task 2024-02-29T00:16:27.218Z] [Debug]   CommandProvider: fatal: ambiguous argument 'c41afe41f0fb13b80ca79e987a21ea411e1545a7^': unknown revision or path not in the working tree.
[task 2024-02-29T00:16:27.218Z] Use '--' to separate paths from revisions, like this:
[task 2024-02-29T00:16:27.218Z] 'git <command> [<revision>...] -- [<file>...]'
[task 2024-02-29T00:16:27.218Z] 
[task 2024-02-29T00:16:27.218Z] [Debug]   CommandProvider: ----------------------------------------------
[task 2024-02-29T00:16:27.218Z] [Error]   CommandProvider: Expected a clean process return but got: 128
[task 2024-02-29T00:16:27.218Z] [Error]   CommandProvider:    ( git diff --name-status c41afe41f0fb13b80ca79e987a21ea411e1545a7^ c41afe41f0fb13b80ca79e987a21ea411e1545a7 )

This is what caused the prior job to fail also - which is the one we debugged in the original comment on this bug. So opus is now stuck because we're also trying to do something with revision c41afe41f0fb13b80ca79e987a21ea411e1545a7.