Improve logging GitHub push errors

webknjaz commented 3 years ago

Today the bot exposed to us a rare behavior reporting that it didn't have access to the target repo. This happened as a result of encountering some issue during git push — the corresponding code block reports a "permission error" unconditionally without attempting to analyze what actually happened: https://github.com/sanitizers/patchback-github-app/blob/2c263f7/patchback/event_handlers.py#L165-L180. And it doesn't include any details in the report text (originally — for security reason because it's unsanitized and may leak an access token if implemented poorly).

So I want to document the need to improve the traceability by adding the sanitized command output to the reported error.

Also, we may need to implement some error analysis and if it is recoverable, implement retries with https://pypi.org/p/backoff.

Another thing to talk about is --force-with-lease. TL;DR we may be better off using just plain --force.

How is this relevant? It's hard to judge without the actual git push output but I seem to recall facing a weird race condition with --force-with-lease back when we were working on https://github.com/ansible-community/collection_migration.

I've attempted to re-analyze why it might've been happening and found a way to simulate it. Basically, it's achieved by

$ git commit --allow-empty -m 'Empty commit'
[branch-name fa821df7] Empty commit
$ for _ in {1..5}; do git push origin HEAD --force-with-lease &; done
[2] 2107671
[3] 2107672
[4] 2107675
[5] 2107677
[6] 2107684

Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), 850 bytes | 850.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), 850 bytes | 850.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), 850 bytes | 850.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), 850 bytes | 425.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), 850 bytes | 425.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
To ssh://github.com/user/repo.git
 ! [remote rejected]   branch-name -> branch-name (cannot lock ref 'refs/heads/branch-name': is at fa821df7897ae866da9e72b28c404cdfcc92760e but expected b5f3ae57debb9ef018aedc92dd0a3704a1d06fa7)
error: failed to push some refs to 'ssh://github.com/user/repo.git'

[6]  + 2107684 exit 1     git push origin HEAD --force-with-lease
To ssh://github.com/user/repo.git
 ! [remote rejected]   branch-name -> branch-name (cannot lock ref 'refs/heads/branch-name': is at fa821df7897ae866da9e72b28c404cdfcc92760e but expected b5f3ae57debb9ef018aedc92dd0a3704a1d06fa7)
error: failed to push some refs to 'ssh://github.com/user/repo.git'
[2]    2107671 exit 1     git push origin HEAD --force-with-lease

 To ssh://github.com/user/repo.git
 ! [remote rejected]   branch-name -> branch-name (cannot lock ref 'refs/heads/branch-name': is at fa821df7897ae866da9e72b28c404cdfcc92760e but expected b5f3ae57debb9ef018aedc92dd0a3704a1d06fa7)
error: failed to push some refs to 'ssh://github.com/user/repo.git'

[3]    2107672 exit 1     git push origin HEAD --force-with-lease

 To ssh://github.com/user/repo.git
 ! [remote rejected]   branch-name -> branch-name (cannot lock ref 'refs/heads/branch-name': is at fa821df7897ae866da9e72b28c404cdfcc92760e but expected b5f3ae57debb9ef018aedc92dd0a3704a1d06fa7)
error: failed to push some refs to 'ssh://github.com/user/repo.git'

[4]  - 2107675 exit 1     git push origin HEAD --force-with-lease

 To ssh://github.com/user/repo.git
   b5f3ae57..fa821df7  branch-name -> branch-name

[5]  + 2107677 done       git push origin HEAD --force-with-lease

So the difference between --force and --force-with-lease that is of interest to us is the fact that the latter attempts to lock the remote ref to make sure that the remote branch replaced is the one the local repo knows about and the push won't unintentionally rewrite the branch contents somebody else pushed to since the last git fetch origin.

I think that maybe it's what's happened in the morning — probably a race condition may have been triggered by adding and removing+readding the same label on the same PR at close moments in time. And so one of the invocations of the event handler pushed (created) a backport branch, reported the success and the second one attempted to do the same but failed because of the inability to acquire the branch lock and then reported the failed status on the same check-run.

Refs:

webknjaz commented 3 years ago

cc @gundalow @felixfontein :point_up:

webknjaz commented 1 year ago

Got a case of error: rev-list died of signal 9. Looks like it's probably bad network and pushing might need retries… Re-labeling succeeded in that case.

INFO:patchback.event_handlers:PR#103 got labeled with "backport-2.15". It needs to be backported to stable-2.15
--

INFO:patchback.event_handlers:PR#103 got labeled with "backport-2.15". It needs to be backported to stable-2.15
INFO:patchback.event_handlers:PR#103 merge commit: 5fe565bb950288324c454285f6140f5311c2b83c
INFO:patchback.github_reporter:Checks API is available
INFO:patchback.event_handlers:Created a temporary dir: `/tmp/ansible--ansible-documentation---stable-2.15---62dcydlm---PR-103.git`
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint: 
hint:   git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint:   git branch -m <name>
Initialized empty Git repository in /tmp/ansible--ansible-documentation---stable-2.15---62dcydlm---PR-103.git/.git/
INFO:octomachinery.app.routing.webhooks_dispatcher:Got a valid X-GitHub-Event=pull_request with X-GitHub-Delivery=9120f040-21bb-41ee-8940-466b28d07b39 and X-Hub-Signature=sha1=d42f700102854d76308b62a3e9b9f4994a4947c3
INFO:aiohttp.access:10.131.10.1 [13/Jul/2023:20:26:38 +0000] "POST / HTTP/1.1" 200 30508 "-" "GitHub-Hookshot/bfdfc6c"
error: rev-list died of signal 9
error: https://github.com/ansible/ansible-documentation.git did not send all necessary objects

Ref: https://github.com/ansible/ansible-documentation/pull/103#issuecomment-1635960774.

webknjaz commented 1 year ago

Looks like it's probably bad network and pushing might need retries… Re-labeling succeeded in that case.

https://stackoverflow.com/q/28273123/595220 suggests that since there's no remote: prefix, it's a local thing. It also makes a case for it being OOM-caused. It also hints that this may be caused when the disk space runs out...

All the resource consumption charts (CPU, RAM, network in and out) spike (smoothly) about 10 minutes prior to the incident.

webknjaz commented 1 year ago

https://odoo-community.org/groups/contributors-15/contributors-236103 also suggests that this may happen if a container was allocated too little memory.

sanitizers / patchback-github-app

Improve logging GitHub push errors #18