Open webknjaz opened 3 years ago
cc @gundalow @felixfontein :point_up:
Got a case of error: rev-list died of signal 9
. Looks like it's probably bad network and pushing might need retries… Re-labeling succeeded in that case.
INFO:patchback.event_handlers:PR#103 got labeled with "backport-2.15". It needs to be backported to stable-2.15
--
INFO:patchback.event_handlers:PR#103 got labeled with "backport-2.15". It needs to be backported to stable-2.15
INFO:patchback.event_handlers:PR#103 merge commit: 5fe565bb950288324c454285f6140f5311c2b83c
INFO:patchback.github_reporter:Checks API is available
INFO:patchback.event_handlers:Created a temporary dir: `/tmp/ansible--ansible-documentation---stable-2.15---62dcydlm---PR-103.git`
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint:
hint: git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: git branch -m <name>
Initialized empty Git repository in /tmp/ansible--ansible-documentation---stable-2.15---62dcydlm---PR-103.git/.git/
INFO:octomachinery.app.routing.webhooks_dispatcher:Got a valid X-GitHub-Event=pull_request with X-GitHub-Delivery=9120f040-21bb-41ee-8940-466b28d07b39 and X-Hub-Signature=sha1=d42f700102854d76308b62a3e9b9f4994a4947c3
INFO:aiohttp.access:10.131.10.1 [13/Jul/2023:20:26:38 +0000] "POST / HTTP/1.1" 200 30508 "-" "GitHub-Hookshot/bfdfc6c"
error: rev-list died of signal 9
error: https://github.com/ansible/ansible-documentation.git did not send all necessary objects
Ref: https://github.com/ansible/ansible-documentation/pull/103#issuecomment-1635960774.
Looks like it's probably bad network and pushing might need retries… Re-labeling succeeded in that case.
https://stackoverflow.com/q/28273123/595220 suggests that since there's no remote:
prefix, it's a local thing. It also makes a case for it being OOM-caused. It also hints that this may be caused when the disk space runs out...
All the resource consumption charts (CPU, RAM, network in and out) spike (smoothly) about 10 minutes prior to the incident.
https://odoo-community.org/groups/contributors-15/contributors-236103 also suggests that this may happen if a container was allocated too little memory.
Today the bot exposed to us a rare behavior reporting that it didn't have access to the target repo. This happened as a result of encountering some issue during
git push
— the corresponding code block reports a "permission error" unconditionally without attempting to analyze what actually happened: https://github.com/sanitizers/patchback-github-app/blob/2c263f7/patchback/event_handlers.py#L165-L180. And it doesn't include any details in the report text (originally — for security reason because it's unsanitized and may leak an access token if implemented poorly).So I want to document the need to improve the traceability by adding the sanitized command output to the reported error.
Also, we may need to implement some error analysis and if it is recoverable, implement retries with https://pypi.org/p/backoff.
Another thing to talk about is
--force-with-lease
. TL;DR we may be better off using just plain--force
.How is this relevant? It's hard to judge without the actual
git push output
but I seem to recall facing a weird race condition with--force-with-lease
back when we were working on https://github.com/ansible-community/collection_migration.I've attempted to re-analyze why it might've been happening and found a way to simulate it. Basically, it's achieved by
So the difference between
--force
and--force-with-lease
that is of interest to us is the fact that the latter attempts to lock the remote ref to make sure that the remote branch replaced is the one the local repo knows about and the push won't unintentionally rewrite the branch contents somebody else pushed to since the lastgit fetch origin
.I think that maybe it's what's happened in the morning — probably a race condition may have been triggered by adding and removing+readding the same label on the same PR at close moments in time. And so one of the invocations of the event handler pushed (created) a backport branch, reported the success and the second one attempted to do the same but failed because of the inability to acquire the branch lock and then reported the failed status on the same check-run.
Refs: