pyTooling / Actions

Reusable steps and workflows for GitHub Actions
https://pytooling.github.io/Actions/

[Releaser] Implement a retry mechanism to survive random failures caused by GitHub #48

Open kikmon opened 2 years ago

kikmon commented 2 years ago

I'm getting some random failures when publishing a package, and a retry fixes the issue. This is causing a lot of noise in the pipeline, so would it be possible to add some retry policy to the releaser action?

The action call is very simple: it pushes 3 small (2.5 MB) zip files.

Here's the kind of error I'm getting:

Post "https://uploads.github.com/repos/kikmon/huc/releases/67601582/assets?label=&name=huc-2022-05-31-Darwin.zip": http2: client connection force closed via ClientConn.Close
Traceback (most recent call last):
  File "/releaser.py", line 187, in check_call(cmd, env=env)
  File "/usr/local/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)

Paebbels commented 2 years ago

@kikmon the link doesn't work. Can you post your pipeline/job-log link?

The infrastructure of GitHub is not very stable. From time to time we see a lot of issues with their network.

I'll check with @umarcor how to solve this problem.

Paebbels commented 2 years ago

Oh, one question: is it about releasing to GitHub Release Pages or releasing to PyPI? Just to be specific.

kikmon commented 2 years ago

Sorry about the missing link. Here it is :) https://github.com/kikmon/huc/runs/6663199807?check_suite_focus=true

It is just a simple GitHub release, no PyPI involved here.

Retries at the YAML level don't seem to be natively supported, so it would be really appreciated if the releaser action could be more resilient to infrastructure glitches :)
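For reference, a crude workflow-level retry can be hand-rolled by wrapping a plain GitHub CLI upload in a shell loop, bypassing the releaser action entirely. A rough sketch (the tag name and file pattern are placeholders):

```yaml
- name: Upload assets with a crude retry (workaround sketch, not the releaser action)
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    for attempt in 1 2 3; do
      # "nightly" and "huc-*.zip" are placeholders for the real tag and assets.
      if gh release upload nightly huc-*.zip --clobber --repo "${{ github.repository }}"; then
        exit 0
      fi
      echo "Upload failed (attempt ${attempt}); waiting 60 s before retrying ..."
      sleep 60
    done
    exit 1
```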

kikmon commented 2 years ago

And I just got another error:

Post "https://uploads.github.com/repos/kikmon/huc/releases/68184973/assets?label=&name=huc-2022-06-01-Darwin.zip": read tcp 172.17.0.2:46068->140.82.113.14:443: read: connection reset by peer
Traceback (most recent call last):

It seems weird to have these errors happen so frequently. Could it be related to adding multiple zip files to the release?

umarcor commented 2 years ago

@kikmon, @epsilon-0, this is an annoying issue that has been bugging us since this Action was created. At first, we used the GitHub API through PyGithub; it failed very frequently. Then we changed to using the GitHub CLI (https://github.com/pyTooling/Actions/commit/459faf880a921b35af298613cb89c30161815fc7). That reduced the frequency of failures, but they are still common. I believe it's due to the stability/reliability of the free infrastructure provided by GitHub: I find that small files rarely fail, but larger ones, which need to keep the connection alive for longer, are tricky.

A few months ago, GitHub added the ability to restart individual jobs of a CI run. Hence, the strategy I've been following is to upload all the "assets" as artifacts and then have a last job in the workflow which just picks them up and pushes them to the release through the releaser. When a failure occurs, only that last job needs to be restarted.
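In workflow terms, that pattern looks roughly like the sketch below. The releaser reference (pyTooling/Actions/releaser@r0) and its token/tag/files inputs are assumptions here; check the repository's README for the exact interface and current version tag.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # ... build the packages ...
      - uses: actions/upload-artifact@v3
        with:
          name: packages
          path: "*.zip"

  release:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v3
        with:
          name: packages
      # If the upload hits a GitHub hiccup, only this job has to be re-run.
      - uses: pyTooling/Actions/releaser@r0
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          tag: nightly
          files: "*.zip"
```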

Nonetheless, I of course want to improve the reliability of the releaser Action. I think a retry won't always work. Precisely because of the feature I explained in the previous paragraph, I do manually restart the CI in https://github.com/ghdl/ghdl. Sometimes it works, but rather frequently it is not effective: the infrastructure is unreliable for some minutes/hours and I need to wait until later, or until the next day, to restart. As a result, when implementing a retry strategy, we should consider that retrying multiple times within a few minutes might be worthless. Instead, large wait times should be implemented. That can be complex, because workflows might be running up against the 6 h job limit, so there might not be time to wait until the API is stable again. We can either:

kikmon commented 2 years ago

Thanks for the explanations. I think retrying a few times would be better than no retry at all, without going all the way up to the 6-hour limit :) I'd be curious to see if it really helps; manually babysitting a flow is a bit annoying when trying to automate a pipeline :)

There are many retry strategies that could be used, but what about exposing a few simple options, like the number of retries or the maximum amount of time to wait before failing for real? As for my case, the pipeline already does what you describe: the releaser job only fetches the artifacts from the previous jobs and then calls the releaser. I've been exploring the wretry action, but it doesn't play nicely with the Releaser action syntax.
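As a sketch of what those two knobs could look like inside releaser.py, a retry wrapper around the failing check_call might be written along these lines (the names, defaults and delays are illustrative placeholders, not the project's actual API):

```python
import subprocess
import time


def check_call_with_retry(cmd, env=None, retries=5, max_wait=900):
    """Run `cmd`, retrying on failure with an exponentially growing delay.

    `retries` and `max_wait` (seconds) are the two proposed knobs; the
    defaults here are placeholders, not project decisions.  The total wait
    has to stay well below the 6-hour job limit mentioned above.
    """
    delay = 30
    for attempt in range(1, retries + 1):
        try:
            subprocess.check_call(cmd, env=env)
            return
        except subprocess.CalledProcessError:
            if attempt == retries:
                raise  # out of attempts: surface the original failure
            print(f"Attempt {attempt}/{retries} failed; retrying in {delay} s ...")
            time.sleep(delay)
            delay = min(delay * 2, max_wait)
```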

Samrose-Ahmed commented 2 years ago

Seeing the following in the action:

Post "https://uploads.github.com/repos/matanolabs/matano/releases/75878670/assets?label=&name=matano-macos-x64.sh": unexpected EOF

Example of failed build.

Paebbels commented 2 months ago

I'm open to accepting pull requests.

Please also see #82.