openSUSE / zypper

World's most powerful command line package manager
http://en.opensuse.org/Portal:Zypper

Automatic retry with max retries and timeout in both interactive and non-interactive mode (to work around mirror problems) #420

Open okurz opened 2 years ago

okurz commented 2 years ago

Motivation

Follow-up to #177 and #312. Users often report zypper aborting on temporary problems: failing to download packages, failing to download metadata, or even failing to reach the hosts that supply packages. A retry usually helps, but users should not have to do it themselves. The problem looks transient to users when only individual mirrors misbehave: a retry may hit a mirror that is fine, or one where metadata and packages are in a different, usable state. Hence zypper should apply the retry internally.

Expected result

It would be great if both interactive and non-interactive mode offered automatic retries. Optionally on top: command line options such as --max-retries and --retry-wait-time (the time to wait between retries).
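
The requested behaviour can be approximated outside zypper. A minimal sketch, assuming a POSIX shell: `retry_with_wait` and its two parameters are hypothetical names that only mirror the proposed --max-retries and --retry-wait-time. They are not existing zypper options.

```shell
#!/bin/sh
# Hypothetical helper: retry_with_wait MAX_RETRIES WAIT_SECONDS command [args...]
# Runs the command once, then up to MAX_RETRIES more times on failure,
# sleeping WAIT_SECONDS between attempts. Returns the last exit status.
retry_with_wait() {
    max_retries=$1
    wait_seconds=$2
    shift 2
    attempt=0
    while :; do
        "$@" && return 0
        status=$?
        if [ "$attempt" -ge "$max_retries" ]; then
            echo "giving up after $((attempt + 1)) attempts (exit status $status)" >&2
            return "$status"
        fi
        attempt=$((attempt + 1))
        echo "attempt $attempt failed (exit status $status); retrying in ${wait_seconds}s" >&2
        sleep "$wait_seconds"
    done
}
```

Usage could look like `retry_with_wait 3 30 zypper --non-interactive refresh`, which allows three retries with 30 seconds between them.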

Further details

See for example the recent report https://bugzilla.opensuse.org/show_bug.cgi?id=1192435

andrii-suse commented 2 years ago

My understanding is that zypper does retry, and in most cases this works well (except for some odd situations like #399, or when the media.1/media file has an incorrect build number). The problem in https://bugzilla.opensuse.org/show_bug.cgi?id=1192435 was that the file had gone from all the mirrors returned by metalink, so a retry wouldn't have helped in that particular case. But it was kind of an exceptional situation related to the rework of publishing, so it shouldn't happen often.

okurz commented 2 years ago

But it was kind of an exceptional situation related to the rework of publishing, so it shouldn't happen often.

It could be that this particular case was special, but I have seen or heard about many other cases which, at a high level, all look comparable to me.

okurz commented 2 years ago

Observation

https://github.com/os-autoinst/openQA/runs/4506367796?check_suite_focus=true#step:3:337 shows

Retrieving repository 'Update repository with updates from SUSE Linux Enterprise 15' metadata [.error]
Repository 'Update repository with updates from SUSE Linux Enterprise 15' is invalid.
[repo-sle-update|http://download.opensuse.org/update/leap/15.3/sle/] Valid metadata not found at specified URL
History:
 - File './repodata/2ec0d1f23dc67a2387797d873e045a03a9b89f0db8c421f59a4b5678fdbc9582-deltainfo.xml.gz' not found on medium 'http://download.opensuse.org/update/leap/15.3/sle/'
 - Can't provide ./repodata/2ec0d1f23dc67a2387797d873e045a03a9b89f0db8c421f59a4b5678fdbc9582-deltainfo.xml.gz

Please check if the URIs defined for this repository are pointing to a valid repository.
Skipping repository 'Update repository with updates from SUSE Linux Enterprise 15' because of the above error.

That seems to have happened in multiple container builds in the same pipeline; I counted four occurrences. So a simple immediate retry might not help either unless zypper waits longer, hence my initial proposal of retries with a wait time in between.

A manual podman run --pull=always --rm -it opensuse/leap:15.3 /bin/sh -c 'zypper ref' does not reproduce the problem right now (2021-12-13 13:15Z)

okurz commented 2 years ago

Also related to https://github.com/openSUSE/zypper/issues/399

androniychuk commented 2 years ago

Please implement this option! The problem shows up especially when downloading/installing in a container.

Overall download size: 38.2 MiB. Already cached: 0 B. After the operation, additional 166.5 MiB will be used.
Continue? [y/n/v/...? shows all options] (y): y
Retrieving package Mesa-KHR-devel-20.2.4-57.13.x86_64 (1/73), 141.7 KiB ( 10.2 KiB unpacked)
Abort, retry, ignore? [a/r/i/...? shows all options] (a): a
Media source 'https://updates.suse.com/SUSE/Products/SLE-BCI/15-SP3/x86_64/product/' does not contain the desired medium
History:
 - Timeout exceeded when accessing 'https://updates.suse.com/SUSE/Products/SLE-BCI/15-SP3/x86_64/product/media.1/media'.

Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
Please see the above error message for a hint.

Rerunning the above passes, sometimes....

Of apt, dnf and zypper, zypper has by far the MOST timeout issues.

It would be very nice to have a command line option, instead of changing something in a config file.

okurz commented 2 years ago

We were hit by this problem area again in multiple realistic cases, e.g. in https://openqa.opensuse.org/tests/2408301#step/openqa_webui/9, where a zypper in fails on a package that has gone missing from https://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed. A retry at the global level of zypper would help because metadata would be refreshed, and there is a much higher chance that the corresponding RPM files are there by then. This problem is certainly more likely with fast-changing development repositories, but it can happen to users of various products as well. In another case we temporarily failed to retrieve packages from http://download.opensuse.org/repositories/devel:/languages:/go/openSUSE_Leap_15.3/, and there were multiple other cases.
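
A retry "at the global level" can be sketched as a wrapper that refreshes metadata before every install attempt. This is only an illustration under assumptions: `install_with_refresh_retry` is a hypothetical name, and the wrapper uses the existing `zypper --non-interactive refresh` and `zypper --non-interactive install` commands rather than any built-in retry feature.

```shell
#!/bin/sh
# Hypothetical wrapper: install_with_refresh_retry MAX_ATTEMPTS WAIT_SECONDS package...
# Each attempt refreshes repository metadata first and then installs, so a
# retry does not reuse metadata that points at packages which are gone.
install_with_refresh_retry() {
    max_attempts=$1
    wait_seconds=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if zypper --non-interactive refresh &&
           zypper --non-interactive install "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
        # Only wait if another attempt follows.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$wait_seconds"
        fi
    done
    return 1
}
```

For example, `install_with_refresh_retry 3 60 some-package` would try up to three times, one minute apart (the package name is a placeholder).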

Martchus commented 2 years ago

I've just hit this problem when updating my Tumbleweed system with one of my enabled devel repositories. I'd also be in favor of having the retry automated, because when I retry manually I don't do anything more sophisticated either.

Alternatively, repositories could be designed so that old packages are kept around longer (e.g. only deleted after one day). That would prevent download errors when the repo metadata is outdated. In addition, if it was guaranteed that new packages are copied first and repo metadata is only updated afterwards, then I suppose download errors when the repo metadata is too new would be prevented as well.
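
The ordering described above can be illustrated with a small publish script. This is a sketch under stated assumptions, not how the openSUSE infrastructure publishes: `publish_repo`, the flat directory layout (`*.rpm` files next to a `repodata/` directory) and the `.superseded-*` marker files are all hypothetical.

```shell
#!/bin/sh
# Hypothetical publish step: publish_repo STAGING_DIR LIVE_DIR
# Assumes a flat layout: *.rpm files plus a repodata/ directory.
publish_repo() {
    staging=$1
    live=$2
    mkdir -p "$live"

    # 1. Copy new packages first. The old metadata still matches the old
    #    packages, which are left in place.
    cp -p "$staging"/*.rpm "$live"/

    # 2. Switch metadata only after all packages are present. The new tree
    #    is copied next to the live one and swapped in with two renames,
    #    which keeps the window without repodata/ short.
    rm -rf "$live/repodata.new" "$live/repodata.old"
    cp -Rp "$staging/repodata" "$live/repodata.new"
    if [ -d "$live/repodata" ]; then
        mv "$live/repodata" "$live/repodata.old"
    fi
    mv "$live/repodata.new" "$live/repodata"
    rm -rf "$live/repodata.old"

    # 3. Keep superseded packages for a grace period: mark them when they
    #    first disappear from staging, delete them once the marker is at
    #    least one day old.
    for rpm in "$live"/*.rpm; do
        [ -e "$rpm" ] || continue
        name=$(basename "$rpm")
        marker="$live/.superseded-$name"
        if [ -e "$staging/$name" ]; then
            rm -f "$marker"            # still current
        elif [ ! -e "$marker" ]; then
            touch "$marker"            # superseded just now
        elif [ -n "$(find "$marker" -mtime +0)" ]; then
            rm -f "$rpm" "$marker"     # grace period is over
        fi
    done
}
```

With this order, a client holding yesterday's metadata can still fetch the old packages, and a client that fetches the new metadata only sees it once the new packages are already in place.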

okurz commented 2 years ago

I now also reported https://bugzilla.opensuse.org/show_bug.cgi?id=1200370 about this

bzeller commented 2 years ago

I will implement the --download-max-retries and --download-retry-wait-time switches for failed downloads, in which case zypper will retry downloading a file if it has failed to download.

okurz commented 2 years ago

I will implement the --download-max-retries and --download-retry-wait-time switches for failed downloads, in which case zypper will retry downloading a file if it has failed to download.

Nice. Will you retry just the download of single files, or basically restart the complete zypper operation? As I understand it, zypper initially probes repos to find out whether the metadata is still up to date, unless --no-refresh is specified. It is likely necessary in some cases to do that part again in a retry as well.

Martchus commented 2 years ago

In some cases it is required to restart the complete zypper operation (when meta-data needs to be refreshed because it became outdated during the download).

But retrying the download is already an improvement. I've just had to spam r to install some packages from my home repo on Leap 15.4: https://paste.opensuse.org/22619841 - Note that the builds are from yesterday, so the mirror should have had enough time to sync. When I mentioned the issue on the #opensuse-buildservice IRC channel, another user responded quite quickly with a similar experience.

bzeller commented 2 years ago

Nice. Will you retry just the download of single files, or basically restart the complete zypper operation? As I understand it, zypper initially probes repos to find out whether the metadata is still up to date, unless --no-refresh is specified. It is likely necessary in some cases to do that part again in a retry as well.

No, this will just be a simple fix for failed downloads; it won't restart a complete operation. If a repo becomes obsolete during a refresh, that's not something zypp should handle imo.

bzeller commented 2 years ago

In some cases it is required to restart the complete zypper operation (when meta-data needs to be refreshed because it became outdated during the download).

But retrying the download is already an improvement. I've just had to spam r to install some packages from my home repo on Leap 15.4: https://paste.opensuse.org/22619841 - Note that the builds are from yesterday, so the mirror should have had enough time to sync. When I mentioned the issue on the #opensuse-buildservice IRC channel, another user responded quite quickly with a similar experience.

This is weird though: it means that for every attempted download the server responded with some error, and then it suddenly worked on the second try.

bzeller commented 2 years ago

This bug might be related: https://bugzilla.opensuse.org/show_bug.cgi?id=1200425 There might currently be an issue with the Metalink handling on servers, see my comment https://bugzilla.opensuse.org/show_bug.cgi?id=1200425#c10

Martchus commented 2 years ago

It could be related, indeed.

If a repo becomes obsolete during a refresh, that's not something zypp should handle imo.

From a user's perspective it would be quite nice if zypper could handle it, at least in non-interactive mode. Otherwise one would need to add a manual retry loop around the zypper call in all places where zypper is supposed to run unattended.

okurz commented 2 years ago

From a user's perspective it would be quite nice if zypper could handle it, at least in non-interactive mode. Otherwise one would need to add a manual retry loop around the zypper call in all places where zypper is supposed to run unattended.

Yes, that is the main point. From a design point of view I think zypper already behaves correctly. But as zypper is also the main user interface for package management, there is no other common layer that could handle sporadic download and installation problems. So we leave it to users and to system management software like salt/ansible/scripting to handle such problems, when zypper could cover these cases much more conveniently. Hence this is not a "regression" but a true "feature request".
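
What such handling looks like when left to scripting can be sketched as follows. The wrapper name `zypper_unattended` and the variable `ZYPPER_RETRY_CODES` are hypothetical. The default list of retryable exit codes (4, 7, 106) is an assumption based on the EXIT CODES section of zypper(8), where they should be listed as ZYPPER_EXIT_ERR_ZYPP, ZYPPER_EXIT_ZYPP_LOCKED and ZYPPER_EXIT_INF_REPOS_SKIPPED. Which codes are really transient in a given setup needs to be checked there.

```shell
#!/bin/sh
# Hypothetical wrapper for unattended runs:
#   zypper_unattended MAX_ATTEMPTS WAIT_SECONDS zypper-args...
# Retries only on exit codes listed in ZYPPER_RETRY_CODES (assumed default:
# "4 7 106", see zypper(8), EXIT CODES). Any other failure, such as a
# syntax or privilege error, is returned immediately without a retry.
zypper_unattended() {
    max_attempts=$1
    wait_seconds=$2
    shift 2
    attempt=1
    while :; do
        status=0
        zypper --non-interactive "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        case " ${ZYPPER_RETRY_CODES:-4 7 106} " in
            *" $status "*) ;;              # possibly transient: retry
            *) return "$status" ;;         # not worth retrying
        esac
        if [ "$attempt" -ge "$max_attempts" ]; then
            return "$status"
        fi
        attempt=$((attempt + 1))
        sleep "$wait_seconds"
    done
}
```

Every script, playbook or state that calls zypper unattended would need its own copy of something like this, which is the duplication the feature request wants to avoid.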

JanZerebecki commented 2 years ago

https://github.com/openSUSE/zypper/issues/337 is also an earlier report about a part of refresh that wasn't retried.

Duncan1224 commented 2 years ago

Is there a workaround? I'm currently upgrading two Tumbleweed machines from snapshot 20220829 to 20220831, and "Error code: Curl error 55 Error message: Connection died, tried 5 times before giving up" shows up every few minutes; download speed is very slow as well. It took me forever to manually type r to retry on my two machines. Adding download.max_silent_tries = 0 to /etc/zypp/zypp.conf doesn't work either; it showed the same error. Not sure if this is relevant: I live in Taiwan, but according to openSUSE:Mirrors I do not have to select a mirror myself.

YellowApple commented 2 years ago

I'm encountering a similar issue (chronic intermittent "Curl error 16" failures when retrieving packages, and I've got 2,720 packages to download on the very Tumbleweed machine on which I'm typing this so I ain't about to sit there typing r every 50 packages lol). Potential workaround (as root, obviously):

zypper --non-interactive dup --download-only
while [ $? -ne 0 ]; do
    echo "Download failed; retrying..."
    sleep 1
    zypper --non-interactive dup --download-only
done
zypper dup

Seems to be doing the trick (it's indeed retrying, so hopefully I can go to bed and wake up to a successfully-dup'd system), though obviously it'd be nicer for zypper to just automatically assume r.

Duncan1224 commented 2 years ago

I'm encountering a similar issue (chronic intermittent "Curl error 16" failures when retrieving packages, and I've got 2,720 packages to download on the very Tumbleweed machine on which I'm typing this so I ain't about to sit there typing r every 50 packages lol). Potential workaround (as root, obviously):

zypper --non-interactive dup --download-only
while [ $? -ne 0 ]; do
    echo "Download failed; retrying..."
    sleep 1
    zypper --non-interactive dup --download-only
done
zypper dup

Seems to be doing the trick (it's indeed retrying, so hopefully I can go to bed and wake up to a successfully-dup'd system), though obviously it'd be nicer for zypper to just automatically assume r.

Thanks a lot. I'll definitely need it for the next big upgrade.

okurz commented 1 year ago

openSUSE Tumbleweed also has the tool retry so the above command can be simplified to something like

retry -- zypper --non-interactive dup --download-only

Werkov commented 1 year ago

Is this solved with https://github.com/openSUSE/libzypp/pull/433? At least on the client side. If not what's remaining?

I know about two more instances not bound to curl errors that would better be tackled on the server side. I'm trying to figure out which component to report this to.

andrii-suse commented 1 year ago

I know about two more instances not bound to curl errors that would better be tackled on the server side. I'm trying to figure out which component to report this to.

There were scalability problems at download.opensuse.org around those days, so I would say you do not need to report that. If you see an issue again, or still want feedback about those incidents, feel free to report to admin at opensuse org or via the other channels mentioned here: https://en.opensuse.org/openSUSE:Heroes#Communication