openSUSE / zypper

World's most powerful command line package manager
http://en.opensuse.org/Portal:Zypper

[feature request] Download multiple packages at the same time #104

Open zutto opened 7 years ago

zutto commented 7 years ago

When doing updates that require downloading of hundreds of packages, it would be great if Zypper supported downloading more than one package at a time.

This would significantly speed up the update process without having to hack a local repository.

The downside right now is that most repository servers still serve content over HTTP/1.x and might have to throttle clients based on the number of connections they open. But HTTP servers are moving towards HTTP/2 at a fast rate, and that would mitigate this issue completely if Zypper supported it as well.

The upside is that with home connections at 1 Gb/s these days, mobile connections at ~5-150 Mb/s, and 1 Gb/s mobile connections (5G) coming to the consumer market in 3-4 years, the experience would be much better for the user.

Thoughts? Yay's/nay's? Blockers?

Qantas94Heavy commented 7 years ago

In terms of the feature itself it should speed things up, though I'm not familiar with how exactly we would implement parallel downloading of packages in libzypp and whether that would require any additional dependencies.

In most cases, download and installation are separate phases, so parallel downloading isn't that different from the current situation, though there may be issues for other modes (e.g. downloading and installing in a single pass).

avindra commented 7 years ago

I like the way that docker handles the parallel downloading of multiple filesystem layers. It will only pull a (configurable?) maximum number of layers in parallel, and even extract layers in order as they become available.

This would be a huge benefit to the openSUSE ecosystem (esp. for Tumbleweed users who update frequently).

liamdawson commented 6 years ago

Connections to my mirrors feel slower in Australia compared to other distros such as Fedora or Ubuntu. I suspect parallel downloads would help make it feel faster, at the least, which is important when I'm doing a 1,200 package zypper dup on Tumbleweed. (It generally seems to take a minimum of a second for each package, regardless of size).

romulasry commented 6 years ago

Definitely would be nice.

romulasry commented 6 years ago

You could use the SHA-1 checksum of the file to see if it is the same in other repos. That would work.

harish2704 commented 5 years ago

I just wrote a Python script that prints the package URLs for a zypper dup command. These URLs can be downloaded in parallel into the cache using CLI tools. Hope someone finds it useful: https://gist.github.com/harish2704/54d5f68fa606419ec4777847fba9f511
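As a rough illustration of that approach (this is only a sketch, not the gist above), a URL list like that can be fetched with a bounded worker pool; urls.txt, the destination directory and the worker count below are placeholders to adapt:

```python
# Minimal sketch: fetch a list of package URLs in parallel with a bounded
# thread pool. "urls.txt" (one URL per line), DEST and MAX_WORKERS are
# placeholders; this is not zypper code.
import concurrent.futures
import os
import urllib.request

DEST = "./packages"   # download target, e.g. a directory you later copy into the cache
MAX_WORKERS = 8       # number of simultaneous downloads

def fetch(url):
    target = os.path.join(DEST, os.path.basename(url))
    urllib.request.urlretrieve(url, target)
    return target

os.makedirs(DEST, exist_ok=True)
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    for path in pool.map(fetch, urls):
        print("downloaded", path)
```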

shasheene commented 5 years ago

This is an incredibly important feature. A highly typical openSUSE workload on continuous integration servers involves first executing a number of zypper install pkg_name steps in a fresh openSUSE environment (such as within a container). My connections to an openSUSE mirror via Zypper appeared limited to around 10 Mbps (regardless of that remote server's utilisation). This wastes a lot of time unnecessarily. The zypper client should try to utilise all available bandwidth (with the server keeping the responsibility to load-balance bandwidth between clients).

The official OpenSUSE mirror list suggests mirrors support HTTP, FTP and rsync. I believe rsync is only used for mirror-to-mirror synchronisation (not by Zypper), but I don't know.

To talk about this feature request in concrete terms, here's a description of the parallelism available at several levels:

  1. Download multiple RPMs concurrently from a single repository
    • This improves total transfer speed if there are a large number of packages
    • Should be easy to implement for HTTP, FTP clients
  2. Download a single RPM with multiple connections (starting at different byte offsets) from a single URL (known as "segmented file-transfer")
    • This improves total transfer speed if there is a single large package.
    • Note: the now-removed ZYPP_ARIA2C environment variable apparently did this.
  3. Download a single RPM with multiple connections (starting at different byte offsets) from multiple repositories (also a segmented file-transfer)

The first two features are unlikely to require many changes. The file list that's presumably generated internally can simply be processed by multiple download threads. Restoring the ability to do segmented file transfers via aria2c would also be a great thing to have.

The third feature would likely require far more changes to implement for less net improvement. I would think it would mostly be useful for people who have gigabit(s) of bandwidth but no single remote server able to match their connection. From a technical view, the zypper client already has the repositories' metadata, meaning we have the SHA-1 hashes across N repositories; that's enough information to generate a URL for each remote repository, know it's the same file, and then use a segmented file-transfer HTTP client application or library to conduct the download.
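To make item 2 concrete, here is a rough sketch of a segmented transfer (only an illustration, not what the old ZYPP_ARIA2C path did): ask the server for the file size, fetch byte ranges in parallel, then reassemble them. It assumes the server honours HEAD and Range requests; the URL and segment count are placeholders.

```python
# Sketch of a segmented file transfer: split one file into byte ranges,
# download the ranges in parallel, then reassemble them in order.
# Assumes the server supports HEAD and Range requests; URL is a placeholder.
import concurrent.futures
import urllib.request

URL = "https://example.org/repo/x86_64/some-package.rpm"  # placeholder
SEGMENTS = 4

def content_length(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])

def fetch_range(url, start, end):
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

size = content_length(URL)
step = size // SEGMENTS
ranges = [(i * step, size - 1 if i == SEGMENTS - 1 else (i + 1) * step - 1)
          for i in range(SEGMENTS)]

with concurrent.futures.ThreadPoolExecutor(max_workers=SEGMENTS) as pool:
    parts = list(pool.map(lambda r: fetch_range(URL, *r), ranges))

with open("some-package.rpm", "wb") as out:
    for part in parts:
        out.write(part)
```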

bzeller commented 5 years ago

Thanks, we already started to think about media backend improvements: https://bugzilla.opensuse.org/show_bug.cgi?id=1125470

As for your points: zypper already uses segmented file transfers from multiple different mirrors by utilizing metalinks. However, this is only possible with bigger files and only with the official servers, so we already have some of that implemented. I think the next step really is to go for parallel downloads. This, however, will require rather big changes in the media backend and is nothing that can happen overnight.

The problem here is not running multiple downloads at the same time, but rather that the media backend API is not prepared to receive multiple file requests at the same time, and that the frontends (zypper/YaST) are not prepared to receive multiple download progress reports at the same time. So while the very core of the feature would be easy to implement, we also need to adapt lots of other code for this to work.

harish2704 commented 5 years ago

Hi @bzeller, if the parallel downloading feature requires changes to core components of Zypper, then as a workaround it would be helpful if we could simply print the URLs (for example, add a CLI option --print-urls that does a dry run without downloading the RPMs).

If we can get a list of URLs, we can at least download the packages (in parallel) into the cache and then re-run the Zypper command.

Wardrop commented 5 years ago

Agreed. In the age of rolling releases, this is a fairly critical feature. 1400 packages, with a minimum of 2 seconds per package due to latency of each HTTP request, adds up to close to an hour, regardless of the bandwidth you may have at your disposal.

bbhoss commented 5 years ago

Try zypper install texlive-latex to feel real pain here. It's basically unusable with IPv6, since the mirror it chooses is over 100 ms from me. Not sure if that's my ISP's fault or the mirror's, but I have to disable IPv6 on the system for major package installations or system upgrades. Parallel downloads would make this issue much less disruptive.

awerlang commented 4 years ago

I wrote a POC for downloading packages in parallel. With this script a huge update takes a third of the time (download only).

It doesn't:

https://gist.github.com/awerlang/b792a3f908206a90ad58ba559c5400bb

Improvements welcome.

awerlang commented 4 years ago

I improved the POC I posted above and got a 5x speed-up compared to a regular zypper dup --download-only.

Now it also does:

bzeller commented 4 years ago

Please note that zypper also does a checksum check of every file it downloads. Your script is not checking those. Since you are downloading the files directly into the cache, you are telling zypper "this file is safe and was checked", so the checksum is not checked again.

awerlang commented 4 years ago

@bzeller thanks for the feedback. I'd like to add this missing step, so I've been reading about checksumming the packages. Is it enough to call rpm --checksig *.rpm on each .rpm? Or else, how can I retrieve the checksums? From the repo's primary.xml?

Also, I thought zypper was already doing some sort of integrity check on the packages, because zypper in reported an empty cache after an aborted download (original size on disk). After resuming the download with the script and running zypper in again, the file appeared as cached.

awerlang commented 4 years ago

I did some testing:

  1. Change one byte in a package from tumbleweed-oss: zypper ignores it, aria notices and re-downloads;
  2. Replace a package with another from the same repository (packman): zypper ignores it, aria re-downloads (not sure why), but zypper still ignores the package (don't know why);
  3. Replace a package with another from another repository: zypper ignores it;

Zypper considered the broken/replaced package as not in cache. rpm -K didn't complain about signatures except for test 1.

  1. Download a random .rpm: rpm -K fails

Anyways, I don't want to hijack the issue. I'm on board on having this feature on zypper as well though.

bzeller commented 4 years ago

@bzeller thanks for the feedback. I'd like to add this missing step, so I've been reading about checksumming the packages. Is it enough to call rpm --checksig *.rpm on each .rpm?

You need to check the checksum from the metadata (primary.xml); the signature is a different story.

Or else, how can I retrieve the checksums? From the repo's primary.xml?

Yes, that would be the correct file, but you would first need to parse repomd.xml to learn the exact name of the primary.xml if you pull it from the server, or you'd need to take it from /var/cache/zypp/raw/<repo>/<checksum>-primary.xml.gz. But if you take it from the raw cache, you'd first need to do a refresh (that's another thing zypper does automatically that costs some time).
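To sketch that flow concretely (only an illustration, not libzypp's code): fetch repomd.xml from a repo's baseurl to find the primary metadata, then compare a downloaded RPM against the checksum listed there. The standard rpm-md repodata/ layout and metadata namespaces are assumed; the baseurl and file names are placeholders.

```python
# Sketch: read package checksums from a repo's rpm-md metadata and verify a
# downloaded RPM against them. Assumes the standard repodata/ layout; the
# baseurl is a placeholder, and this is not an official libzypp API.
import gzip
import hashlib
import urllib.request
import xml.etree.ElementTree as ET

REPO_NS = "{http://linux.duke.edu/metadata/repo}"
COMMON_NS = "{http://linux.duke.edu/metadata/common}"

def primary_url(baseurl):
    """Parse repomd.xml to learn the exact name of the primary.xml.gz file."""
    with urllib.request.urlopen(baseurl + "/repodata/repomd.xml") as resp:
        repomd = ET.parse(resp).getroot()
    for data in repomd.findall(REPO_NS + "data"):
        if data.get("type") == "primary":
            return baseurl + "/" + data.find(REPO_NS + "location").get("href")
    raise RuntimeError("no primary metadata found")

def package_checksums(baseurl):
    """Map each package's location href -> (checksum type, hex digest)."""
    with urllib.request.urlopen(primary_url(baseurl)) as resp:
        primary = ET.fromstring(gzip.decompress(resp.read()))
    checksums = {}
    for pkg in primary.findall(COMMON_NS + "package"):
        href = pkg.find(COMMON_NS + "location").get("href")
        csum = pkg.find(COMMON_NS + "checksum")
        checksums[href] = (csum.get("type"), csum.text)
    return checksums

def verify(rpm_path, checksum_type, expected):
    """Compare a downloaded file against the digest from primary.xml."""
    with open(rpm_path, "rb") as f:
        return hashlib.new(checksum_type, f.read()).hexdigest() == expected
```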

bzeller commented 4 years ago

I did some testing:

1. Change one byte in a package from tumbleweed-oss: zypper ignores it, aria notices and re-downloads;

2. Replace a package with another from the same repository (packman): zypper ignores it, aria re-downloads (not sure why), but zypper still ignores the package (don't know why);

3. Replace a package with another from another repository: zypper ignores it;

Zypper considered the broken/replaced package as not in cache. rpm -K didn't complain about signatures except for test 1.

1. Download a random .rpm: `rpm -K` fails

Anyways, I don't want to hijack the issue. I'm on board on having this feature on zypper as well though.

Once a file is in the cache, zypper treats it as "already downloaded and checksummed", IIRC... @mlandres knows that in more detail, but IIRC it would not download it again...

mlandres commented 4 years ago

Package::cachedLocation considers the checksum, so it should re-download a non-matching package. This was introduced because some repos offered the same package (NVRA) but with different content. But basically a file in the cache is expected to be sane. Signature checks performed when downloading are not repeated. If a broken/suspicious file is in the cache, we assume the user was notified and accepted using it.

awerlang commented 4 years ago

Thanks for the confirmation. Just on principle, can I instruct zypper to ingest these *.rpm files from the local dir into the zypper cache? That way there's no bypassing of zypper's safety checks.

harish2704 commented 4 years ago

Hi, I just wrote another utility to print URLs using zypper: https://gist.github.com/harish2704/fdea058e86f1a1e1b806700b061ade2e

Since the Python bindings for libzypp are no longer maintained, my earlier script cannot be used easily.

This utility is actually a two-line bash script; it derives the package URLs from the solver state.

awerlang commented 4 years ago

Hi, I just wrote another utility to print URLs using zypper: https://gist.github.com/harish2704/fdea058e86f1a1e1b806700b061ade2e

Since the Python bindings for libzypp are no longer maintained, my earlier script cannot be used easily.

This utility is actually a two-line bash script; it derives the package URLs from the solver state.

The assumption that the architecture is part of the URL isn't valid. See the Visual Studio Code repository, for example.

savek-cc commented 4 years ago

Is anyone currently working on this?

bernardosulzbach commented 4 years ago

Agreed. In the age of rolling releases, this is a fairly critical feature. 1400 packages, with a minimum of 2 seconds per package due to latency of each HTTP request, adds up to close to an hour, regardless of the bandwidth you may have at your disposal.

I would like to emphasize this argument for this feature request. Not having parallel downloads makes network latency a large bottleneck for a task (downloading a few GBs of static data) that should be latency-insensitive.

bzeller commented 4 years ago

We are working on easing that problem. It requires lots of restructuring, so it takes time.

romulasry commented 4 years ago

Subscribe.

D0048 commented 3 years ago

Any news on the progress of this?

pruebahola commented 3 years ago

any update?

mlandres commented 3 years ago

Slowly but steadily. We're currently merging the new backend code into master. In one of the next libzypp releases it will be possible to enable the new backend code via an environment variable, and the new code will then take over the downloading.

But note that this will not improve anything yet.

This is solely for testing and hardening the new downloader. While it is still driven by the same old serial workflows, there will probably be no noticeable improvement. That will be the next step.

Mirppc commented 3 years ago

You are kidding... right?

Currently MirrorBrain is sending me to the closest mirror with the lowest ping, which happens to be running at around 20 kb/s. Being able to download multiple packages at the same time at 20 kb/s would be a HUGE improvement over the current timeout-fest that exists.

What's more, the option to enable multiple downloads already exists in zypp.conf. It currently does nothing.

Now for the new backend: will this be in the zypper development repository for testing? If yes, then link it here, as I will gladly test it.

mlandres commented 3 years ago

@Mirppc the download.max_concurrent_connections option in zypp.conf refers to the number of parallel connections we use internally to download the chunks of a single package if we get a metalink file from the server. Those connections, however, are only visible in zypper.log, which makes it hard to spot issues when, as in bsc#1183205, download.opensuse.org delivers poor metalink files. The bug report contains some hints on how to find the connections we use in the log, in case you want to check it.
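For reference, that option lives in /etc/zypp/zypp.conf under [main]; the value below is the default as far as I recall, and, as explained above, it only affects the metalink chunks of a single download, not parallel downloads of multiple packages:

[main]
# parallel connections used for the metalink chunks of a single file
download.max_concurrent_connections = 5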

tecosaur commented 3 years ago

I just wanted to add here, latency is probably the main factor in updates for me. I'm currently doing a tumbleweed dup with 7200 packages, at 2s/package that's 240 minutes or 4h.

harish2704 commented 3 years ago

Currently I'm using the dnf package manager on Tumbleweed (mainly for distribution upgrades). dup with zypper is not practical, especially if you don't have super-fast internet. Previously I had written a script to pre-download files (using the Python bindings for libzypp), but the Python bindings became unmaintained, so I switched to dnf.

teohhanhui commented 3 years ago

HTTP/2 allows reusing a single TCP connection for many requests (multiplexing). That should be a big win in reducing per-request overhead?

tamara-schmitz commented 3 years ago

If this is just about speed: I sped up TW updates noticeably by switching from the official openSUSE download server to a local mirror, ftp.fau.de. Not only did download speeds improve for big packages, but also for smaller ones, where connecting may take longer than transferring the data. One could investigate there.

Mirppc commented 3 years ago

That would be nice if there were mirrors near me. I let MirrorBrain deal with it, even though it goes by lowest ping in my experience.

oshanz commented 2 years ago

New zypper HTTP backend: Another project we have been working on is parallelizing downloads, for this a new async downloader backend was implemented. While it currently won't have massive impact on performance due to the frontend code not requesting files asynchronously, it will do some additional mirror rating and as soon as we update the frontend code will bring more benefits. This can be enabled via setting the env var: ZYPP_MEDIANETWORK=1

https://lists.opensuse.org/archives/list/factory@lists.opensuse.org/thread/YEFSZZY7FRGJPGPWKNCUT3UXQWVENOCL/

the-dem commented 2 years ago

Parallel downloading of the metadata would be nice during a refresh. Even fetching one repo's files in parallel at a time (though multiple repos at once would be even better) would speed it up slightly. Checking several repos at once at the start, to see whether they have changed, would definitely make things quicker.

13werwolf13 commented 2 years ago

I will add my opinion. Parallel updating of metadata from the repositories would really speed things up; I have more than 40 repositories added to my system, and sometimes I wait longer for metadata updates than for the actual package work. Parallel downloading of packages will make sense if the user has repositories living on different servers (for example, proprietary nvidia and/or vivaldi repositories are added).

mlandres commented 2 years ago

@13werwolf13 right. We meanwhile offer a tech preview. For now the preview just contains the new downloader code. As the workflows are still serial, the preview is currently mainly for hardening the code. But we are continuously going to enhance it and ship it to TW and, via online updates, to all Code15 distros.

In addition to the environment variables mentioned in the announcement, you can permanently enable the preview in /etc/zypp/zypp.conf too:

[main]
#techpreview.ZYPP_SINGLE_RPMTRANS=1
techpreview.ZYPP_MEDIANETWORK=1

13werwolf13 commented 2 years ago

In addition to the environment variables mentioned in the announcement, you can permanently enable the preview in /etc/zypp/zypp.conf too:

techpreview.ZYPP_MEDIANETWORK=1

I tried ZYPP_MEDIANETWORK=1 today; unfortunately I didn't notice any difference. My experiment looked like this:

1. zypper ref
2. export ZYPP_MEDIA_CURL_IPRESOLVE=4 && zypper dup --allow-vendor-change -d -y
3. export ZYPP_MEDIA_CURL_IPRESOLVE=4 && export ZYPP_MEDIANETWORK=1 && zypper dup --allow-vendor-change -d -y

Unfortunately I did not save the results of the experiment, but the difference was within the margin of error.

Of course I cleaned the package cache between steps 2 and 3.

P.S.: Sorry if my poor knowledge of the language gets in the way of understanding what I said.

Mirppc commented 2 years ago

I will add my opinion. Parallel updating of metadata from the repositories would really speed things up; I have more than 40 repositories added to my system, and sometimes I wait longer for metadata updates than for the actual package work. Parallel downloading of packages will make sense if the user has repositories living on different servers (for example, proprietary nvidia and/or vivaldi repositories are added).

There is another thing as well. If the closest repos cap out at 10 Mbit/s per download but the person is on fiber or a high-speed DOCSIS connection with 100 Mbit/s or more to the outside world, downloading multiple packages at the same time maxes out their connection far better than downloading one at a time. That also assumes MirrorBrain or its replacement sends you to a fast repo, and not one that has the lowest ping but caps download speeds at maybe 250 kbit/s (I get this a lot).

oshanz commented 2 years ago

Parallel downloading of packages will make sense if the user has repositories living on different servers (for example, proprietary nvidia and/or vivaldi repositories are added).

@13werwolf13 could you please elaborate on that? Do you think parallel downloading from a single server is not going to reduce the total download time? Are there any limitations that prevent a single server from delivering in parallel, or any bottleneck at the protocol level?

13werwolf13 commented 2 years ago

@oshanz No, of course in most cases parallel downloading from one server also makes sense, but experience shows that downloading from different servers makes much more sense and gives a greater speed increase (although considering how much time has passed since then, that experience may no longer be relevant).

As an experiment, I would also think about the delivery of packages through something like a torrent or DC++, but with the amendment to check the package from a central server. Repositories and mirrors are good, but if I want to download a foo-1.2.3 package weighing more than 500 megabytes, I can download it much faster from three neighbours on the network than from a mirror. I think those who use, for example, cross-avr-gcc (any version), which is huge and lives in a home: repository and therefore doesn't get onto the mirrors, will agree with me. I don't know if I'm alone in this, but I have noticed that with openSUSE, the bigger the package, the lower the download speed (and I'm talking about the speed, not the time spent, which would be logical).

Mirppc commented 2 years ago

Parallel downloading of packages will make sense if the user has repositories living on different servers (for example, proprietary nvidia and/or vivaldi repositories are added).

@13werwolf13 could you please elaborate on that? Do you think parallel downloading from a single server is not going to reduce the total download time? Are there any limitations that prevent a single server from delivering in parallel, or any bottleneck at the protocol level?

I kinda gave an example of this in my post. Let's say you have 10 packages of 100 MB each to download from one repo, which has a per-download cap of 1 MB/s.

With the current method you download one package at a time at 1 MB/s. Depending on speed fluctuations, that takes about 17 minutes total.

Now calculate it with 3 packages downloading at the same time, each at 1 MB/s. The total time spent caching packages would then be around 6 minutes.

Now expand this to several thousand packages across oss, non-oss, update-oss, update-non-oss and Packman at minimum, and you have quite a bit of time savings.
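As a quick sanity check of those numbers (the sizes and the per-connection cap are the hypothetical values from this example, not measurements):

```python
# Back-of-the-envelope check of the example above.
packages = 10
size_mb = 100            # MB per package
per_conn_mb_s = 1.0      # MB/s cap per download

serial_min = packages * size_mb / per_conn_mb_s / 60
parallel_min = serial_min / 3        # three downloads running at once

print(f"serial: ~{serial_min:.0f} min, 3-way parallel: ~{parallel_min:.1f} min")
# -> serial: ~17 min, 3-way parallel: ~5.6 min
```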

Also, it becomes less likely that foo-1.2.3.rpm got rolled back or updated in the time it took to get around to downloading it, leaving an RPM that no longer exists on the repo. This has happened to me so often.

teohhanhui commented 2 years ago

delivery of packages through something like a torrent … but with the amendment to check the package from a central server

That check would be redundant as they are already sha-256 verified:

https://blog.libtorrent.org/2020/09/bittorrent-v2/

13werwolf13 commented 2 years ago

@Mirppc

I do not know what limitations the mirrors the system chooses for me have, but I noticed that if I start updating two or more machines from behind the same IP, the chance of a failed download increases. I suspect some servers limit the number of connections from a single IP.

13werwolf13 commented 2 years ago

delivery of packages through something like a torrent … but with the amendment to check the package from a central server

That check would be redundant as they are already sha-256 verified:

https://blog.libtorrent.org/2020/09/bittorrent-v2/

I clarified it just in case, because usually when I talk about downloading software from other users, there is always someone who beats their chest and shouts "IT'S NOT SAFE!!!"

Mirppc commented 2 years ago

@Mirppc

I do not know what limitations the mirrors the system chooses for me have, but I noticed that if I start updating two or more machines from behind the same IP, the chance of a failed download increases. I suspect some servers limit the number of connections from a single IP.

This is indeed something that is done, and it leads to a speed cap per IP. I learned this when dealing with repos and downloading ISOs directly from the mirrors hosted by Argonne National Lab and Pacific Northwest National Lab here in the US.

13werwolf13 commented 2 years ago

@Mirppc I do not know what limitations the mirrors the system chooses for me have, but I noticed that if I start updating two or more machines from behind the same IP, the chance of a failed download increases. I suspect some servers limit the number of connections from a single IP.

This is indeed something that is done, and it leads to a speed cap per IP. I learned this when dealing with repos and downloading ISOs directly from the mirrors hosted by Argonne National Lab and Pacific Northwest National Lab here in the US.

Note that in the case of my mirrors this does not affect wget or tools like netboot.xyz, but it does affect zypper and other package managers.