rpm-software-management / librepo

A library providing C and Python (libcURL like) API for downloading packages and linux repository metadata in rpm-md format
http://rpm-software-management.github.io/librepo/
GNU Lesser General Public License v2.1

Don't fsync() in checksum #297

Open stewartsmith opened 5 months ago

stewartsmith commented 5 months ago

This gives a major boost in librepo performance. For a reposync of an Amazon Linux 2023 x86-64 repository on an m5n.16xlarge EC2 instance with a 500MB/sec, 3000 IOPS EBS volume, this alone reduces run time by 30 seconds of wall time, and gets reposync using nearly a whole core rather than only two thirds of one.


For reference, my benchmarking has been done on an m5n.16xlarge EC2 instance to the in-region S3 buckets as well as to the CDN repositories. That instance type has 256GB memory, a 75Gbit network connection, and is a 64 core Cascade Lake system. The root volume is a 256GB gp3 EBS volume with 500MB/sec of IO and 3000 IOPS.

The background to this is that a lot of EC2 instances don't live that long (relatively speaking), and never install RPMs except on launch - so all the time spent installing RPMs is scale-up time that would be better spent running the customer workload.

Goes well when paired with https://github.com/rpm-software-management/librepo/pull/294 and https://github.com/rpm-software-management/librepo/pull/295 and https://github.com/rpm-software-management/librepo/pull/296


What I'm not entirely sure of here is the other implications of this change - as in, what is relying on this checksum being crash safe, and should we instead re-compute it sometimes?

I'm open to putting this behind an ifdef or something if that seems safer. I'd love input here.
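To make the trade-off concrete, here is a minimal sketch in Python of a cached checksum with the flush made optional. It is not librepo's code: `cached_checksum`, the in-memory `cache` dict, and the `durable` flag are hypothetical names used only for illustration, and the sketch assumes the cache is keyed on the file's mtime and size.

```python
import hashlib
import os


def file_checksum(fd, algo="sha256", chunk_size=1 << 20):
    """Hash the whole file behind fd, reading from offset 0."""
    h = hashlib.new(algo)
    offset = 0
    while True:
        chunk = os.pread(fd, chunk_size, offset)
        if not chunk:
            break
        h.update(chunk)
        offset += len(chunk)
    return h.hexdigest()


def cached_checksum(fd, cache, key, durable=False):
    """Return (digest, was_cached).

    `cache` is a hypothetical store mapping key -> ((mtime_ns, size), digest).
    A cached digest is reused only while mtime and size are unchanged;
    otherwise the checksum is recomputed.
    """
    st = os.fstat(fd)
    stamp = (st.st_mtime_ns, st.st_size)
    entry = cache.get(key)
    if entry is not None and entry[0] == stamp:
        return entry[1], True

    digest = file_checksum(fd)
    if durable:
        # Optional: force the file data to stable storage before the
        # digest is recorded as trustworthy. This is the slow step.
        os.fsync(fd)
    cache[key] = (stamp, digest)
    return digest, False
```

The mtime/size check catches ordinary modifications and forces a recompute, but it does not by itself prove the data reached the disk before a crash; that guarantee is what `durable=True` would buy, at the cost of a flush per file.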

stewartsmith commented 5 months ago

To give an idea of what these four PRs combined do, on the same machine, we take the original librepo doing a reposync of AL2023 x86-64 repositories from 1min42s down to 1min08s.
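For scale, the two wall-clock figures above work out as follows (plain arithmetic on the quoted timings, nothing measured separately):

```python
# Wall-clock times quoted above, converted to seconds.
before_s = 1 * 60 + 42   # 1min42s
after_s = 1 * 60 + 8     # 1min08s

saved_s = before_s - after_s        # seconds saved
reduction = saved_s / before_s      # fraction of wall time removed
speedup = before_s / after_s        # throughput ratio

print(f"{saved_s}s saved, {reduction:.1%} less wall time, {speedup:.2f}x faster")
# prints "34s saved, 33.3% less wall time, 1.50x faster"
```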

stewartsmith commented 5 months ago

As an exercise, I tried removing the computing and writing of the checksum along with tweaking the max number of connections. This enabled me to get a peak of around 1.1GB/sec (to /dev/shm... disk IO was starting to become a limiting factor) when reposyncing Fedora - ending up at 1min30s to sync the latest packages from the Fedora 39 x86-64 repos.

It may be worth considering an alternative / option to the checksum.