rpm-software-management / librepo

A library providing C and Python (libcURL like) API for downloading packages and linux repository metadata in rpm-md format
http://rpm-software-management.github.io/librepo/
GNU Lesser General Public License v2.1

Avoid libc buffered IO #294

Open stewartsmith opened 5 months ago

stewartsmith commented 5 months ago

The FILE-related IO functions in libc buffer inside the C library, and are generally slower than the file-descriptor-based open()/read()/write() functions.

The FILE-based approach somewhat limits the maximum throughput of librepo and increases CPU usage. In my benchmarks of a reposync of the Amazon Linux 2023 x86-64 repositories, this move to file-descriptor-based IO saves about 1 second of user time and 0.5 seconds of system time, for a wall-clock benefit of about 3 seconds (102s vs 99s).
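One general point when replacing fwrite() with write(): fwrite() keeps going until the whole buffer is handed off or an error occurs, whereas a single write() call may transfer fewer bytes than requested or fail with EINTR. A minimal sketch of a wrapper that restores the "write everything" behavior is shown below. The name `write_all` is hypothetical and is not taken from librepo or from this patch.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not a librepo function): write all len bytes
 * of buf to fd. Returns 0 on success, or -1 on error with errno left
 * as set by write(). */
static int
write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data was written; retry */
            return -1;      /* real error; caller inspects errno */
        }
        /* Short write: advance past what was accepted and keep going. */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would use `write_all(fd, data, size)` where it previously checked that `fwrite(data, 1, size, f)` returned `size`. Unlike stdio, nothing is held back in a user-space buffer, so there is no later flush step in which a write error could surface.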

stewartsmith commented 5 months ago

For reference, my benchmarking has been done on an m5n.16xlarge EC2 instance to the in-region S3 buckets as well as to the CDN repositories. That instance type has 256GB memory, a 75Gbit network connection, and is a 64 core Cascade Lake system. The root volume is a 256GB gp3 EBS volume with 500MB/sec of IO and 3000 IOPS.

The background of this is that a lot of EC2 instances don't live that long (relatively speaking) and only install RPMs at launch, so any time spent installing RPMs is scale-up time that would be better spent running the customer workload.

ppisar commented 5 months ago

In general, I like your patch, but please address the different semantics I pointed out in-line.

stewartsmith commented 5 months ago

Thanks for the eyes on it. I can clean up the review comments, and spin a rev2.

stewartsmith commented 1 month ago

I think I've managed to address the comments and ensure correct behavior in (hopefully) all error conditions.