rpm-software-management / rpm

The RPM package manager
http://rpm.org

Examine Compressed Headers #2220

Open ffesti opened 1 year ago

ffesti commented 1 year ago

While the payload is compressed, the header is not. This wastes time and IO whenever a header is loaded - be it from a package or the RPM DB.

For the v6 format we need to examine if compressing the headers yields enough benefits to justify this change.

DemiMarie commented 1 year ago

While the payload is compressed, the header is not. This wastes time and IO whenever a header is loaded - be it from a package or the RPM DB.

For the v6 format we need to examine if compressing the headers yields enough benefits to justify this change.

Compressing the main header is okay, as it is signed. Compressing the signature header is a bad idea unless an extra signature is added after compression.
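A minimal sketch of the ordering concern above. It uses a SHA-256 digest as a stand-in for RPM's real OpenPGP signatures, and zlib as a stand-in codec; the header bytes are made up. The point it illustrates: a signature over the uncompressed main header still verifies after a decompress step, but compressed signature-header bytes would be covered by no signature at all unless one is added over the compressed form.

```python
import hashlib
import zlib

# Stand-in for a package's main header bytes (illustrative content only).
main_header = b"name=foo version=1.0 files=/usr/bin/foo ..." * 10

# "Sign" the uncompressed bytes; real RPM uses OpenPGP, not a bare digest.
digest = hashlib.sha256(main_header).hexdigest()

# What would actually be stored in the package.
stored = zlib.compress(main_header)

# Verification still works for the main header: decompress, then check.
assert hashlib.sha256(zlib.decompress(stored)).hexdigest() == digest

# Note the order: decompression runs on as-yet-unverified data. If the
# signature header itself were compressed, there would be no outer
# signature covering those compressed bytes at all.
```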

dralley commented 1 year ago

IMO this would somewhat devalue the RPM format as opposed to something like Debian or Arch packages, which basically just store compressed files in an archive.

A benefit of RPM is that you don't have to do a lot of processing to get at the metadata, so in theory working with the headers can be more efficient. (I actually have no idea how true that is in practice with modern hardware, comparatively speaking). If you put all the metadata into one big compressed blob which forces you to pay the cost of decompression before you can access any of the metadata, it feels like you would end up in the worst of both worlds - a complex specialty file format that doesn't provide the intended benefits of having a complex specialty file format (at least not all of them).

Random-ish data (like signatures and checksums) is not very compressible, anyway, and it sounds like a lot of the recent upward pressure on header sizes is coming from this. There were some estimations done when there was a proposal about adding file IMA signatures to repo metadata: https://github.com/rpm-software-management/createrepo_c/issues/313#issuecomment-1057223876

File lists OTOH are fairly compressible, so I won't claim the benefit would be zero.
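The contrast between the two kinds of header data is easy to demonstrate. This sketch uses zlib (which ships with Python) as a stand-in for zstd, and synthetic data: a repetitive file list versus random bytes standing in for checksums/signatures. The exact ratios vary by codec, but the gap is what matters.

```python
import os
import zlib

# Synthetic file list: repetitive paths, like a real header's file list.
file_list = b"\n".join(b"/usr/share/doc/pkg/file%d.txt" % i for i in range(1000))

# Random bytes stand in for checksum/signature data of the same size.
signatures = os.urandom(len(file_list))

ratio_files = len(zlib.compress(file_list)) / len(file_list)
ratio_sigs = len(zlib.compress(signatures)) / len(signatures)

print(f"file list compresses to {ratio_files:.0%} of original")
print(f"random 'signature' data compresses to {ratio_sigs:.0%} of original")
```

On typical runs the file list shrinks to a small fraction of its size while the random data does not shrink at all (zlib's framing can even make it slightly larger).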

dralley commented 1 year ago

All of which isn't to say that it's not a problem that requires solving, but if it requires a file format revision to solve then maybe we should also examine if a change like this would enable certain other simplifications.

mlschroe commented 1 year ago

Just a data point: when I created ndb I played with putting lzo compressed headers in the database. I thought that decompression would be faster than IO, so the net result would be faster database access. But it turned out that things got slower with compression, so I removed the feature again. (This was quite some time ago, so take this with a grain of salt.)

Compressing the main headers in the package format sounds more promising.

dralley commented 1 year ago

LZO doesn't have particularly fast decompression by modern standards, as far as I'm aware; zstd is 2-3x faster and lz4 is 5x faster according to these numbers: https://github.com/facebook/zstd
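Codec speed claims like these are easy to sanity-check locally. A rough micro-benchmark sketch, using only stdlib codecs (zlib and lzma as examples, since lz4/zstd bindings may not be installed) on header-like repetitive text; the numbers it prints are entirely machine-dependent.

```python
import lzma
import time
import zlib

# ~1 MB of repetitive, header-like text.
data = b"/usr/lib/python3/site-packages/example/module.py\n" * 20000

for name, comp, decomp in [("zlib", zlib.compress, zlib.decompress),
                           ("lzma", lzma.compress, lzma.decompress)]:
    blob = comp(data)
    start = time.perf_counter()
    for _ in range(50):  # decompress repeatedly to get a measurable interval
        decomp(blob)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(blob)} compressed bytes, "
          f"{50 * len(data) / elapsed / 1e6:.0f} MB/s decompression")
```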

dralley commented 1 year ago

Here's a script I threw together in 30 minutes to get a rough estimate of the potential savings:

#!/usr/bin/env python

import os
import sys

import createrepo_c as cr  # pip: createrepo_c
import lz4.frame           # pip: lz4
import zstd                # pip: zstd

results = {}

with os.scandir(sys.argv[1]) as entries:
    for entry in entries:
        if not entry.is_file() or not entry.path.endswith(".rpm"):
            continue

        pkg = cr.package_from_rpm(
            entry.path,
            location_href=str(entry.path),
            checksum_type=cr.SHA256,
        )
        header_size_bytes = pkg.rpm_header_end - pkg.rpm_header_start
        with open(entry.path, "rb") as f:
            f.seek(pkg.rpm_header_start)
            header = f.read(header_size_bytes)

        zstd_header = zstd.ZSTD_compress(header)
        lz4_header = lz4.frame.compress(header)

        results[str(entry.path)] = {
            "header_size": header_size_bytes,
            "package_size": pkg.size_package,
            "archive_size": pkg.size_archive,
            "header_size_zstd": len(zstd_header),
            "header_size_lz4": len(lz4_header),
        }

total_size_headers = 0
total_size_packages = 0
total_size_archives = 0
total_size_lz4 = 0
total_size_zstd = 0

print("Results for {} packages".format(len(results)))

for package, data in results.items():
    total_size_headers += data["header_size"]
    total_size_packages += data["package_size"]
    total_size_archives += data["archive_size"]
    total_size_lz4 += data["header_size_lz4"]
    total_size_zstd += data["header_size_zstd"]

print("Average header size as proportion of package total: {:.2f}%".format(total_size_headers / total_size_packages * 100))
print("Average header savings with LZ4 compression: {:.2f}%".format((1 - (total_size_lz4 / total_size_headers)) * 100))
print("Average header savings with ZSTD compression: {:.2f}%".format((1 - (total_size_zstd / total_size_headers)) * 100))
print("Average total savings with LZ4 compressed headers: {:.2f}%".format(
    (1 - (total_size_packages - (total_size_headers - total_size_lz4)) / total_size_packages) * 100)
)
print("Average total savings with ZSTD compressed headers: {:.2f}%".format(
    (1 - (total_size_packages - (total_size_headers - total_size_zstd)) / total_size_packages) * 100)
)


dralley commented 1 year ago

Notice: just fixed some math bugs in the above script

dralley commented 1 year ago

I ran the script against all the RPMs in RHEL 9 baseos + appstream repos. Results:

[dalley@thinkpad devel]$ python compressed_header_test.py repos/rhel9-combined-pkgs/Packages/
Results for 10644 packages
Average header size as proportion of package total: 1.97%
Average header savings with LZ4 compression: 53.02%
Average header savings with ZSTD compression: 70.65%
Average total savings with LZ4 compressed headers: 1.05%
Average total savings with ZSTD compressed headers: 1.40%

dralley commented 9 months ago

How would one compress the header section without altering some aspect of the format significantly (in a way that would not be trivial to backport)?

pmatilai commented 9 months ago

You can't. So header compression at the package level would have to be an optional feature.
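One hypothetical shape for such an optional feature: since both the classic RPM header section and a zstd frame begin with distinctive magic bytes, a reader could sniff which form it is looking at. The magic values below are real; everything else in this sketch is an illustration, not a description of any agreed-upon format change.

```python
# Real magic values: every v3/v4 RPM header section starts with 8E AD E8,
# and every zstd frame starts with 28 B5 2F FD.
RPM_HEADER_MAGIC = b"\x8e\xad\xe8"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"

def header_bytes(raw: bytes) -> bytes:
    """Return the uncompressed header region of a package (sketch).

    Hypothetical scheme: old packages keep the familiar header magic and
    old tools keep working; packages opting into the feature store a zstd
    frame at the header offset instead, which old tools would reject.
    """
    if raw.startswith(RPM_HEADER_MAGIC):
        return raw  # classic uncompressed header, pass through unchanged
    if raw.startswith(ZSTD_MAGIC):
        # A real implementation would decompress with a zstd library here.
        raise NotImplementedError("zstd-compressed header region")
    raise ValueError("unrecognized header region")
```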

dralley commented 4 months ago

Doesn't seem worth it to me