Drop "cpio" libraries and write something semi-custom, because RPM doesn't use vanilla CPIO

dralley commented 1 year ago

See the "Payload" section of the website: https://rpm-software-management.github.io/rpm/manual/format.html

Payload

The Payload is currently a cpio archive, gzipped by default. The cpio archive type used is SVR4 with a CRC checksum.

As cpio is limited to 4 GB (32 bit unsigned) file sizes RPM since version 4.12 uses a stripped down version of cpio for packages with files > 4 GB. This format uses 07070X as magic bytes and the file header otherwise only contains the index number of the file in the RPM header as 8 byte hex string. The file metadata that is normally found in a cpio file header - including the file name - is completely omitted as it is stored in the RPM header already.

So, we should fork cpio-rs (providing the appropriate credits of course), strip it down to the subset we need, and change the magic bytes constant.

Luckily the CPIO format is pretty simple and the library is only a few hundred lines, so it's not a big deal.

Subsequently we need to change the PAYLOADFORMAT tag, but upstream RPM still uses cpio as the name, so we'll have to wait until they pick something.
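
For reference, the stripped-down entry header quoted above could be parsed roughly like this. This is only a minimal sketch based on the two facts in the quoted documentation (the `07070X` magic and the 8-character hex file index); `parse_stripped_header` and `StrippedHeaderError` are hypothetical names, not part of cpio-rs or this crate, and anything that follows the 14 header bytes (alignment padding, file data, the archive trailer) is deliberately left out.

```rust
/// Magic bytes of the stripped-down entry header, per the quoted RPM docs.
const STRIPPED_MAGIC: &[u8] = b"07070X";
/// 6 bytes of magic followed by the 8-character hex file index.
const STRIPPED_HEADER_LEN: usize = 6 + 8;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum StrippedHeaderError {
    TooShort,
    BadMagic,
    BadIndex,
}

/// Hypothetical helper: reads one stripped header from the start of `buf`
/// and returns the index of the file in the RPM header.
fn parse_stripped_header(buf: &[u8]) -> Result<u32, StrippedHeaderError> {
    if buf.len() < STRIPPED_HEADER_LEN {
        return Err(StrippedHeaderError::TooShort);
    }
    if &buf[..6] != STRIPPED_MAGIC {
        return Err(StrippedHeaderError::BadMagic);
    }
    let field = &buf[6..STRIPPED_HEADER_LEN];
    // Reject anything that is not a hex digit up front, because
    // `from_str_radix` would otherwise accept a leading '+'.
    if !field.iter().all(u8::is_ascii_hexdigit) {
        return Err(StrippedHeaderError::BadIndex);
    }
    let hex = std::str::from_utf8(field).map_err(|_| StrippedHeaderError::BadIndex)?;
    u32::from_str_radix(hex, 16).map_err(|_| StrippedHeaderError::BadIndex)
}
```

Note that the function only yields an index. The file name, size, and mode all have to come from the RPM header.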

drahnr commented 1 year ago

A real concern: Do we want to support rpms with cpio-like archives larger than 4 GB? It feels like we would be taking on a lot of pain to support an antipattern. Are there idiomatic use cases that require rpms larger than 4 GB?

dralley commented 1 year ago

@drahnr The example that typically comes up is games, which often include many large assets, or ML models, or their training data. In practice those are rarely distributed as system packages but it is possible and has been done.

drahnr commented 1 year ago

My question: Are we anticipating this crate being used for games, using rpm-rs rather than rpmbuild? Resources are limited, and this doesn't strike me as a good return on investment for them.

dralley commented 1 year ago

It's not just a matter of writing but also reading. I'm not sure I want to assume that nobody will ever want to use this crate to process the contents of existing such RPMs.

I don't know that it's such a drain on resources. cpio is pretty simple, the code for both reading and writing them is only about 400 lines excluding tests and is pretty stable.

drahnr commented 1 year ago

Tbh, I'd prefer we create a separate rpm-cpio in the org, rather than moving it into the codebase, and just replace the dependency. Does that sound fair? We can then go forward and rebase on any upstream changes as needed rather than having to backport code manually.

newpavlov commented 1 year ago

I also think that a separate crate would be a better approach. Maybe you should create a repository for it?

dralley commented 1 year ago

I'm a bit lukewarm on having a separate crate, because I can't think of anything apart from an RPM parser which would want to parse RPM payloads. So it would be a separate crate that we would be the only users of, probably ever.

drahnr commented 1 year ago

I am mostly thinking operationally: applying upstream changes would be as easy as a git rebase or merge. I couldn't care less if we stay the only user if it simplifies the maintenance.

dralley commented 1 year ago

I don't think there will be any maintenance. The library is "finished" and hasn't seen any commits in a year. CPIO is very simple, so there are unlikely to be any bugs.

drahnr commented 10 months ago

We haven't reached a conclusion here. My preference is still to fork to rpm-rs/cpio-rpm and use that.

dralley commented 10 months ago

I still have the opposite preference, tbh :man_shrugging:. It's very difficult for me to imagine the supposed maintenance benefit repaying itself against having a separate crate which nobody but this particular library will ever use.

Since the new payload format removes nearly all of the metadata from the archive (because it's duplicated in the RPM header), you can do very little with the payload without also reading the RPM header. So the obvious thing to do is for us to just provide an API for that directly from this crate, since it would be pretty much the only useful way to use that code.

There has been another development since we last had this discussion: RPMv6 plans to use only the "new" payload scheme, so it won't be relegated to just packages with files > 4 GB anymore. It will eventually be used for all packages.

That is mentioned under the "Payload" section here: https://github.com/rpm-software-management/rpm/discussions/2374
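
To illustrate why the payload is of little use without the RPM header, here is a rough sketch of what such a combined lookup might look like. Everything in it is an assumption made for illustration: `FileEntry` is a hypothetical stand-in for the per-file metadata in the RPM header, `slice_file` is a hypothetical helper, and the sketch assumes the payload index maps directly onto the position in the header's file list and that `data` starts at the first byte of file content.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for per-file metadata stored in the RPM header.
#[derive(Debug, PartialEq)]
struct FileEntry {
    path: String,
    size: u64,
}

/// Hypothetical helper: given the header's file list, the index read from a
/// stripped payload entry, and the bytes that follow that entry's header,
/// returns the file's path and its content.
///
/// The payload entry carries no name and no size, so both come from `files`.
fn slice_file<'h, 'p>(
    files: &'h [FileEntry],
    index: u32,
    data: &'p [u8],
) -> Option<(&'h str, &'p [u8])> {
    // An index outside the header's file list cannot be resolved.
    let entry = files.get(usize::try_from(index).ok()?)?;
    // The content length is known only from the header's size field.
    let len = usize::try_from(entry.size).ok()?;
    // `get` returns None when fewer than `len` bytes are available.
    let content = data.get(..len)?;
    Some((entry.path.as_str(), content))
}
```

Without the `files` slice there would be no way to know where one file's content ends and the next entry begins, which is the argument for exposing this from the crate that already parses the header.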

rpm-rs / rpm

Drop "cpio" libraries and write something semi-custom, because RPM doesn't use vanilla CPIO #108