rpm-software-management / createrepo_c

C implementation of the createrepo.
http://rpm-software-management.github.io/createrepo_c
GNU General Public License v2.0

proposal: Change package metadata #354

Open RishabhSaini opened 1 year ago

RishabhSaini commented 1 year ago

The proposal is contained in a hackmd document. cc @cgwalters

Kindly let us know your perspective on this @kontura and/or other owners of this repo

dralley commented 1 year ago

Can you give a general overview of how rpm-ostree is currently using the existing RPM metadata format, and how it is using its own special metadata (not specific to this particular change)? Is it equally dependent on RPM repo metadata, or is it independent, but you need a way to make it easier to produce from existing Fedora / RHEL / etc. repos without needing to coordinate standing up a set of separate services across all of those distros?

cgwalters commented 1 year ago

Can you give a general overview of how rpm-ostree is currently using the existing RPM metadata format,

rpm-ostree uses libdnf, the same as dnf.

and how it is using its own special metadata (not specific to this particular change)?

There is no special metadata today. We're talking about adding some new metadata about the rate of change of RPMs, which would not be used by clients by default. It'd be used by build tooling.

Just for the record, I'm copy-pasting from the hackmd below:


proposal: package change metadata

We're working on bodhi-scraper which is part of larger effort in rpm-ostree to optimize packing images.

We require a metadata file (frequencyUpdateInfo.json) in the repodata of every (Fedora/RHEL/SCOS/Kinoite) repository. This file will contain the list of all updates to all of the packages of the specific release.

This list of updates is more comprehensive than the one present in updateinfo.xml.

We then combine the frequencyUpdateInfo.json of all the current and pending releases and process it to create a file.

Since this file will be required for all rpm-ostree based Linux distributions, we wanted the architecture to integrate with createrepo_c to make the implementation more general.

Option 1: Inject this into primary.xml

Since this is a relatively small amount of additional metadata per package, we could add it to the primary package metadata. The downside is that primary.xml is already enormous.

Option 2: Add a new updatemeta.json

We could introduce a new metadata file (JSON, since this is the 2020s) that contains this metadata instead.

Note that either option implies freezing (at least the first version of) the data shipped.
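As a rough illustration of the "combine and process" step described in the proposal, here is a minimal Python sketch. The JSON layout (an `updates` list with `package` and `date` keys), the function names, and the sample data are all hypothetical; the thread does not define the real schema of frequencyUpdateInfo.json or how bodhi-scraper processes it.

```python
import json
from collections import Counter


def combine_update_counts(release_docs):
    """Merge per-release documents into one update count per package.

    Assumed (hypothetical) layout of each document:
    {"updates": [{"package": "kernel", "date": "2023-05-01"}, ...]}
    """
    counts = Counter()
    for doc in release_docs:
        for update in doc.get("updates", []):
            counts[update["package"]] += 1
    return counts


def relative_frequency(counts):
    """Scale counts so the most frequently updated package maps to 1.0."""
    if not counts:
        return {}
    top = max(counts.values())
    return {name: n / top for name, n in counts.items()}


# Made-up sample data for two releases.
release_a = json.loads(
    '{"updates": [{"package": "kernel", "date": "2023-05-01"},'
    ' {"package": "kernel", "date": "2023-05-15"},'
    ' {"package": "bash", "date": "2023-05-03"}]}'
)
release_b = json.loads(
    '{"updates": [{"package": "kernel", "date": "2023-11-02"},'
    ' {"package": "glibc", "date": "2023-11-09"}]}'
)

counts = combine_update_counts([release_a, release_b])
freq = relative_frequency(counts)
```

Normalizing against the most frequently updated package is only one possible choice; build tooling could just as well consume the raw counts.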

dralley commented 1 year ago

rpm-ostree uses libdnf, the same as dnf. There is no special metadata today.

I was expecting that ostree-specific metadata is involved somewhere in the chain, but if not, apologies.

We're talking about adding some new metadata about the rate of change of RPMs, which would not be used by clients by default. It'd be used by build tooling.

Not used by clients by "default", or not used at all?

cgwalters commented 1 year ago

I was expecting that ostree-specific metadata is involved somewhere in the chain, but if not, apologies.

rpm-ostree uses ostree by default; no rpm metadata at all is fetched. https://fedoraproject.org/wiki/Changes/OstreeNativeContainerStable is in the process of s/ostree/containers/.

Not used by clients by "default", or not used at all?

client = dnf here basically. rpm-md fetches are lazy (usually) - clients only fetch what they care about, except for mirroring. But we obviously intend to use this additional metadata for build tooling which generates container images (usually, server side).

dralley commented 1 year ago

rpm-ostree uses ostree by default; no rpm metadata at all is fetched.

Well, that is what I was asking :) Basically I'm just trying to figure out if it is orthogonal to the actual client concerns w/r/t RPM metadata (and you just want it to be present alongside the repo purely for helping out the build tooling) or if it's intertwined.

Because you can have the metadata at a specified place in the repo without necessarily having it be in repodata and registered in repomd.xml. Like kickstart trees are.
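To make the distinction concrete: a file counts as "registered" only if repomd.xml carries a `<data type="...">` entry for it. The Python sketch below lists the registered types from a repomd.xml document. The sample XML is trimmed down for illustration (real entries also carry checksums, sizes, and timestamps), and the `frequencyinfo` type name is a hypothetical placeholder, not an existing metadata type.

```python
import xml.etree.ElementTree as ET

# XML namespace used by repomd.xml.
REPO_NS = "{http://linux.duke.edu/metadata/repo}"


def registered_types(repomd_xml):
    """Return the 'type' attribute of every <data> entry in repomd.xml."""
    root = ET.fromstring(repomd_xml)
    return [data.get("type") for data in root.findall(f"{REPO_NS}data")]


# Trimmed-down sample for illustration only.
SAMPLE_REPOMD = """\
<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary">
    <location href="repodata/primary.xml.gz"/>
  </data>
  <data type="updateinfo">
    <location href="repodata/updateinfo.xml.gz"/>
  </data>
</repomd>
"""

types = registered_types(SAMPLE_REPOMD)
# "frequencyinfo" is a hypothetical type name; here it is not registered,
# so a tool would have to look for the file at a well-known path instead.
is_registered = "frequencyinfo" in types
```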

Also you may already know this but as RHEL doesn't use createrepo_c, they will be a "special snowflake" no matter what.

cgwalters commented 1 year ago

(and you just want it to be present alongside the repo purely for helping out the build tooling)

Basically this. It's data which is strongly associated with the set of RPMs, and having some sort of external data storage for it creates problems around "lifecycling" this data with the packages.

Because you can have the metadata at a specified place in the repo without necessarily having it be in repodata and registered in repomd.xml. Like kickstart trees are.

Hmm, true. That's definitely an option for PoC work here at least!

Also you may already know this but as RHEL doesn't use createrepo_c, they will be a "special snowflake" no matter what.

I didn't know that...exciting. What is it? Pulp?

cgwalters commented 1 year ago

@RishabhSaini I think basically what we can do for PoC work here is:

  1. Test out creating a copy of the fedora rpm-md repo or a subset even
  2. Inject the frequencyinfo.json file into that
  3. Teach rpm-ostree to try fetching it from the same location as the input repos
  4. Use it if it exists

Once that's done...I'm sure we could ask Fedora infra to try adding this data just manually...maybe have a process that pulls it from a git repo?

I think the part that requires the most code is step 3, but it shouldn't be too bad.
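Steps 3 and 4 above amount to an optional fetch: try to download the file from the repo's base URL and carry on without it if it is missing. The Python sketch below only illustrates that pattern and is not how rpm-ostree (which is built on libdnf) would implement it. The function name, the file's location directly under the base URL, and the sample content are assumptions.

```python
import json
import pathlib
import tempfile
import urllib.error
import urllib.request


def fetch_frequency_info(baseurl, filename="frequencyinfo.json"):
    """Return the parsed JSON file from the repo, or None if unavailable.

    Any fetch failure (missing file, HTTP error, unreachable host) is
    treated as "no frequency data", since the metadata is optional.
    """
    url = baseurl.rstrip("/") + "/" + filename
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.URLError:
        return None


# Usage with a local file:// repo, like the testrepo idea in this thread.
with tempfile.TemporaryDirectory() as tmp:
    repo = pathlib.Path(tmp)
    base = repo.as_uri()
    missing = fetch_frequency_info(base)  # file not there yet
    (repo / "frequencyinfo.json").write_text('{"kernel": 1.0}')
    found = fetch_frequency_info(base)    # file now present
```

Treating every fetch error as "not present" keeps builds working against repos that never ship the file, at the cost of hiding real network failures; a stricter variant could return None only for HTTP 404.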

RishabhSaini commented 1 year ago

Test out creating a copy of the fedora rpm-md repo or a subset even

Does this mean creating a new rpm with a zero-sized payload, whose only purpose is to carry the metadata (frequencyinfo.json) needed by rpm-ostree?
That rpm would then need to be published for rpm-ostree to consume.

cgwalters commented 1 year ago

rpm-md repositories are just regular files served by a (usually static) webserver. I wasn't thinking we'd make a new rpm, but literally just drop frequencyinfo.json alongside the other repodata files (e.g. the files in this path).

Maybe actually what would work best is to support finding the frequency information in a separate rpm-md repository too...then we could do e.g.:

$ mkdir testrepo
$ cd testrepo
$ echo '{ dummy frequency info }' > frequencyinfo.json
$ createrepo_c .

Then point rpm-ostree at it via a repo file like

[testrepo]
baseurl=file:///path/to/testrepo

etc.

RishabhSaini commented 1 year ago

Okay thanks for the help!

RishabhSaini commented 1 year ago

For easy reference, I will use the following names:

  - Name of yum repo: frequency.repo
  - Name of rpm-md repo containing frequencyinfo.json: frequencyRepo

As outlined in https://github.com/fedora-infra/bodhi/pull/5172, the repodata will still need to contain a more comprehensive list of updates than updateinfo.xml, called FrequencyUpdateInfoMetadata.json.

Then point rpm-ostree at it via a repo file like

To implement this, frequency.repo will need to be added to fcos-config, so that COSA's build scripts can add it to /etc/yum.repos.d when creating a new release of FCOS, making it searchable by rpm-ostree.

Will the frequencyRepo be hosted somewhere (github?) or just kept locally as a folder? How will updates to the repo work? When bodhi-scraper is done generating an updated version of updateinfo.json, the file in frequencyRepo will need to be replaced, and then createrepo_c needs to be run to update the repomd.xml and checksums. How and where will this workflow be handled?

j-mracek commented 1 year ago

I am really sorry, but I am simply lost here. Let me summarize what I understand. I understand that rpm-ostree is looking to resolve a problem, but it looks like the issue is related to building images/containers for rpm-ostree. I don't know whether the metadata will contain some unique or additional information that is not present in the RPMs, or only in the repo metadata. I don't know whether the proposal is resolving a performance issue or something else.

If the new metadata is only used internally, I would not recommend including it in the repo metadata. We have experience with module.yaml, which contains information that was not informative to end users at all; only infrastructure uses it, for internal purposes. Therefore I would like to avoid this if possible.

There is also a problem with propagating a new type of metadata on the user side. Again, we learned this with modules, where customers regenerate repositories in their workflows and the additional metadata gets dropped as a result.

Please don't take my note as a negative reaction. I just want to say that I do not have enough information, and I would like to share our experience with a new type of metadata that is essential for the distribution.

cgwalters commented 1 year ago

Thanks for the reply! The comparison with modulemd is actually quite apt.

In the case of this metadata, unlike modulemd it's not essential. Like modulemd, it really helps if it goes where the rpms go by default.

In the end, I think this data is going to be very small; it's just a historical relative frequency of package updates. We discussed trying to insert it into primary.xml, but that's slightly messy because it's not actually something that comes from the RPM headers.