osbuild does not produce images with populated dnf state database

osbuild / osbuild

Build-Pipelines for Operating System Artifacts

https://www.osbuild.org

Apache License 2.0

211 stars 113 forks source link

osbuild does not produce images with populated dnf state database #455

Open Conan-Kudo opened 4 years ago

Conan-Kudo commented 4 years ago

Since #328, osbuild has split its software installation into two stages: sources for fetching content using DNF, and rpm for installing them into the target image environment. This would be fine, except... the rpm stage doesn't use DNF to install.

This is actually a problem, since it means that the generated images now lack the DNF state database information that is used later for providing information to make intelligent decisions with the system software in future transactions. For example, the lack of any state information means that dnf autoremove is fundamentally broken and will always do the wrong thing, since we don't have packages installed via DNF so that things are marked as user-installed or dep-installed accordingly.

Additionally, if modular content is installed this way, we now have a situation where DNF is broken in the target image because the failsafe mechanism that was requested for RHEL modules will cause DNF to choke since there will be a situation where you have "modular" packages installed without the corresponding module metadata.

Of course, if you're producing images with no package manager, then this isn't a problem. Or if you aren't using modular content, then the damage is limited. But if you're building custom RHEL 8 images, then this is a problem.

Now, reading back through the history of why this happened, it looks like the goal was to avoid requiring network access for the build stages, presumably to provide a mechanism in which all the inputs could be archived and replayed to generate the same image reproducibly. This is definitely an admirable goal.

My suggestion would be to do the following:

At the sources stage, depsolve for the content you need, and fetch it. Then generate a rpm-md repository.
At the package install stage, configure DNF to use that particular local offline repository you've made (with module_hotfixes=1 so modular packages install), and use it to install software as requested, rather than taking the pile of RPMs and doing the installation by hand.

This strategy is actually how offline appliance-tools/livecd-tools and kiwi image builds are often done. You can just make that process automatic with osbuild.

Now this doesn't solve all the problems, since there's still the pesky issue of dealing with modular packages. One possible option would be to reposync the module out and merge that into your local repository's metadata. That would allow it to function the same way it does on a normal system, and have the correct tracking information so that the package manager works properly.

I'm open to ideas here, but the current way osbuild installs software into an image leads to images that potentially won't work as users expect them to.

teg commented 4 years ago

Thanks for trying out osbuild and for providing feedback.

First a minor correction: we split the regular DNF transaction in three: depsolving (libdnf), fetching (curl/librepo) and installing (rpm). osbuild handles the latter two, but depsolving is done externally in order to produce the manifest. The main reason for this is that image building should be deterministic, so we need to pin the content hashes of all our inputs.

dnf autoremove is fundamentally broken and will always do the wrong thing

We discussed this with the dnf team, and our understanding is that the current behaviour is arguably correct. The argument could also be made that we should explicitly mark some packages, I'd be happy to discuss that.

I certainly do not agree that the wrong thing will always happen. dnf autoremove does not remove any packages from a fresh image. But if you install a package manually and then remove it again, the newly installed dependencies will be removed too (though any packages part of the initial image will not).

Do you have some examples of behaviour you think shows that dnf autoremove is fundamentally broken on an osbuild-created image compared to one of the official RHEL/Fedora images?

the current way osbuild installs software into an image is justifiably insane

Let's not get carried away.

Conan-Kudo commented 4 years ago

Thanks for trying out osbuild and for providing feedback.

First a minor correction: we split the regular DNF transaction in three: depsolving (libdnf), fetching (curl/librepo) and installing (rpm). osbuild handles the latter two, but depsolving is done externally in order to produce the manifest. The main reason for this is that image building should be deterministic, so we need to pin the content hashes of all our inputs.

That seems flawed in practice. It only works as long as all the content you used always remains available. Within the Red Hat ecosystem, this isn't true on Fedora or CentOS. It's technically not true on RHEL either if you work with the default repositories. The SUSE ecosystem is a bit better with how they handle service pack/point release updates for SLE and openSUSE Leap, but this still eventually becomes a problem there. And of course openSUSE Tumbleweed is rolling, so...

dnf autoremove is fundamentally broken and will always do the wrong thing

We discussed this with the dnf team, and our understanding is that the current behaviour is arguably correct. The argument could also be made that we should explicitly mark some packages, I'd be happy to discuss that.

I certainly do not agree that the wrong thing will always happen. dnf autoremove does not remove any packages from a fresh image. But if you install a package manually and then remove it again, the newly installed dependencies will be removed too (though any packages part of the initial image will not).

Do you have some examples of behaviour you think shows that dnf autoremove is fundamentally broken on an osbuild-created image compared to one of the official RHEL/Fedora images?

So, there's a few issues with this: if packages were explicitly requested by the user (which could be a library package that also ships a tool, since that's common in RH/Fedora), then if another application is uninstalled that required it and DNF considered with no other things requiring it, it gets removed.

This has very real consequences. Packages like libcap fall into this bucket and can be autoremoved and break things.

If you're not willing to use DNF in offline mode to install the requested packages to populate the information correctly, you should at least use dnf mark to simulate the correct setup and mark the user-installed and dep-installed content properly. That will require a bit more work to make sure you figure out what to mark, but it's doable.

the current way osbuild installs software into an image is justifiably insane

Let's not get carried away.

Sorry, I'm just frustrated. This is one of these things that I do a lot of work in, both professionally and personally, and I've explored more than my fair share of tools and methods on doing it. I expected that osbuild would wind up doing this better than lorax did (which I personally disliked because the idea of using an installer for building images just adds a huge new dimension of problems, which thankfully other people finally noticed...).

I also note, we still don't have an answer here for modules...

dvdhrm commented 4 years ago

That seems flawed in practice. It only works as long as all the content you used always remains available. Within the Red Hat ecosystem, this isn't true on Fedora or CentOS. It's technically not true on RHEL either if you work with the default repositories. The SUSE ecosystem is a bit better with how they handle service pack/point release updates for SLE and openSUSE Leap, but this still eventually becomes a problem there. And of course openSUSE Tumbleweed is rolling, so...

There is no requirement for osbuild manifests to be valid for longer than necessary. The content-addressed model is used to provide strong guarantees on what data ends up in an image. It is a communication object between osbuild-manifest creators (e.g., osbuild-composer) and the osbuild pipeline engine. The fact that such manifests will be outdated (or have unavailable sources) at one point does not negate their applicability. Obviously, without the updates repository and with just the release repositories the osbuild manifests can be used for much longer. But I do not see why short-lived manifests lead to issues. Manifests are, more often than not, generated on-demand and have no long lifetime whatsoever.

Can you elaborate why you think this is "flawed in practice"?

If you're not willing to use DNF in offline mode to install the requested packages to populate the information correctly, you should at least use dnf mark to simulate the correct setup and mark the user-installed and dep-installed content properly. That will require a bit more work to make sure you figure out what to mark, but it's doable.

This does not really respond to the situation Tom described, which is that we were told all packages are considered user installed if no DNF metadata is generated. If that is not true, please elaborate.

There is an argument to be made in favor of only marking a selected set of initial packages as user installed. We are aware of that, and we can easily do that by making dnf-json (in osbuild-composer) annotate the RPMs and then add a dnf mark stage to the resulting manifest (quoting Tom: "I'd be happy to discuss that.").

I would certainly be interested in a concrete example were the current model of osbuild fails.

I also note, we still don't have an answer here for modules...

Can you elaborate which particular problems you see?

You mentioned the failsafe mechanism, but we only use the default modules (and none of these have skip_if_unavailable set, right?). Therefore, the failsafe mechanism would only be required if someone explicitly removes the default repositories (to my knowledge, this is not a supported use-case). Once we allow selecting other modules, we will need additional stages. These will use dnf to enable particular repositories, and these will be required to copy the module-metadata into the dnf-database to guarantee it's available when the repository vanishes for whatever reason.

Similar to the dnf mark issue, I would be very happy if you can provide concrete examples where the current model fails.

Conan-Kudo commented 4 years ago

That seems flawed in practice. It only works as long as all the content you used always remains available. Within the Red Hat ecosystem, this isn't true on Fedora or CentOS. It's technically not true on RHEL either if you work with the default repositories. The SUSE ecosystem is a bit better with how they handle service pack/point release updates for SLE and openSUSE Leap, but this still eventually becomes a problem there. And of course openSUSE Tumbleweed is rolling, so...

There is no requirement for osbuild manifests to be valid for longer than necessary. The content-addressed model is used to provide strong guarantees on what data ends up in an image. It is a communication object between osbuild-manifest creators (e.g., osbuild-composer) and the osbuild pipeline engine. The fact that such manifests will be outdated (or have unavailable sources) at one point does not negate their applicability. Obviously, without the updates repository and with just the release repositories the osbuild manifests can be used for much longer. But I do not see why short-lived manifests lead to issues. Manifests are, more often than not, generated on-demand and have no long lifetime whatsoever.

Can you elaborate why you think this is "flawed in practice"?

If manifests are not useful beyond the build process, there is no point in generating them. Full stop. Your existing set of inputs for your build model implies that it's possible to make reproducible image builds. However, you are (correctly) saying that this is functionally impossible in this ticket.

The way your inputs work essentially mislead users into thinking it's capable of more than it actually is. If you do not intend to support enforced version locking with reproducible inputs, then don't include a way to make people think that you can do it. Your thought process about manifests is completely the opposite of how every other system treats them, and so should not exist.

If you're not willing to use DNF in offline mode to install the requested packages to populate the information correctly, you should at least use dnf mark to simulate the correct setup and mark the user-installed and dep-installed content properly. That will require a bit more work to make sure you figure out what to mark, but it's doable.

This does not really respond to the situation Tom described, which is that we were told all packages are considered user installed if no DNF metadata is generated. If that is not true, please elaborate.

There is an argument to be made in favor of only marking a selected set of initial packages as user installed. We are aware of that, and we can easily do that by making dnf-json (in osbuild-composer) annotate the RPMs and then add a dnf mark stage to the resulting manifest (quoting Tom: "I'd be happy to discuss that.").

This is true up to a point. However, the behavior for dnf autoremove is wonky when the DNF database isn't populated, and users have historically complained about leaves being unexpectedly removed because of this in the past with PackageKit. That's why we try to make sure the DNF database is correctly populated with Lorax, LiveCD Tools, KIWI, and other image building tools.

I would certainly be interested in a concrete example were the current model of osbuild fails.

I also note, we still don't have an answer here for modules...

Can you elaborate which particular problems you see?

You mentioned the failsafe mechanism, but we only use the default modules (and none of these have skip_if_unavailable set, right?). Therefore, the failsafe mechanism would only be required if someone explicitly removes the default repositories (to my knowledge, this is not a supported use-case). Once we allow selecting other modules, we will need additional stages. These will use dnf to enable particular repositories, and these will be required to copy the module-metadata into the dnf-database to guarantee it's available when the repository vanishes for whatever reason.

Similar to the dnf mark issue, I would be very happy if you can provide concrete examples where the current model fails.

My professional interest in OSBuild is only insofar in that I expect it to support modularity properly. My personal interest is in OSBuild to simplify the Fedora image building processes. In both cases, I need both default and non-default modules to work properly for image builds. And that the resulting images aren't fundamentally broken. Right now, it would be a bad idea to use OSBuild even with default modules, because the resulting image is completely broken for ongoing usage.

Because you install software in the wrong way with OSBuild, there is no way I can trust that my image is any good for production use. If I apply configuration management to a long-running instance from this image, or if I provision a bare metal system from an image built by this system, I would expect package management to work. That will definitely not be the case with RHEL, and may not be the case with Fedora.

teg commented 4 years ago

If manifests are not useful beyond the build process, there is no point in generating them. Full stop.

Just because you have not understood something does not mean there is no possible reason for it to exist. So at the very least statements like this comes across as overconfident, and takes away from the rest of what you have to say.

I think the discussion would be more productive if you could point to practical problems you have found, ideally with instructions on how to reproduce them.

Conan-Kudo commented 4 years ago

If manifests are not useful beyond the build process, there is no point in generating them. Full stop.

Just because you have not understood something does not mean there is no possible reason for it to exist. So at the very least statements like this comes across as overconfident, and takes away from the rest of what you have to say.

I understand the value of intermediate artifacts, but I do not believe it makes any sense to expose them to people like you want to. The confusion that it will cause was one thing I did point out, and you don't seem to have an answer for that.

teg commented 4 years ago

If manifests are not useful beyond the build process, there is no point in generating them. Full stop.

Just because you have not understood something does not mean there is no possible reason for it to exist. So at the very least statements like this comes across as overconfident, and takes away from the rest of what you have to say.

I understand the value of intermediate artifacts, but I do not believe it makes any sense to expose them to people like you want to. The confusion that it will cause was one thing I did point out, and you don't seem to have an answer for that.

You are right that the potential for confusion is something we must be aware of. In particular when/if these things are exposed in high-level tools.

I'd be happy to discuss high-level design decisions like that, but I don't think this is the right forum.

I am much more interested in your expertise on modularity and any issues you can actually point to there. We think we have our bases covered, but issues with reproducers would be very greatly appreciated.

Conan-Kudo commented 4 years ago

I'd be happy to discuss high-level design decisions like that, but I don't think this is the right forum.

You have no other forum, so this seems pretty difficult for me to act on.

teg commented 4 years ago

I'd be happy to discuss high-level design decisions like that, but I don't think this is the right forum.

You have no other forum, so this seems pretty difficult for me to act on.

Feel free to open dedicated issues :)

dvdhrm commented 4 years ago

There is no requirement for osbuild manifests to be valid for longer than necessary. The content-addressed model is used to provide strong guarantees on what data ends up in an image. It is a communication object between osbuild-manifest creators (e.g., osbuild-composer) and the osbuild pipeline engine. The fact that such manifests will be outdated (or have unavailable sources) at one point does not negate their applicability. Obviously, without the updates repository and with just the release repositories the osbuild manifests can be used for much longer. But I do not see why short-lived manifests lead to issues. Manifests are, more often than not, generated on-demand and have no long lifetime whatsoever. Can you elaborate why you think this is "flawed in practice"?

If manifests are not useful beyond the build process, there is no point in generating them. Full stop.

You come here, commenting on a public open-source project, and telling its maintainers that there is no point in the project they do, "Full stop.". I find this rude and appalling and do not appreciate conversation in that tone. If you do not want to listen to arguments from our side ("Full stop."), this argument becomes tedious.

Your existing set of inputs for your build model implies that it's possible to make reproducible image builds. However, you are (correctly) saying that this is functionally impossible in this ticket.

I did not say that. The osbuild engine can build all kinds of artifacts, and is not limited to Fedora release images. The fact that Fedora update repositories are ephemeral is a restriction of Fedora, not of osbuild. Secondly, and I repeat myself, reproducibility does not necessarily imply infinite availability. The content-addressed manifest allows us to reason about image-builds simply based on the content of the manifest. It allows us to distribute image-builds without the need to verify signatures on each build machine. It allows us to cache intermediate artifacts without sacrificing coherency.

And, again, osbuild is designed to allow building more artifacts than just Fedora images (it is not even limited to OS Images).

The way your inputs work essentially mislead users into thinking it's capable of more than it actually is. If you do not intend to support enforced version locking with reproducible inputs, then don't include a way to make people think that you can do it.

We do intend to support "enforced version locking".

Your thought process about manifests is completely the opposite of how every other system treats them, and so should not exist.

I joined this project because it does not align with the status quo, because it tries something new. I appreciate that. I enjoy thinking out of the box, denying the ordinary, walking where others refuse to go.

I completely disagree with the sentiment of your statement.

This does not really respond to the situation Tom described, which is that we were told all packages are considered user installed if no DNF metadata is generated. If that is not true, please elaborate. There is an argument to be made in favor of only marking a selected set of initial packages as user installed. We are aware of that, and we can easily do that by making dnf-json (in osbuild-composer) annotate the RPMs and then add a dnf mark stage to the resulting manifest (quoting Tom: "I'd be happy to discuss that.").

This is true up to a point. However, the behavior for dnf autoremove is wonky when the DNF database isn't populated, and users have historically complained about leaves being unexpectedly removed because of this in the past with PackageKit. That's why we try to make sure the DNF database is correctly populated with Lorax, LiveCD Tools, KIWI, and other image building tools.

I am sorry, but this is quite vague. How am I supposed to test a failing dnf database, if I cannot reproduce one? I previously asked you, and I have to repeat: I would certainly be interested in a concrete example were the current model of osbuild fails.

My professional interest in OSBuild is only insofar in that I expect it to support modularity properly. My personal interest is in OSBuild to simplify the Fedora image building processes. In both cases, I need both default and non-default modules to work properly for image builds. And that the resulting images aren't fundamentally broken. Right now, it would be a bad idea to use OSBuild even with default modules, because the resulting image is completely broken for ongoing usage. Because you install software in the wrong way with OSBuild, there is no way I can trust that my image is any good for production use. If I apply configuration management to a long-running instance from this image, or if I provision a bare metal system from an image built by this system, I would expect package management to work. That will definitely not be the case with RHEL, and may not be the case with Fedora.

Can you state a single example were a current osbuild manifest with default modules is "completely broken for ongoing usage"?

You repeatedly claim complete brokenness and definite unfitness of osbuild, while lacking any concreteness in your descriptions. It makes it hard for me to take this seriously, and makes me wonder what your intention of this inquiry is. I would very much appreciate suggestions what parts to improve, and how. I would appreciate if you link to broken manifests, or broken builds. But if your feedback aims to call osbuild "completely broken", to shutdown arguments with "Full stop", and to assert dissidents "should not exist", then I fail to see value in this discussion.

Conan-Kudo commented 4 years ago

There is no requirement for osbuild manifests to be valid for longer than necessary. The content-addressed model is used to provide strong guarantees on what data ends up in an image. It is a communication object between osbuild-manifest creators (e.g., osbuild-composer) and the osbuild pipeline engine. The fact that such manifests will be outdated (or have unavailable sources) at one point does not negate their applicability. Obviously, without the updates repository and with just the release repositories the osbuild manifests can be used for much longer. But I do not see why short-lived manifests lead to issues. Manifests are, more often than not, generated on-demand and have no long lifetime whatsoever. Can you elaborate why you think this is "flawed in practice"?

If manifests are not useful beyond the build process, there is no point in generating them. Full stop.

You come here, commenting on a public open-source project, and telling its maintainers that there is no point in the project they do, "Full stop.". I find this rude and appalling and do not appreciate conversation in that tone. If you do not want to listen to arguments from our side ("Full stop."), this argument becomes tedious.

Your existing set of inputs for your build model implies that it's possible to make reproducible image builds. However, you are (correctly) saying that this is functionally impossible in this ticket.

I did not say that. The osbuild engine can build all kinds of artifacts, and is not limited to Fedora release images. The fact that Fedora update repositories are ephemeral is a restriction of Fedora, not of osbuild. Secondly, and I repeat myself, reproducibility does not necessarily imply infinite availability. The content-addressed manifest allows us to reason about image-builds simply based on the content of the manifest. It allows us to distribute image-builds without the need to verify signatures on each build machine. It allows us to cache intermediate artifacts without sacrificing coherency.

And, again, osbuild is designed to allow building more artifacts than just Fedora images (it is not even limited to OS Images).

The way your inputs work essentially mislead users into thinking it's capable of more than it actually is. If you do not intend to support enforced version locking with reproducible inputs, then don't include a way to make people think that you can do it.

We do intend to support "enforced version locking".

Your thought process about manifests is completely the opposite of how every other system treats them, and so should not exist.

I joined this project because it does not align with the status quo, because it tries something new. I appreciate that. I enjoy thinking out of the box, denying the ordinary, walking where others refuse to go.

I completely disagree with the sentiment of your statement.

If you are already intending to support them like lock files, then it's fine to have them. But your answers above seemed to indicate that you insist to generate lock files while you simultaneously know that they don't work the way people expect them. There's being different, and there's breaking people's expectations.

Also, it's not just Fedora where this doesn't work. Virtually all distributions have this problem, except for openSUSE Leap and SUSE Linux Enterprise, since those two don't have a rolling repository for the major version that is "reset" when a new point release is made.

This does not really respond to the situation Tom described, which is that we were told all packages are considered user installed if no DNF metadata is generated. If that is not true, please elaborate. There is an argument to be made in favor of only marking a selected set of initial packages as user installed. We are aware of that, and we can easily do that by making dnf-json (in osbuild-composer) annotate the RPMs and then add a dnf mark stage to the resulting manifest (quoting Tom: "I'd be happy to discuss that.").

This is true up to a point. However, the behavior for dnf autoremove is wonky when the DNF database isn't populated, and users have historically complained about leaves being unexpectedly removed because of this in the past with PackageKit. That's why we try to make sure the DNF database is correctly populated with Lorax, LiveCD Tools, KIWI, and other image building tools.

I am sorry, but this is quite vague. How am I supposed to test a failing dnf database, if I cannot reproduce one? I previously asked you, and I have to repeat: I would certainly be interested in a concrete example were the current model of osbuild fails.

My professional interest in OSBuild is only insofar in that I expect it to support modularity properly. My personal interest is in OSBuild to simplify the Fedora image building processes. In both cases, I need both default and non-default modules to work properly for image builds. And that the resulting images aren't fundamentally broken. Right now, it would be a bad idea to use OSBuild even with default modules, because the resulting image is completely broken for ongoing usage. Because you install software in the wrong way with OSBuild, there is no way I can trust that my image is any good for production use. If I apply configuration management to a long-running instance from this image, or if I provision a bare metal system from an image built by this system, I would expect package management to work. That will definitely not be the case with RHEL, and may not be the case with Fedora.

Can you state a single example were a current osbuild manifest with default modules is "completely broken for ongoing usage"?

What is a "current osbuild manifest"? The ones in your samples? Your samples are fine, and cockpit-composer doesn't expose the ability to install modular content, so you can't hit this problem in either one. Your manifest format is quite complex and hand-crafting one to expose the problem is not straightforward. I can trivially do it with a shell script that emulates osbuild behavior, but actually making the manifest is quite painful.

You repeatedly claim complete brokenness and definite unfitness of osbuild, while lacking any concreteness in your descriptions. It makes it hard for me to take this seriously, and makes me wonder what your intention of this inquiry is. I would very much appreciate suggestions what parts to improve, and how. I would appreciate if you link to broken manifests, or broken builds. But if your feedback aims to call osbuild "completely broken", to shutdown arguments with "Full stop", and to assert dissidents "should not exist", then I fail to see value in this discussion.

Look, osbuild doing something different is interesting. But that doesn't mean you should ignore the real-world usage requirements. Nor should you ignore the realities of the environment you're working in. The problem with osbuild is that it's actually a great concept, but some of the details just aren't handled right.

teg commented 4 years ago

If you are already intending to support them like lock files, then it's fine to have them. But your answers above seemed to indicate that you insist to generate lock files while you simultaneously know that they don't work the way people expect them.

I suggest opening up a separate issue if you want to discuss this. Though I'm struggling to see where you are coming from here. It is true that being able to always rebuild manifests would be nice, it is also true that in many cases where we would like that, it is currently not possible. However, that is not the reason we have manifests, and I don't understand what practical problem their existence pose to you.

If you see a way to improve on this without breaking the properties we currently have and rely on, I'd be interested in hearing more about it.

The problem with osbuild is that it's actually a great concept, but some of the details just aren't handled right.

If you open up separate issues for each of the concerns you have I think that would lead to a better discussion. Though bear in mind that we have many considerations to bear in mind, so it is unlikely we will be able to give you exactly what you expect and no features you don't care about.

cgwalters commented 2 years ago

Just for reference, rpm-ostree has a different model for "user installed" type data, xref https://blog.verbum.org/2020/08/22/immutable-%E2%86%92-reprovisionable-anti-hysteresis/ So this bug won't apply for osbuild generating rpm-ostree builds.

Also, rpm-ostree has had a lockfile implementation since this commit (migrated into Rust since then) which I think duplicates the osbuild locking.

Also, it's not just Fedora where this doesn't work. Virtually all distributions have this problem,

In Fedora CoreOS, we ship using lock files, and we added the "archive" repository for exactly this reason. I think there's been some discussion about expanding it beyond FCOS (because really, having exactly one version on mirrors makes no sense in a world of object stores and CDNs).

Conan-Kudo commented 2 years ago

Right, so generally the lockfile in either osbuild or rpm-ostree is useless if you don't have something like the archive repository. And archive repositories are not going to be common because that requires a lot of money to maintain, which is unreasonable for most distributions or people to expect to have.

supakeen commented 1 year ago

It's been a while but I'll be diving into marking packages in the DNF state database as this needs to be resolved for the Fedora installer(s).

I'll likely be marking user requested packages (the top level packages in packageSet and blueprint-requested packages) as user-installed; all the rest as dependency but I'll be reading up on dnf mark for a bit first.

It also seems that modularity is proposed for removal in f39 which might simplify some things down the road.

supakeen commented 1 year ago

Sorry, this comment was blatantly wrong (if you got an email about it); I was mixing up VM images before being sufficiently caffeinated.

Comment used to say that there are no user-marked packages on Fedora VMs/ISOs but in fact all kickstart-selected (and I believe anaconda and lorax-selected) packages are user-marked. This implies that we will be marking all top-level requested packages in either packageSet or blueprints as user.

Still figuring out groups.

supakeen commented 1 year ago

This is a nice rabbit hole. The current approach in https://github.com/osbuild/images/pull/28 is likely not very nice. It'd be much better if dnf-json could report directly what the package marks should be (called reason in libdnf). I got quite far but for groups hit the following: https://github.com/rpm-software-management/libdnf/issues/1608

bcl commented 1 year ago

I wonder if it's the right thing to do. Originally the built images weren't expected to be upgraded or extended so it didn't matter if the system was installed using rpm. It seem to me that if we want it to have a valid dnf database then maybe we need to use dnf to install things.

supakeen commented 1 year ago

I believe there are (good) historical reasons for depsolving into a set of RPMs first and then installing those RPMs. Though I have also thought about 'why not use dnf'. @thozza pointed out that there might be issues with comps if we go the route of:

depsolve
build repo with depsolved rpms
install with dnf from this repo

And I personally would be wary of 3. deciding it needs something else than 1. as that DNF is running inside the buildroot and might have different ideas about things?

supakeen commented 1 year ago

Ok, spitballing for a second here. Perhaps there's the possibility to serialize the exact inputs to the transaction during our depsolve steps and repeat those in the buildroot but with the actual run() of the transaction. This should always succeed and would mark the packages correctly?

Conan-Kudo commented 1 year ago

You could yoink the comps data and re-append it to your local rpm-repo. I think a few years ago I suggested doing this instead of the dnf -> rpm stage thing you do now:

dnf download all the things
grab the comps and modulemd files
createrepo locally, append comps and modulemd files
dnf install all the things

There are benefits to this approach:

You can introduce a process for archiving the complete inputs and avoiding redownload if unneeded or unwanted
You can introduce a process for generating delta data of inputs
You have much cheaper caching processes for re-running image builds

thozza commented 1 year ago

Note that everything needed to build the image must be pinned down in the osbuild manifest. While we can download RPMs from a repo when these are pinned using their hash even if the repo metadata changed, we would IMO not be able to do the same thing with the metadata and comps files. We could pin them down as well and if they changed, this would fail the image build. Another option would be to embed them in some form as inline files in the manifest itself, which may make it quite big.

Conan-Kudo commented 1 year ago

The lock file isn't terribly useful, but if you care about it, then extending it to have the hashes of the downloaded metadata files makes sense.

supakeen commented 8 months ago

Putting @richm's post here in full since it's also related:

For our use case, we need to have an option to keep dnf/yum metadata when installing packages. Ansible system roles integration testing uses Standard Test Interface https://docs.fedoraproject.org/en-US/ci/standard-test-interface/ - specifically - https://docs.fedoraproject.org/en-US/ci/standard-test-roles/ - this means (ideally) running every test playbook against a clean VM. But this leads to performance issues - while startup/teardown of the VM has gotten faster, the most time consuming aspect of a system roles run is the package download and installation, especially large packages, and things like kernel modules (looking at you storage kmod-vdo). We can use osbuild to build these images with the packages - but - every time the test goes to use the Ansible package module to install the packages, it checks the metadata, and has to rebuild the metadata, which takes so long as to defeat the original purpose.

So if we could build images with all of the packages and metadata, we could speed up our QE considerably.

achilleas-k commented 1 month ago

We can use osbuild to build these images with the packages - but - every time the test goes to use the Ansible package module to install the packages, it checks the metadata, and has to rebuild the metadata, which takes so long as to defeat the original purpose.

This makes me wonder if it really is the same issue.

I'm not 100% clear on what's happening with the ansible module, so maybe @richm can check my assumptions here, but the way the issue is described makes me think that there's an ansible step that's doing (the equivalent of) dnf install <some packages> and the thing that's taking too long isn't the building of the package metadata but the repository metadata. As I understand it, the ansible module is run to install some packages and the hope is that the packages are preinstalled and no action should be required, but it first needs to download all the repo metadata.

For example, on a fresh system built by osbuild, running dnf list installed should take less than a second. Creating /var/lib/dnf/history.sqlite from local data isn't the issue here. The issue is that if you run dnf install <packages that are already installed>, it will take some time to download all the repo metadata before returning with "Nothing to do".

It sounds to me that even with the dnf metadata, without repo metadata (or with repo metadata older than the metadata_expire value of dnf.conf(5)), the problem will persist.

richm commented 1 month ago

As I understand it, the ansible module is run to install some packages and the hope is that the packages are preinstalled and no action should be required, but it first needs to download all the repo metadata.

I believe this is how the ansible module is working. Which means it isn't an Ansible or Ansible dnf module problem, it is a dnf install problem.

So is the issue that osbuild does not use the repo metadata at all during package installation? Or does it use the repo metadata, but removes it?

supakeen commented 1 month ago

As I understand it, the ansible module is run to install some packages and the hope is that the packages are preinstalled and no action should be required, but it first needs to download all the repo metadata.

I believe this is how the ansible module is working. Which means it isn't an Ansible or Ansible dnf module problem, it is a dnf install problem.

So is the issue that osbuild does not use the repo metadata at all during package installation? Or does it use the repo metadata, but removes it?

It is currently not used at all. I'm working on some modularity related work recently which might mean I'll revisit the idea of having metadata available during and after build time again (and it might lead to this issues being solved).

achilleas-k commented 1 month ago

So is the issue that osbuild does not use the repo metadata at all during package installation? Or does it use the repo metadata, but removes it?

Package fetching and installation happens outside the operating system tree that's being built, so repository metadata is only available on the host that's generating the manifest (as a side effect of the depsolve). It would be technically possible to seed the new system with repository metadata, but I think that would be a strange choice. It would essentially be pre-loading caches on a cold system, caches that have a relatively short expiration even, meaning you'd probably want to refresh them on first boot anyway.

achilleas-k commented 1 month ago

The rpm repo metadata cache discussion is quite off topic from the original issue here however. Perhaps we should move this discussion back to the original issue if we want to keep talking about it.