pulp / pulp_deb

Debian repository plugin for Pulp (pulpproject.org)
GNU General Public License v2.0
61 stars 76 forks source link

Ensure per repository version structure content integrity #785

Open quba42 opened 1 year ago

quba42 commented 1 year ago

Unless all users do is sync, there are currently numerous ways to create repository versions that have inconsistent structure and metadata content.

Metadata content: Release

Structure content: ReleaseComponent, ReleaseArchitecture, PackageReleaseComponent

What do we mean by consistent?

The following rules are necessary and sufficient for a given repo version to be described as structurally consistent/sound:

  1. For each Package, there should be at least one PackageReleaseComponent referencing that package.
  2. For each PackageReleaseComponent the referenced ReleaseComponent should be present in the repo version.
  3. For each PackageReleaseComponent the referenced Package should be present in the repo version.
  4. For each ReleaseComponent there should be exactly one Release with the same distribution (not strictly required since Release just stores optional release file fields).
  5. For each ReleaseComponent there should be at least one PackageReleaseComponent referncing it.
  6. For each package with a given architecture associated with some Release via the above association chain, there should be a ReleaseArchitecture with that same distribution as the Release and the same architecture as the Package.
  7. For each Release, there should be at least one ReleaseComponent with the same distribution.

Example problems caused by "inconsistent" repo versions

https://github.com/pulp/pulp_deb/issues/777

Steps we are taking to ensure consistency

Ideas to improve consistency further

quba42 commented 1 year ago

Note: I am not actually sure about rule 5 above. I suppose it could be legal to have a component, that does not actually include any packages in its package indices. Not very useful, but probably legal.

sdherr commented 1 year ago

I don't agree with this one:

For each ReleaseComponentthere should be at least one PackageReleaseComponent referncing it.

It should be possible for the release and/or component to be defined but not yet have any content. As for validity checking I think it's a good idea but I'm not sure there's a perfect time to do it. You either have to check validity at the content modification time / repo version creation time (which is when I would prefer) which gives early feedback to the user but also forces them to make all their content modification changes at once in one request. Or you do it at Publication time, which makes publication an operation that can unexpectedly fail, which is not great.

But other then that I agree with everything here. I think ensuring validity is a good first step, but there's probably also some extra that could be done around automating validity. In particular I'm thinking about PackageReleaseComponent. I'm of the opinion that PackageReleaseComponent should never have been exposed over the API, and instead if it was necessary at all it should have just been an internal implementation detail.

Users never care about the existence / non-existence of a PackageReleaseComponent, they care about whether or not their package has been successfully added to / removed from the ReleaseComponent. The fact that another object has to be created to shoehorn this relationship into the pulpcore Content model is not really relevant to them. As you say above, it makes no sense to have a PRC if both the Package and the ReleaseComponent are not also in the RepoVersion.

We have wrapped our interaction with the pulp_deb API in a manager class who's only job is to ensure that our package adds / deletes also correctly create / clean up PRC objects. You could obsolete this class (and I think provide a more natural experience to other API users) if you added logic/API to allow people to ignore PRC objects entirely:

  1. If adding a Package to a ReleaseComponent, the PRC object is created automatically.
  2. If removing a Package from a ReleaseComponent, the PRC object is automatically removed.
  3. If removing a Package from a ReleaseComponent and there are no other PRC references to that package, the package itself is removed.

It may also make sense to extend that logic further, for example if the user removes a Release then also remove all ReleaseComponents, PackageReleaseComponents, and Packages that are not referenced elsewhere. I'm not convinced that's a good idea, because at that point there's no great way to "undo" the action if the user changes their mind. PackageReleaseComponents are safe to automate this way precisely because they are the bottom of the dependency graph, and adding a Package to a ReleaseComponent is the inverse of removing it.

quba42 commented 10 months ago

One concrete next step I would like to see is that removing a Package from a repo, will automatically remove any associated PackageReleaseComponents. A dangling PackageReleaseComponents is simply meaningless, and can cause orphan cleanup to fail. See #865

quba42 commented 10 months ago

Another thing we might consider: Provide a DB cleanup script to clean up dangling PackageReleaseComponents in existing repo versions. The idea is to provide a possibility to clean up if they are causing orphan cleanup to fail.