pulp / pulp_rpm

RPM support for Pulp Platform
https://pulpproject.org/pulp_rpm/
GNU General Public License v2.0

RPM Sync Issue - Duplicate content #2271

Open pulpbot opened 2 years ago

pulpbot commented 2 years ago

Author: wibbit (wibbit)

Redmine Issue: 8619, https://pulp.plan.io/issues/8619


I've hit an odd sync issue with rpm (https://pulp.plan.io/issues/8615).

This is syncing against a pulp2 repo that is populated using pulp-admin's upload facility. I think I may know the cause, though this is conjecture.

Speaking with ttereshc, it's been confirmed that the upload command has the same issue as copying content between repositories: neither performs any kind of de-duplication where the NEVRA is the same but the hash differs.

I worked around this issue by creating a "dummy" repository, copying the content into it, and then configuring the original repository to use the dummy repository as its feed and syncing it; this engages the de-duplication logic.

A subsequent sync of this repository from Pulp3 worked cleanly.

It strikes me that Pulp3 should probably have been able to deal with this gracefully; I'm not familiar enough with the sync logic to understand where the core problem was.

dralley commented 2 years ago

The sync process should be handling this https://github.com/pulp/pulp_rpm/pull/2668

However, I can't guarantee the same problem doesn't exist with upload, because we currently take location_href into consideration (for hacky reasons) and we should not.

related: https://github.com/pulp/pulp_rpm/pull/2586

ggainey commented 1 year ago

@dralley What problem do we think needs investigation/solving here? Is it "make sure uploading same NEVRA, diff checksum, into a repo replaces the existing one with the incoming one"?

dralley commented 1 year ago

@ggainey I think probably step 1 will be to fix the repository uniqueness constraint to exclude the location

There was some discussion about that here: https://github.com/pulp/pulp_rpm/pull/2586

Then the uniqueness constraint would just be the NEVRA. But then the issue is that the next steps may be a bit difficult. We could have semantics such that the latest-added package always replaces the older ones, but then a package with an older build time would replace ones with newer build times. Or we could keep the one with the newest build time, but then adding a package could result in a no-op. I suppose that's possible anyway if you add an already existing package.

So maybe our uniqueness constraint should be NEVRA + build time, so that at least you don't get duplicate pkgid packages, but you can still add them. I don't know that I like that much better.

So, no "perfect" solutions. Thoughts?
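A minimal sketch of the two replacement semantics under discussion (illustrative model only, using hypothetical helper names, not pulp_rpm's actual code):

```python
# Sketch: two possible semantics when adding a package to a repo whose
# uniqueness constraint is NEVRA only. Names are illustrative, not pulp_rpm API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Package:
    nevra: str        # name-epoch:version-release.arch
    pkgid: str        # checksum of the package file
    build_time: int   # rpm build timestamp

def add_latest_wins(repo: dict, pkg: Package) -> None:
    """The incoming package always replaces an existing one with the same NEVRA."""
    repo[pkg.nevra] = pkg

def add_newest_build_wins(repo: dict, pkg: Package) -> None:
    """Keep whichever copy has the newer build_time; adding may be a no-op."""
    existing = repo.get(pkg.nevra)
    if existing is None or pkg.build_time > existing.build_time:
        repo[pkg.nevra] = pkg

old = Package("foo-0:1.0-1.x86_64", "aaa", build_time=200)
new = Package("foo-0:1.0-1.x86_64", "bbb", build_time=100)  # rebuilt "earlier"

r1 = {}
add_latest_wins(r1, old)
add_latest_wins(r1, new)        # incoming wins despite the older build_time

r2 = {}
add_newest_build_wins(r2, old)
add_newest_build_wins(r2, new)  # no-op: the existing copy has the newer build_time
```

The sketch shows the trade-off described above: "latest added wins" can replace a newer build with an older one, while "newest build_time wins" can silently turn an add into a no-op.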

ggainey commented 1 year ago

So, no "perfect" solutions. Thoughts?

I'm always hesitant to rely on build_time, because "when it was built" can have very little to do with "which one do we want" when populating repositories. "Today's build was just Wrong, I want to go back to Yesterday's Build" is def a thing that happens.

My thought on upload is to assume the user knows what they're doing - if you're pushing this RPM into this repo right now, it's because you want this RPM, not whatever might be in there already with the same NEVRA. So "incoming" always wins. The one being replaced is still in the Pulp instance, in the previous version, so the user can always recover it.
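The "incoming always wins" semantics can be sketched as follows (an illustrative model of repository versions, not Pulp's actual implementation):

```python
# Sketch of "incoming always wins": each add creates a new repository version,
# so the replaced package stays recoverable from the previous version.
# Illustrative model only, not pulp_rpm code.

def add_package(versions: list, nevra: str, pkgid: str) -> None:
    """Append a new repo version in which `pkgid` replaces any same-NEVRA package."""
    current = dict(versions[-1]) if versions else {}
    current[nevra] = pkgid  # incoming wins, whatever was there before
    versions.append(current)

versions = []
add_package(versions, "foo-0:1.0-1.x86_64", "aaa")
add_package(versions, "foo-0:1.0-1.x86_64", "bbb")  # same NEVRA, new checksum

latest = versions[-1]    # serves "bbb"
previous = versions[-2]  # still holds "aaa", so the user can recover it
```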

dralley commented 1 year ago

+1

praiskup commented 1 year ago

@ggainey wrote:

My thought on upload is to assume the user knows what they're doing [snip]. So "incoming" always wins. The one being replaced is still in the Pulp instance, in the previous version, so the user can always recover it.

I agree that the user knows best what to do. This situation is going to happen frequently in Copr.

A Copr build contains a set of RPMs. Copr will probably have to track what Pulp content a particular build created/uploaded (likely a set of package hrefs, rather than artifact hrefs).

The thing is, when we want to let our users (as they know best) remove the irrelevant duplicate builds (the sets of RPMs), we (Copr) have to tell Pulp to do something.

If the latest build is removed

Likely a command like:

Which is bad, because this action doesn't add the previous NEVRA variant into the repository. The NEVRA disappears. So we have to do yet another step:

This is quite painful because we (Copr) don't know which previous package href to re-add (the ordering is defined by Pulp). So from this perspective it looks like Pulp should remember all the previous NEVRAs and keep them associated with the latest repository version, even though only one of them eventually ends up in the repo metadata. Then, when whatever NEVRA duplicate is removed, the previous package could be automatically re-populated (completely managed by Pulp, not Copr).
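The two-step dance described above could look roughly like the following sketch of payloads for Pulp 3's repository `modify` endpoint (all hrefs are placeholders; whether Copr even knows the previous package href is exactly the problem):

```python
# Sketch of the two calls Copr would have to make via Pulp 3's
#   POST {base_url}/pulp/api/v3/repositories/rpm/rpm/{repo_uuid}/modify/
# endpoint. All hrefs below are placeholders, not real objects.

removed_build_pkg = "/pulp/api/v3/content/rpm/packages/1111/"   # latest duplicate
previous_nevra_pkg = "/pulp/api/v3/content/rpm/packages/0000/"  # shadowed copy

# Step 1: remove the deleted build's package. On its own this makes the
# NEVRA disappear from the repo entirely.
step1 = {"remove_content_units": [removed_build_pkg]}

# Step 2: re-add the previous same-NEVRA package -- but Copr does not
# actually know this href, since Pulp defines the ordering.
step2 = {"add_content_units": [previous_nevra_pkg]}
```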

When the user removes a build which is not the latest one

For Pulp it seems like a no-op (the build is not included in the last publication). Well actually, Copr should request at least the artifact removal. But we are not sure whether the artifact still exists (since it might not be referenced by any repository version, it might well have been removed by the orphan cleaner already).

This also IMO implies that Pulp should keep the artifacts (for old duplicate builds) protected against removal. It very likely also implies that duplicate RPMs stay assigned to the latest repository version (until they are explicitly removed as content). Then even the --autopublish --retain-repo-versions 1 combination (which Copr will likely use) would be OK.