pulp / pulp_rpm

RPM support for Pulp Platform
https://pulpproject.org/pulp_rpm/
GNU General Public License v2.0
48 stars 124 forks source link

Pulp doesn't handle conflicting package filenames properly #2678

Open daviddavis opened 2 years ago

daviddavis commented 2 years ago

Version pulpcore 3.21.0.dev pulp-rpm 3.18.0.dev

Describe the bug If you add two packages with the same filename to a repo, then only one gets published/distributed (good). However, both show up in primary.xml (not so good) and the checksum can potentially not match (bad).

To Reproduce This is a bit of a contrived example but it demonstrates the problem. Basically you add two packages with the same filename to a repo and publish it.

API Commands wget https://fixtures.pulpproject.org/rpm-unsigned/bear-4.1-1.noarch.rpm wget https://fixtures.pulpproject.org/rpm-unsigned/camel-0.1-1.noarch.rpm http --form :24817/pulp/api/v3/content/rpm/packages/ file@bear-4.1-1.noarch.rpm relative_path=camel-0.1-1.noarch.rpm http --form :24817/pulp/api/v3/content/rpm/packages/ file@camel-0.1-1.noarch.rpm http :24817/pulp/api/v3/repositories/rpm/rpm/ name=test http :24817/pulp/api/v3/distributions/rpm/rpm/ name=test base_path=test repository=/pulp/api/v3/repositories/rpm/rpm/4d390079-60ff-4916-b80f-51b6756f2757/ http :24817/pulp/api/v3/repositories/rpm/rpm/4d390079-60ff-4916-b80f-51b6756f2757/modify/ add_content_units:='["/pulp/api/v3/content/rpm/packages/290113b4-0ca2-4bb4-8b5a-725d471d865a/", "/pulp/api/v3/content/rpm/packages/8df08649-2272-4006-8270-eee1d453edd2/"]' http :24817/pulp/api/v3/publications/rpm/rpm/ repository=/pulp/api/v3/repositories/rpm/rpm/4d390079-60ff-4916-b80f-51b6756f2757/

Expected behavior When I view the packages I see only one (camel) which is expected:

http :24816/pulp/content/test/Packages/c/

When I get its checksum, I get the checksum for the package named formerly known as bear-4.1-1.noarch.rpm which is also acceptable:

http :24816/pulp/content/test/Packages/c/camel-0.1-1.noarch.rpm | sha256sum
ceb0f0bb58be244393cc565e8ee5ef0ad36884d8ba8eec74542ff47d299a34c1

However, when I download the primary.xml, I see two entries: one for bear at Packages/b/bear-4.1-1.noarch.rpm (which doesn't exist) with a checksum of ceb0f0 and one for camel at Packages/c/camel-0.1-1.noarch.rpm which does exist but it has a checksum of c5c34 (which doesn't match the checksum of the actual package at /pulp/content/test/Packages/c/camel-0.1-1.noarch.rpm).

daviddavis commented 2 years ago

Here's a copy of the primary.xml:

<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" xmlns:rpm="http://linux.duke.edu/metadata/rpm" packages="2">
<package type="rpm">
  <name>bear</name>
  <arch>noarch</arch>
  <version epoch="0" ver="4.1" rel="1"/>
  <checksum type="sha256" pkgid="YES">ceb0f0bb58be244393cc565e8ee5ef0ad36884d8ba8eec74542ff47d299a34c1</checksum>
  <summary>A dummy package of bear</summary>
  <description>A dummy package of bear</description>
  <packager></packager>
  <url>http://tstrachota.fedorapeople.org</url>
  <time file="1659114662" build="1331831374"/>
  <size package="1846" installed="42" archive="296"/>
  <location href="Packages/b/bear-4.1-1.noarch.rpm"/>
  <format>
    <rpm:license>GPLv2</rpm:license>
    <rpm:vendor></rpm:vendor>
    <rpm:group>Internet/Applications</rpm:group>
    <rpm:buildhost>smqe-ws15</rpm:buildhost>
    <rpm:sourcerpm>bear-4.1-1.src.rpm</rpm:sourcerpm>
    <rpm:header-range start="280" end="1697"/>
    <rpm:provides>
      <rpm:entry name="bear" flags="EQ" epoch="0" ver="4.1" rel="1"/>
    </rpm:provides>
  </format>
</package>
<package type="rpm">
  <name>camel</name>
  <arch>noarch</arch>
  <version epoch="0" ver="0.1" rel="1"/>
  <checksum type="sha256" pkgid="YES">c5c34e1843847990d2c79b936309d6257279e26f907e20f9f58073a14525e1ef</checksum>
  <summary>A dummy package of camel</summary>
  <description>A dummy package of camel</description>
  <packager></packager>
  <url>http://tstrachota.fedorapeople.org</url>
  <time file="1659114663" build="1331831360"/>
  <size package="1847" installed="42" archive="296"/>
  <location href="Packages/c/camel-0.1-1.noarch.rpm"/>
  <format>
    <rpm:license>GPLv2</rpm:license>
    <rpm:vendor></rpm:vendor>
    <rpm:group>Internet/Applications</rpm:group>
    <rpm:buildhost>smqe-ws15</rpm:buildhost>
    <rpm:sourcerpm>camel-0.1-1.src.rpm</rpm:sourcerpm>
    <rpm:header-range start="280" end="1697"/>
    <rpm:provides>
      <rpm:entry name="camel" flags="EQ" epoch="0" ver="0.1" rel="1"/>
    </rpm:provides>
  </format>
</package>
</metadata>
daviddavis commented 2 years ago

FYI, it sounds like we'll not be allowing users to set relative_path on upload so the package should be named using nevra and thus, this isn't going to be a high priority for us.

pulpbot commented 1 year ago

https://bugzilla.redhat.com/show_bug.cgi?id=2151657

dralley commented 1 year ago

K so I'm finally looking at this one. Here's the state of things.

Pulp has one repository-level uniqueness filter (not constraint in the database sense, it's enforced by logic). That filter as it currently exists is N + E + V + R + A + "location_href".

location_href is filled in with NEVRA if you don't manually override it using the "relative_path" parameter. But you can set that parameter to anything, which is not great, especially because it isn't part of the global package unique constraint (so whatever value you use during an upload will be used everywhere).

Scenarios where it behaves CORRECTLY:

Scenarios where it behaves KIND OF CORRECTLY:

Scenarios where it behaves INCORRECTLY:

The general TL;DR is that using the "relative_path" parameter on the upload endpoint generally leads to suboptimal outcomes.

How to Fix:

We could do one or both but either is painful and not a complete fix. Fixing data that already exists is a can of worms I'm going to discuss on the BZ not here. That's a separate conversation.

Neither of these feel like great options? But I would probably prefer the former in the short-term.

daviddavis commented 1 year ago

We don't allow regular users to set the relative_path for packages but I think we want to allow admins to do so--let me double check on this. Would having a config variable that defaults to False (e.g. ALLOW_UNSAFE_SETTING_OF_RPM_NAMES) be an option?

ipanova commented 1 year ago

Given Daniels comment and analyses, I agree that providing relative_path on the upload creates more problems than benefits, where the benefits are still unclear.

I would be in favor of removing in the upload api the relative_path or making it no-op. We'd also then attempt to fix already existing data by properly setting the package.location_href and content_artifact.relative_path stored in DB.

@daviddavis everything that has in it's name unsafe implies no guarantee and ask for trouble for us to maintain somehow the corner case as for the user who will get non-deterministic results. Unless there is a justifiable strong usecase to keep relative_path on the upload I'd remove it/make it no-op.

daviddavis commented 1 year ago

I checked and we don't the ability to set filenames when uploading rpms.

We'd also then attempt to fix already existing data by properly setting the package.location_href and content_artifact.relative_path stored in DB.

This makes me a bit nervous. Would it be possible to let users opt and/or have a dry run option that would list packages that have bad location_hrefs/relative_paths?

ipanova commented 1 year ago

@daviddavis That probably won't be a problem for the django management command