pulp / pulpcore

Pulp 3 pulpcore package https://pypi.org/project/pulpcore/
GNU General Public License v2.0

Content-app allows the use of other users' Remote credentials when retrieving on-demand content. #3212

Open gerrod3 opened 2 years ago

gerrod3 commented 2 years ago

If User A and User B both sync the same on-demand content from their separate remotes, the content is only created once, but two remote artifacts are created. When User B goes to download the content, the credentials used for the download could be User A's, since the content-app loops through the remote artifacts for the content and chooses the first available one: https://github.com/pulp/pulpcore/blob/main/pulpcore/content/handler.py#L689-L696.
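
For context, a simplified paraphrase of that selection loop (hedged pseudocode, not a verbatim copy of handler.py):

```python
async def _stream_content_artifact(self, request, response, content_artifact):
    """Paraphrase only; the real method also handles ACS ordering, headers, etc."""
    # Nothing here scopes the queryset to the requesting user or to the remote
    # of the repository being served.
    remote_artifacts = content_artifact.remoteartifact_set.select_related("remote")
    for remote_artifact in remote_artifacts:
        try:
            # Downloads with remote_artifact.remote's URL, certs, and credentials,
            # regardless of which user created that Remote.
            return await self._stream_remote_artifact(request, response, remote_artifact)
        except Exception:
            continue  # on any failure, fall through to the next RemoteArtifact
    raise PathNotResolved(request.match_info["path"])
```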

ipanova commented 1 year ago

This is a security concern in case User B's credentials for the remote source happen to become invalid (no longer valid). In addition, we print a warning here whenever a download fails and expose the URL of the remote, which User B might not even have created: https://github.com/pulp/pulpcore/blob/main/pulpcore/content/handler.py#L701

ipanova commented 1 year ago

I was thinking we could add some RBAC filtering here https://github.com/pulp/pulpcore/blob/main/pulpcore/content/handler.py#L690 and only consider the remotes the user can see. The problem is that User B is not necessarily the same user who synced the repo. It could also be User C, an end user who has merely been given access to download the content and has nothing to do with the User B credentials that were used during the repo sync :/

mdellweg commented 1 year ago

I think this allows one user to steal artifacts from another if you happen to know the digest. You create a repository with a pulp_manifest claiming to contain that file but issuing a 4XX, 5XX or whatever error on that file, sync it on-demand, and retrieve the file (now downloaded with the other remote) through the content app.

ipanova commented 1 year ago

@mdellweg you can also steal content this way if it happens to already be downloaded, via the query-existing-content stage

mdellweg commented 1 year ago

Good observation. Do we need to add queryset scoping to that stage?

mdellweg commented 1 year ago

Or maybe we just say: "Some level of trust is needed for people acting in the same domain." (Once we introduce that feature.)

lubosmj commented 1 year ago

One of the ideas that came to my mind was to create a new relation/object between remote artifacts and content artifacts/content/repository content (one of those, will decide later). Right now, a content unit (https://github.com/pulp/pulpcore/blob/2801c04fc241ddcbe52f71411bf5d8d95a24a52c/pulpcore/app/models/repository.py#L541, https://github.com/pulp/pulpcore/blob/241ca8586b9909a50b2e913197d0823bc968429b/pulpcore/app/models/content.py#L595, https://github.com/pulp/pulpcore/blob/241ca8586b9909a50b2e913197d0823bc968429b/pulpcore/app/models/content.py#L684) is considered a global unit attached to repositories. Thus, there can be a bunch of remote artifacts/content artifacts per content unit, while only the one content unit is attached to a repository (or a version). Right now, we lose track of which remote artifacts were initially created for that content unit after syncing.

If we could isolate and create a set of related remote artifacts, content artifacts, and repository content units, so that after every sync operation there is a unique set of them, then only the particular remote that was "associated" with the sync task, the exact same remote artifacts, and the corresponding repository contents would be used to download on-demand data later.

Benefits: Users will not be able to use remote configurations of other users (because we will keep track of the tuples (remote, remoteartifact, contentremoteartifact) per repository version, and other, irrelevant remotes will not be considered at all). The content will always be downloaded from the same source as declared (right now, we iterate seemingly randomly through remote artifacts and record the first downloaded result).

Questions:

  1. How are we going to deal with an extremely large number of "duplicated shallow" objects/relations in the database?
  2. How could this affect ACS?
  3. How will resyncing affect the relations/sets? Will there be a new set after every sync?

lubosmj commented 1 year ago

Actually, why do we need to deduplicate content units? What really needs to be deduplicated are artifacts, because those can be huge. If we allow duplicates of content units that can be attached to repository versions, we should be fine, shouldn't we?

bmbouter commented 1 year ago

[image: remote_and_content_models]

I made this by first installing some software in oci_env:

```
oci-env shell
pip install pydotplus
dnf install graphviz -y
```

Then make the visualizations:

```
graph_models core -o remote_and_content_models.jpg --include-models ContentArtifact,RemoteArtifact,Content,Artifact,Distribution,Remote,BaseDistribution,Repository,AlternateContentSource,RepositoryVersion,RepositoryContent
graph_models core -o pulpcore_models.jpg
```

Then I moved them from the oci_env to my host for posting with:

```
podman ps  # Get the container ID

podman cp bfe0f773200e:/all_pulpcore_models.jpg ~/Downloads/
podman cp bfe0f773200e:/pulpcore_models.jpg ~/Downloads/
```

Here's the graph of all pulpcore models, from which I manually picked the models of interest (have a look in case I missed any).

[image: all_pulpcore_models]

lubosmj commented 1 year ago

To visualize the proposal, I am attaching a couple of pictures below.

The first picture shows how the database looks right now after executing two sync tasks in which different remotes were used to sync the same content into distinct repositories:


Idea 1: Content units are duplicated across repository versions once different remotes were used to sync them.

The second picture depicts how the relations/objects would look after performing the same workflow. The content is duplicated, so one content unit exists per repository content/content artifact/remote artifact (one of those). It would require us to relax uniqueness constraints at the database level to allow this scenario.

Disadvantage: twice as many objects (e.g., content, contentartifact) in the database as we have right now.


Idea 2 (NEW): Additional relation between remote artifacts and repository content is created to preserve/isolate the source of content

To reduce the number of objects in the database, it might even be possible to reference remote artifacts directly from repository content objects. The relations would look like this:

Disadvantage: unnecessary coupling of repository content to remote artifacts and remotes, even though these objects live on different layers of abstraction.
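
A minimal Django sketch of Idea 2 (hypothetical, not existing pulpcore code; field names and on_delete behavior are placeholders):

```python
from django.db import models


class RepositoryContent(models.Model):
    """Simplified stand-in for the real through-model tying Content to a Repository."""

    content = models.ForeignKey("core.Content", on_delete=models.CASCADE)
    repository = models.ForeignKey("core.Repository", on_delete=models.CASCADE)

    # New in Idea 2: remember which RemoteArtifact created this entry, so the
    # content app can serve on-demand downloads through exactly that remote.
    # Null for uploaded or policy="immediate" content.
    remote_artifact = models.ForeignKey(
        "core.RemoteArtifact", null=True, on_delete=models.SET_NULL
    )
```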


Another question is whether it is possible to preserve the relation between remote artifacts and content/repository artifacts through stages in the sync pipeline. Are we sure we can share the knowledge of artifacts and content between stages?

lubosmj commented 1 year ago

Sadly, Idea 2 may not be sufficient for pull-through cache workflows. There might be a case where a user syncs content with the on-demand policy and another user adds a remote to the distribution to enable pull-through caching. Then, for the pull-through cache workflows, there are no repository content objects involved at all. So, we cannot benefit from the relation, and the second user can still end up using another user's credentials. Is that correct, @gerrod3? Or is there no notion of remote artifacts?

gerrod3 commented 1 year ago

> Sadly, Idea 2 may not be sufficient for pull-through cache workflows. There might be a case where a user syncs content with the on-demand policy and another user adds a remote to the distribution to enable pull-through caching. Then, for the pull-through cache workflows, there are no repository content objects involved at all. So, we cannot benefit from the relation, and the second user can still end up using another user's credentials. Is that correct, @gerrod3? Or is there no notion of remote artifacts?

Pull-through caching uses the remote on the distribution to find the remote artifact to use to gather the content. The sync pipeline assumes that each remote artifact has one URL per content artifact, so pull-through caching uses the URL of the request to the content-app, plus the associated remote, to find the one remote artifact with the "matching" URL. In this case we are not looping through remote artifacts: we use just the one for that remote if it exists, else we create a new one at request time.
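
A hedged sketch of that lookup (assuming `remote`, `relative_path`, `request`, and `content_artifact` are in scope; `get_remote_artifact_url()` exists on Remote, but the surrounding flow is paraphrased, not copied from handler.py):

```python
# Pull-through caching: the remote comes from the Distribution, so there is no
# loop over all RemoteArtifacts attached to the content.
url = remote.get_remote_artifact_url(relative_path, request=request)
remote_artifact = RemoteArtifact.objects.filter(remote=remote, url=url).first()
if remote_artifact is None:
    # No existing record for this remote/URL pair: create one at request time.
    remote_artifact = RemoteArtifact(
        remote=remote, url=url, content_artifact=content_artifact
    )
```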

ggainey commented 1 year ago

I have a problem with Idea 1. Losing Content's uniqueness is going to have a ripple effect into various places that can currently assume "If I ask for a piece of Content by its 'natural keys', I am only ever going to get one answer". Idea 1 changes that assumption, in ways that will require code changes to adapt to.

I'm a much bigger fan of Idea 2. It places the burden of "which remote artifact to use" on the repository being synced (which does know/care about what remote is being used to sync), instead of requiring every user of Content objects to be aware that they are potentially non-unique.

ggainey commented 1 year ago

OK, I am going to attempt to summarize the outcome of a meeting to discuss these proposals. Attendees included @ggainey @bmbouter @dkliban @gerrod3 @dralley @mdellweg.

To sum up:

* We are good to go with a modification of "Idea 2" above.
* The modification is that the RepoContent-to-RemoteArtifact relation needs to go through a mapping-table (which I will call RC2RA for this discussion), mapping RepositoryContent-pulp_id to RemoteArtifact-pulp_id.
  * Among the reasons is that a single Content unit can have multiple Artifacts, each of which has its own RemoteArtifact.
* A consideration of this approach is, "who has to be responsible for removing entries from RC2RA"? The (current) answer is: "removing a RepositoryContent should cascade through RC2RA".
* Should it cascade to removing the RemoteArtifact as well? I think the answer we came up with is "yes", but my notes are incomplete.

Some questions that came up:

* How does this interact with orphan-cleanup?
  * "In complicated and painful ways", is likely. More thought needed here!
* How can we take advantage of this when deleting Remotes?
  * See prev RE "complicated"

I am certain that I have missed or misstated some of the meeting output, calling on the other attendees to chime in and straighten me out here :)

mdellweg commented 1 year ago

> OK, I am going to attempt to summarize the outcome of a meeting to discuss these proposals. Attendees included @ggainey @bmbouter @dkliban @gerrod3 @dralley @mdellweg.
>
> To sum up:
>
> * We are good to go with a modification of "Idea 2" above.
>
> * The modification is that the RepoContent-to-RemoteArtifact relation needs to go through a mapping-table (which I will call RC2RA for this discussion), mapping RepositoryContent-pulp_id to RemoteArtifact-pulp_id.
>
>   * Among the reasons is that a single Content unit can have multiple Artifacts, each of which has its own RemoteArtifact.
>
> * A consideration of this approach is, "who has to be responsible for removing entries from RC2RA"? The (current) answer is: "removing a RepositoryContent should cascade through RC2RA".

My understanding is, RepositoryContent should cascade to delete RC2RA, because that removes the "is used in a repository" information. On the other hand, deleting a RemoteArtifact should be protected by that very link, to keep the on-demand repository version intact. Together with Remote deletion being protected by RemoteArtifacts, this carries the proper information that a "remote is in use by a repository".

> * Should it cascade to removing the RemoteArtifact as well? I think the answer we came up with is "yes", but my notes are incomplete.

I do not think you can cascade delete objects through a many-to-many relationship. And we should not do that, because the same remote artifact may be used in a different repository. (If there were a "remove if many2many relation is empty" we could use that.) A RemoteArtifact (even today) is deleted if either the content (via orphan cleanup) or the remote is deleted. And it should be kept that way. We can additionally add a remove-orphan-ra step to the cleanup task, because a content still in use could gather an unlimited amount of orphaned RemoteArtifact(s). Basically, RemoteArtifact is a new thing that can get orphaned.
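
For concreteness, a hedged sketch of an RC2RA mapping table with the delete semantics described above (names are placeholders from this discussion, not existing pulpcore code):

```python
from django.db import models


class RC2RA(models.Model):
    """Mapping table between RepositoryContent and RemoteArtifact (sketch only)."""

    # Deleting a RepositoryContent removes its mapping rows: the
    # "is used in a repository" information goes away with it.
    repository_content = models.ForeignKey(
        "core.RepositoryContent", on_delete=models.CASCADE
    )

    # Deleting a RemoteArtifact (and, transitively, its Remote) is blocked while
    # a mapping row still references it, keeping on-demand repository versions usable.
    remote_artifact = models.ForeignKey(
        "core.RemoteArtifact", on_delete=models.PROTECT
    )

    class Meta:
        unique_together = ("repository_content", "remote_artifact")
```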

> Some questions that came up:
>
> * How does this interact with orphan-cleanup?

See comments above. The hard implication is not about cleanup, but the fact that orphaned on-demand content will plainly be unusable. This is not an issue for upload, where we turn it into immediate content, nor with sync, where we will create a new usable RemoteArtifact. The thing to consider is the modify action. We can only ever safely copy content from one repository to another (carrying its RemoteArtifact and duplicating the RC2RA).

  * "In complicated and painful ways", is likely. More thought needed here!

* How can we take advantage of this when deleting Remotes?

Easy. Orphaned content will not stop remotes from being deleted, while on-demand content in a repository will (by protecting the RemoteArtifact through RC2RA from the RepositoryContent).

>   * See prev RE "complicated"
>
> I am certain that I have missed or misstated some of the meeting output, calling on the other attendees to chime in and straighten me out here :)

Open Questions for me: How does this play with ACS?

pedro-psb commented 1 month ago

> since the content-app loops through the remote-artifacts for the content and chooses the first available one: https://github.com/pulp/pulpcore/blob/3f1887d73c33d9dd4246c584a3ec8a1903698083/pulpcore/content/handler.py#L815-L818

Is there any reason why we can't update the selection logic of the remote_artifact? This looks really fragile.

> Right now, we lose track of which remote artifacts were initially created for that content unit after syncing.

In the content-handler context we can determine the right remote for that request, and thus filter for the appropriate remote_artifact.

The main content-handler dispatch is done in _match_and_stream, and there we can uniquely identify the associated remote via distro.repo.remote or distro.pub.repover.repo.remote (not sure about distro.remote, which is meant for pull-through caching). https://github.com/pulp/pulpcore/blob/93f241f34c503da0fbac94bdba739feda2636e12/pulpcore/content/handler.py#L596-L598

So if we passed that associated_remote to _stream_content, we could just filter on it and not risk using "someone else's" remote:

```python
# on _stream_content_artifact method
remote_artifacts = content_artifact.remoteartifact_set.filter(
    remote=associated_remote).order_by_acs()
```

Does that make sense?
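
For illustration, a hypothetical helper along those lines (the Distribution/Repository/Publication relations exist, but this function is a sketch, not existing pulpcore code):

```python
def _resolve_associated_remote(distro):
    """Best-effort lookup of the remote that 'belongs' to this distribution."""
    # Repository attached directly to the distribution.
    if distro.repository is not None and distro.repository.remote is not None:
        return distro.repository.remote
    # Publication-backed distribution: publication -> repository_version -> repository.
    if distro.publication is not None:
        repo = distro.publication.repository_version.repository
        if repo.remote is not None:
            return repo.remote
    # distro.remote is the pull-through-caching remote (handled separately today).
    return distro.remote
```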

mdellweg commented 1 month ago

I really like the fresh-minds approach here. The one thing that comes to my mind is that content (being on-demand) can be moved from one repository to another, so you cannot be sure that the remote the repository is pointing to really is the one the content was created with. There could even be two content units requiring different remotes as part of the same repository version.

ipanova commented 1 month ago

This approach could fix some cases but not all of them, right, @pedro-psb? For example, the issue you are working on, where the remote URL does not change but the remote content does (the rpm nevra does not change while it gets signed, so the sha changes). The remote artifact download, even if pointing to the correct remote.url, would still fail on serving because the remote content has changed, wouldn't it?

@mdellweg having a repo with a remote, plus on-demand content copied into it from a different remote, sounds really messy, even if we seem to allow this as of today.

mdellweg commented 1 month ago

I kept wondering if using the same objects for persisted and on-demand content is a good idea. Only creating "real" content for things we have the actual artifacts for could help us contain trust issues with on-demand upstreams much better.

pedro-psb commented 1 month ago

> For example, the issue you are working on, where the remote URL does not change but the remote content does (the rpm nevra does not change while it gets signed, so the sha changes). The remote artifact download, even if pointing to the correct remote.url, would still fail on serving because the remote content has changed, wouldn't it?

What you are describing is more closely related to #5012.

On this issue, RemoteB (associated with RepoB) replaces the Sum1 pkg with Sum2, as you said, but that is actually synced on-demand for RepoB.

Then, if we request that nevra pkg from the DistA URL (associated with RepoA, whose RemoteA was not changed), we sometimes use RemoteB and get the wrong package. This happens because:

(a) They shared the same ContentArtifact when both remotes contained the same nevra Sum1 pkg. This CA has two RAs.
(b) The RA associated with the CA is picked by chance. Since neither has a related ACS, pulp_id is used for sorting (as shown by <query>.explain()), which means sorting on a uuid4 (randomly generated) key.
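
A small illustration of that observation (assuming a `content_artifact` with two RemoteArtifacts and no ACS involved; `order_by_acs()` is the queryset method the content app already uses):

```python
# With no AlternateContentSource in play, the ordering effectively falls back to
# pulp_id, a randomly generated UUID4, so which RemoteArtifact comes first is
# arbitrary rather than tied to the repository/distribution being served.
ras = content_artifact.remoteartifact_set.order_by_acs()
print(ras.explain())               # per the observation above, sorts on pulp_id
print([ra.remote.name for ra in ras])
```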


ipanova commented 3 weeks ago

> What you are describing is more closely related to https://github.com/pulp/pulpcore/issues/5012.
>
> On this issue, RemoteB (associated with RepoB) replaces the Sum1 pkg with Sum2, as you said, but that is actually synced on-demand for RepoB.

I am afraid we cannot take issues with remote artifacts and solve them separately. We should rather collect all of them and come up with a design change that covers all of them. Your proposal in isolation fixes this particular issue, so the correct RA will be picked based on the remote associated with the repo. But what about repos that do not have a remote permanently associated with them? One could just create a repo and a remote and then use that remote in the sync call. Also, this does not solve the copy use case; as mentioned, one might end up having on-demand content from X different remotes in one repo.

pedro-psb commented 3 weeks ago

That's fair, we really need a broader approach here. I've opened an issue for the case I'm working on: #5725

lubosmj commented 1 week ago

We may encourage users to use domains if they are worried about other users' credentials being used during on-demand downloads.