refactor: make it clear we're working with PURLs

ctron commented 6 days ago

As it feels the confusion happens quite often, even to me, I think we should rename things.

We have "packages", which have some meaning according to our glossary: https://github.com/trustification/trustify/blob/main/docs/glossary.md#package

A package is an atomic artifact or component. Packages may be addressed using pURLs. A package may be described by an SBOM describing how it is created and its contents. A package may certainly contain other packages (e.g. shading one Java jar into another). A package may also be the sole member of a Product (UBI-8.0.13-x86.oci may be the singular package within the "UBI 8.0.13-x86" product). A package is one step more abstract than an artifact.

However, in the next section that definition actually calls out PURLs:

Package URLs (pURLs) are possibly ambiguous names applied to packages. […]

So there are "packages", and "PURLs", which can be used to give "packages" a name.

In our current API and database structure, we have (because history) several entries, endpoints, and services which are named "packages", but actually deal with PURLs only.

This PR tries to clean up the user facing API (REST endpoints) by changing the base URL from /api/v1/package to /api/v1/package/by-url, when it comes to PURL based calls.

Next steps

It would also be reasonable to rename functions, services, and database structures towards PURLs. But naming is hard, so let's start small.

Alternatives

One alternative could be to use /api/v1/purl. However, those API still deal with packages, just by using PURLs as name. So keeping "packages" make sense to me.

JimFuller-RedHat commented 6 days ago

random aside - in corgi land we sidestepped the overloading of the term 'package' by just using the equally imprecise term of 'component' which always has a 1:1 relationship with a single PURL. That gave us some 'wiggle' room if ever we had to assign a purl to something like a single script or something less then a package.

bobmcwhirter commented 6 days ago

I think in my mind, a package is something that can be referenced by a pURL.

A pURL is just a name for a pURL-addressible thing.

The ambiguity I think I was talking about in the glossary is more around... a pURL doesn't describe the binary bytes produced. Those would be artifacts.

eg, a specific build of pkg:maven/org.apache/log4j@1.2.3 with a given hash of sha256:<blah> would be a non-ambiguous artifact, based on the package referenced by pkg:maven/org.apache/log4j@1.2.3.

"Bob McWhirter" is a name, somewhat ambiguous (there's at least one Canadian and one Arizonan sharing my name). This particular Bob who is typing now is the the artifact with my particular sha256. I think?

All that being said, I'm not opposed to by-url, but that makes me wonder, is there any other non-url way to discuss a package, beyond our internal uuid? Or do we consider that sufficient to bifurcate?

I think similarly, while other systems may store "URLs", they are ultimately speaking about the thing behind the URL, e.g., the endpoint, the webpage, the podcast you can download.

Naming is hard. Naming things that name things is even harder.

Do we work with names? Do we work with things, and just use names to refer to them?

I dunno.

bobmcwhirter commented 6 days ago

I guess I may fall into the camp of "a package is anything that can be described by a pURL" which might as well just be a pURL. Maybe.

ctron commented 6 days ago

All that being said, I'm not opposed to by-url, but that makes me wonder, is there any other non-url way to discuss a package, beyond our internal uuid? Or do we consider that sufficient to bifurcate?

Right now, all that data comes from PURLs. We also do have packages, but in the context of an SBOM. Going in the direction of what @JimFuller-RedHat mentioned before I guess.

We also have CPEs. And we also have file-hashes in the context of an SBOM. All of those reference "packages of an SBOM".

I think we are on the same page, PURLs are names for packages. But one package can have multiple PURLs assigned, and additionally other names of other types (like CPEs, digests, and whatever the future will bring).

If we'd add functions/endpoints to search for CPEs or hashes. Where would they end up in the HTTP endpoint namespace?

JimFuller-RedHat commented 6 days ago

when you say 'But one package can have multiple PURLs assigned,'

Do you mean a package can contain other packages which have purls or do you mean a single package can have multiple purls ?

If the later then eg. a single unique id assigned to a single package creates a contract ... put another way if we do have other purls that need to map to a package I would suggest pointing to canonical purl (ultimately this is 'our purl ') rather then directly map to package.

This is a subtle difference but we have already seen differences in who mints a purl ... sometimes we are the first person to do so, then sometime later upstream mints a purl. Also we should expect wholesale refactoring of purls a real possibility as that proto spec matures and upstream work more with it ... it becomes burdensome to have to update all associations in database, etc. when this happens.

From a REST API (conceptual model) pov we can make the association transparent ... eg the above applies more to the logical internal model to handle unknown change later on .

Lastly, I could be easily convinced I am overthinking all this....

bobmcwhirter commented 6 days ago

I guess I was thinking CPEs are one possible name/identifier for products (the stuff Dejan is working on).

ctron commented 6 days ago

In SPDX-land, a single package can have a list of external references, which can by of type purl: https://spdx.github.io/spdx-spec/v2.3/package-information/#721-external-reference-field … which leads to each package having a list of cpes or purls attached, if I get that right.

And I do recall a conversation where that might actually make sense. Publishing the same artifact to multiple maven repositories.

bobmcwhirter commented 6 days ago

I know we're going into the weeds here, but I also think pkg:maven/com.foo/bar@1.2.3 implies ?repository_url=m2.maven.org or whatever maven central's repo is.

So the same jar published elsewhere would have to include ?repository_url=somewhere.else

JimFuller-RedHat commented 6 days ago

+1 to external reference ... and maybe all this kind of stuff be represented as labels/tags ? maybe that is overloading generic thing too much and worth representing in core model as 'external-reference' ... that makes it clearly different then the canonical purl.

ctron commented 6 days ago

I think the model that we have is actually quite good. We have base PURLs, versioned PURLs, qualified PURLs. Referencing those from SBOM packages/components. Also towards CPEs and hashes.

It's only the names that feel confusing.

carlosthe19916 commented 6 days ago

just to be sure. The current way the UI is fetching packages is:

First step: get the list of packages from GET /api/v1/package. Each item from the list obtained here has a field uuid. Then
Second step: get a single package using GET /api/v1/package/{uuid}

Are those step still going to be valid? I would expect "yes" right?

jcrossley3 commented 6 days ago

As it feels the confusion happens quite often, even to me, I think we should rename things.

Can you describe the confusion that happens quite often? What problems are we solving?

This PR tries to clean up the user facing API (REST endpoints) by changing the base URL from /api/v1/package to /api/v1/package/by-url, when it comes to PURL based calls.

I don't see how this clarifies things. To me, /api/v1/package/{purl} is pretty clear. I'm asking for the package referenced by a purl.

I also don't understand the terms "base", "qualified" and "versioned". Will the consumers of our API understand them?

carlosthe19916 commented 6 days ago

Also, a general thought:

There must be a single way of fetching an Entity (single entity). I mean, having /api/v1/package/by-purl/{key} and /api/v1/package/{key} seems illogical to me.
I have no issues with multiple "search" endpoints (list of entities) /api/v1/package?q=abc and /api/v1/package/by-purl?q=abc where these endpoints return a paginated result.

ctron commented 6 days ago

As it feels the confusion happens quite often, even to me, I think we should rename things.

Can you describe the confusion that happens quite often? What problems are we solving?

I also did stumble over this myself when navigating the code. You see something that's named "package", but indeed it's a PURL.

This PR tries to clean up the user facing API (REST endpoints) by changing the base URL from /api/v1/package to /api/v1/package/by-url, when it comes to PURL based calls.

I don't see how this clarifies things. To me, /api/v1/package/{purl} is pretty clear. I'm asking for the package referenced by a purl.

How would you ask then for a package by CPE?

I also don't understand the terms "base", "qualified" and "versioned". Will the consumers of our API understand them?

I don't know, but they already existed. And, they only make sense on the context of a PURL. Not in the context of a "package". So it might be that you are too confused by the naming.

ctron commented 6 days ago

just to be sure. The current way the UI is fetching packages is:
* First step: get the list of packages from `GET /api/v1/package`. Each item from the list obtained here has a field `uuid`. Then

* Second step: get a single package using `GET /api/v1/package/{uuid}`
Are those step still going to be valid? I would expect "yes" right?

That depends on what you mean by "package". Package of an SBOM no. Package referenced by a CPE, no. Package referenced by a PURLs, yes.

ctron commented 6 days ago

Also, a general thought:

* There must be a single way of fetching an Entity (single entity). I mean, having `/api/v1/package/by-purl/{key}` and `/api/v1/package/{key}` seems illogical to me.

Correct. But indeed what you get back is information coming from the PURL. Not from the package itself. Basically it splits the PURL into a lot of database entities and returns that.

* I have no issues with multiple "search" endpoints (list of entities) `/api/v1/package?q=abc` and `/api/v1/package/by-purl?q=abc` where these endpoints return a paginated result.

That's why one idea was to just go for /api/v1/purl. Because basically you're only searching for PURLs. And they are references packages inside of SBOMs. And so they reference SBOMs. But you could also find an SBOM package by CPE or hash (not today, but hopefully in the future).

jcrossley3 commented 6 days ago

That's why one idea was to just go for /api/v1/purl. Because basically you're only searching for PURLs.

What is the use case for that? To me, a purl is not a resource -- it's just the name of a resource, in this case a Package.

Do we have a need to return a paginated list of names?

jcrossley3 commented 6 days ago

Could some of the confusion be coming from the existence of both a package and an sbom_package table? I would expect an sbom record to relate to a package record somehow.

ctron commented 5 days ago

That's why one idea was to just go for /api/v1/purl. Because basically you're only searching for PURLs.

What is the use case for that? To me, a purl is not a resource -- it's just the name of a resource, in this case a Package.

Do we have a need to return a paginated list of names?

Looks like that. If you take a look at e.g. PackageService::package_by_uuid, which returns PackageDetails, that's simply a decomposed PURL, stored in tables. And it's absolutely not about "packages".

Could some of the confusion be coming from the existence of both a package and an sbom_package table? I would expect an sbom record to relate to a package record somehow.

Right, the package table was there before. But it only stores PURLs, not packages. And sbom record relates to an sbom_package, which relates to zero or more packages (which indeed are PURLs). It also relates to one or more cpes, which is correct.

That's why I think it makes sense to rename services and entities too. But before doing all the work, I think we should agree on that.

jcrossley3 commented 5 days ago

Right, the package table was there before.

Before what? If you mean before we implemented this, then I might question why the existing package table wasn't incorporated into that design.

But it only stores PURLs, not packages.

This sentence is confusing. A PURL is just a name. It's a logical concatenation of attributes of a package (the concept, not the table). A package will have those attributes whether a particular SBOM says it has a PURL/CPE/identifier or not, right?

From the above doc, "SBOM packages are an entity of their own, but may have zero or more PURLs or CPEs (or other identifiers)". That sentence betrays a lack of shared understanding, I think. In an abstract sense, a package will always have a PURL, because it can be constructed from the package's attributes. From a literal reading of the spec, a package may not have an identifier, but is that what we should be modeling in the database?

My intuition is that our database designers -- Jens and Bob -- need to agree on whether we're modeling entities in the abstract or according to a spec.

ctron commented 5 days ago

Right, the package table was there before.

Before what? If you mean before we implemented this, then I might question why the existing package table wasn't incorporated into that design.

Because it didn't fit.

But it only stores PURLs, not packages.

This sentence is confusing. A PURL is just a name. It's a logical concatenation of attributes of a package (the concept, not the table). A package will have those attributes whether a particular SBOM says it has a PURL/CPE/identifier or not, right?

I don't think it's the sentence that is confusing, but the structure in the database. The package, versioned_package, qualified_package table only store decomposed parts of PURLs. I see the reason for that. It's the naming that is confusing!

Actually the package does not. It comes from the context of the SBOM. The same package can be named by different SBOM by different names.

In an abstract sense, a package will always have a PURL, because it can be constructed from the package's attributes. From a literal reading of the spec, a package may not have an identifier, but is that what we should be modeling in the database?

Again, in the database we model PURLs (with tables mentioned above). Not packages.

I think the database structure works just fine. We have (actual) SBOM packages, named by PURLs and CPEs. (any maybe others in the future). It simply is the naming.

bobmcwhirter commented 5 days ago

While I think a package is "anything that can be address by a Package URL (pURL)" I don't hold it strongly enough, nor do I have enough about SBOM packages in my head to disagree with @ctron right now.

My vote would be to proceed (after we winnow down other outstanding package-related PRs.......) and see how we feel with the purl-centric naming.

ctron commented 2 days ago

I updated the PR, renaming the package tables to "purl". Renaming the fields as well.

trustification / trustify

refactor: make it clear we're working with PURLs #468

Next steps

Alternatives