oss-review-toolkit / ort

A suite of tools to automate software compliance checks.
https://oss-review-toolkit.org
Apache License 2.0

Add `provider` to Package #155

Closed (jeffmcaffer closed 1 year ago)

jeffmcaffer commented 6 years ago

Packages have a packageManager (e.g., npm, maven, ...). Since a given type of package could come from many different places, the Package should also record a provider. The provider should not be the URL of the repository, but rather the notional name of the repository (e.g., npmjs.org, github.com). This allows the repos to move and change their URL structure without affecting the identity of the data stored in ORT.

sschuberth commented 6 years ago

@mnonnenmacher I believe what's called "provider" here is what we refer to as "provenance" in our idea.

jeffmcaffer commented 6 years ago

Provenance is a pretty loaded term that carries very deep meaning for some. It seems like here we only need the very simple implication that "this is which package foo we are talking about".

mnonnenmacher commented 6 years ago

I think this information should be added to the RemoteArtifact and VcsInfo models, because these are the two places where we reference URLs (apart from the homepage URL). For RemoteArtifact this could be something like "npmjs.org", "JCenter", or "Maven Central". For VcsInfo it could be "github.com", "bitbucket.org", and so on. My problem is how we should auto-detect the values for those fields: if we take part of the URL, like "github.com", this contradicts the idea of having something URL-independent. Maybe we would have to maintain a mapping from URL to provider name?
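To make the mapping idea concrete, here is a minimal sketch of a host-to-provider lookup. The table contents and the function name `provider_for_url` are assumptions for illustration only and are not part of ORT; the provider names are the ones suggested in this comment.

```python
from urllib.parse import urlparse

# Hypothetical table: the keys are host names, the values are the notional
# provider names. If a provider moves to a new host, only this table changes,
# not the data that stores the provider name.
HOST_TO_PROVIDER = {
    "registry.npmjs.org": "npmjs.org",
    "repo.maven.apache.org": "Maven Central",
    "jcenter.bintray.com": "JCenter",
    "github.com": "github.com",
}


def provider_for_url(url):
    """Return the provider name for a URL, or None if the host is unknown."""
    host = urlparse(url).hostname  # lowercased by urlparse; None if no host
    if host is None:
        return None
    return HOST_TO_PROVIDER.get(host)
```

Unknown hosts return `None` here so that a caller can decide whether to leave the field empty or to fail; that choice is left open in this sketch.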

jeffmcaffer commented 6 years ago

@mnonnenmacher Agreed, having a table that bi-directionally maps provider names to host names makes sense. It likely also makes sense to keep the provider names as generic as possible. For example, we recently ran into some identity problems because some folks were using "npmjs.org" and others "npmjs.com" as the provider. It turns out they are the same: going to npmjs.org forwards to npmjs.com.
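A bi-directional table of this kind could be sketched as follows. This is only an illustration under stated assumptions: the generic names "npmjs" and "github", the convention that the first listed host is the preferred one, and the helper names are all hypothetical.

```python
# Hypothetical table: one generic provider name maps to every host name known
# to refer to it. By assumption, the first host in each list is the preferred one.
PROVIDER_HOSTS = {
    "npmjs": ["npmjs.com", "npmjs.org"],
    "github": ["github.com"],
}

# Reverse index derived from the table above, so both directions stay in sync.
HOST_TO_PROVIDER = {
    host: provider
    for provider, hosts in PROVIDER_HOSTS.items()
    for host in hosts
}


def provider_for_host(host):
    """Map a host name to its provider name, or None if the host is unknown."""
    return HOST_TO_PROVIDER.get(host.lower())


def preferred_host(provider):
    """Map a provider name to its preferred host name, or None if unknown."""
    hosts = PROVIDER_HOSTS.get(provider.lower())
    return hosts[0] if hosts else None
```

With this layout, "npmjs.org" and "npmjs.com" both resolve to the same provider, which avoids the identity problem described above.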

To isolate the data from these sorts of variations and changes, using just "npmjs" would be more resilient. That's also in line with your other examples like "maven central" etc. For simplicity, perhaps we say that provider names need to be valid URL segments that do not require any quoting (e.g., no spaces, no funny chars, ...), and that they are case-insensitive and NOT case-preserving (or just spec lowercase).
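One way to read this rule is sketched below. It is an interpretation, not a spec: "no quoting required" is taken to mean that the name consists only of RFC 3986 unreserved characters, names are folded to lowercase, and the function name `normalize_provider` is hypothetical.

```python
import re

# Lowercase letters, digits, and the RFC 3986 unreserved punctuation; none of
# these need percent-encoding inside a URL path segment.
_VALID_PROVIDER = re.compile(r"[a-z0-9._~-]+")


def normalize_provider(name):
    """Lowercase a provider name and check that it is a plain URL segment."""
    normalized = name.lower()
    # "." and ".." have special meaning as path segments, so reject them too.
    if normalized in (".", "..") or not _VALID_PROVIDER.fullmatch(normalized):
        raise ValueError(f"invalid provider name: {name!r}")
    return normalized
```

Under this reading, "NPMJS" and "npmjs" normalize to the same value, while "Maven Central" is rejected because of the space and would need a form such as "mavencentral".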

sschuberth commented 6 years ago

This somewhat relates to https://github.com/heremaps/oss-review-toolkit/issues/20.