Open timo-HERE opened 8 months ago
In which part of ORT would you like ORT to treat these as equal? Like, do you get these entries from a single analyzer run, but want only one of the entries? Or are you comparing different analyzer runs?
And how exactly should treating them the same work? Like, if consolidating these entries to one, to which one should be consolidated in general terms?
@sschuberth, these entires were for from single analyzer run, but in order to cache results effectively, these entries should be seen as one between runs.
If the filename in sourceArtifact
(js-yaml-3.13.1.tgz) and the hash
matches, why the download server would matter?
why the download server would matter?
So you're proposing to download such artifacts only once for scanning, from an arbitrary location of those locations that provide the same by (by name and hash)?
Yes, correct.
@sschuberth, In addition I just realised that ORT probably requires separate package configurations for each download location. Would be great to fix this at the same time.
id: "NPM::js-yaml:3.13.1"
source_artifact_url: "https://registry.npmjs.org/js-yaml/-/js-yaml-3.13.1.tgz"
path_excludes:
- pattern: "package/*.md"
reason: "DOCUMENTATION_OF"
comment: "Documentation within the package"
id: "NPM::js-yaml:3.13.1"
source_artifact_url: "https://artifactory.local/artifactory/js-yaml/-/js-yaml-3.13.1.tgz"
path_excludes:
- pattern: "package/*.md"
reason: "DOCUMENTATION_OF"
comment: "Documentation within the package"
Would be great to fix this at the same time.
That's a completely different area of code, though. Anyway, contributions are welcome 😉
@sschuberth, I created a new issue for that: #8236
I could also see an alternative, less disruptive solution to the problem.
Most of the source and binary artifacts have a primary public download location. Kind of "ground zero". This whole thing is caused from a fact that local caches are used for those public repositories. ORT could have a global "URL replace function" as part of ORT configuration which is used for all remote artifacts.
ort:
remoteArtifactURLs:
# Force use of https
- replace: "http://"
with: "https://"
# For Node packages, use public repository always
- replace: "https://artifactory.local/npm-virtual-repository/"
with: "https://registry.npmjs.org/"
# Java packages should be downloaded from the cache
- replace: "https://repo.maven.apache.org/maven2/"
with: "https://artifactory.local/maven-cache/"
The result would be that ORT would use those local caches or proxies and deduplication happens as a side product.
ORT could have a global "URL replace function" as part of ORT configuration which is used for all remote artifacts.
Sounds like https://github.com/oss-review-toolkit/ort/issues/6698, or?
@timo-HERE A workaround that you might be able to take and that's worked for me is to forcefully replace/insert default sources.
For NPM, I'm just plain and simple deleting all .npmrc
files, thereby forcing a distinct config just for the CI run that runs ORT:
find . -name ".npmrc" -type f -exec rm {} +
npm config set -g -- @the-scope:registry=https://${CI_SERVER_HOST}/api/v4/packages/npm
etc.
For Java, the approach is a bit more hacky (arguably), where I go into all build.gradle
files and insert mavenCentral()
before the custom repository (note that I'm using a Ruby script here, because my shell scripting skills aren't strong enough :D):
files = `find . -name "build.gradle" -type f`.lines.map(&:strip)
files.each do |filename|
content = File.read(filename)
# As per https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:case-for-maven-local, mavenLocal should be avoided
new_content = content.gsub(/mavenLocal\(\)\s+/, '')
# Repository order determines lookup order and Gradle will prefer to load all artifacts from the same repository. So
# we prefer mavenCentral over our own mirror for public packages.
# See https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:declaring_multiple_repositories
new_content.gsub!(/\brepositories {(\s+)maven {/, "repositories {\\1mavenCentral()\\1maven {")
File.write(filename, new_content)
end
Note that a potential addition to this script would be to first check if mavenCentral()
is already included together with a custom repository (so effectively presence of both mavenCentral()
and maven {
) and if so, we should probably remove mavenCentral()
so that it can then be inserted before the custom repository.
Both approaches seem to be working so far.
Despite my previous comment, I want to add something here that at least to me seems related, because we're facing similar issues for 2 use cases where the root problem is not a private mirror but instead 2 other problems that in the end also boil down to what "equality" means for packages.
Git historically supports different protocols and so do many Git servers, including GitHub and GitLab:
git+https://github.com/owner/repo
git+https://github.com/owner/repo.git
https://github.com/owner/repo
ssh://git@github.com/owner/repo.git
All of these are effectively equivalent in terms of source URL, but seem to be treated as different (as far as I've seen that is even the case when the hash matches).
Note that the cleanup happens for vcsProcessed
This can e.g. occur when using different package managers (or different versions of the same package manager), as is e.g. very much the case in the JavaScript ecosystem. So if a package is required in the exact same version and bundled once with SHA-1 and once with SHA-512 as hashing algorithm, it's treated as a different package.
In both cases, ORT then fails and complains about duplicate package IDs.
An example for both is mimic-fn that's a (transitive) dependency for multiple of our apps:
Some thoughts on this:
Problem 1:
vcsProcessed
is set, vcs
could/should be excluded from equality checks – I think this would be a sensible default where I wouldn't expect a lot of disagreementrevision
in an equality check is another point of debate. I could e.g. imagine a configuration setting that lets you configure strictness, e.g. strict
(= current approach, default), lenient
(= treat as equal when one value is set and one is empty/null
), none
(= exclude revision
from equality check)Problem 2:
sourceArtifact
or buildArtifact
use different hashing algorithms, download the artifacts and hash them with one of the two algorithms (e.g. SHA-512) – when the hash matches, treat packages as the same.strict
(= current approach, default), lenient
(= treat as equal when one value is set and the other is empty/null
or uses a different algorithm), none
(= exclude hash
from equality check)The first (= producing our own hashes) would probably also solve the problem that Timo has described where the same package could come from different repositories: As long as the hash matches, it should be treated as the same package.
Aside from adding new behavior, I could imagine also leveraging something we already have, namely curations: If we parse curations during the analyze
step, we could not only save some performance (e.g. no need to run things like npm view
or process its results if we already have certain data) but also fix duplicates by explicitly curating problematic fields. Imagine in my case above being able to do this:
- id: "NPM::mimic-fn:1.2.0"
curations:
source_artifact:
# I can either unset the values or, if I'm diligent, set them properly
url: ""
hash:
value: ""
algorithm: ""
If this were to be read during analyze and taken into account in parsePackage
, then we could at least handle cases for the fields that can be included in curations, i.e. everything listed in https://raw.githubusercontent.com/oss-review-toolkit/ort/main/integrations/schemas/curations-schema.json.
For Java, the approach is a bit more hacky (arguably), where I go into all
build.gradle
files and insertmavenCentral()
before the custom repository
I haven't tested this, but at least for Gradle a cleaner approach might be to use a temporary global initialization script that inserts mavenCentral()
.
Problem 1: Same source, different protocol
IMO, this could and should be solved via #6698.
Problem 2: Same source, different hashing algorithm
An example for both is mimic-fn that's a (transitive) dependency for multiple of our apps:
Unfortunately, for some reason I cannot enlarge your screenshots, so they're unreadable for me.
When
vcsProcessed
is set,vcs
could/should be excluded from equality checks
Actually, I believe we never perform equality checks on non-processed VCS info... what makes you believe we do?
Imagine in my case above being able to do this:
This should actually already be possible. And I believe does not matter really whether curations are applied before or after calling parsePackage()
during analysis. You can already use a curation to "settle" on a specific hash algorithm and value for a package, so different package metadata becomes equal, and will be de-duplicated before the download / scanning.
Unfortunately, for some reason I cannot enlarge your screenshots, so they're unreadable for me.
Strange, I've just dragged the image into the GitHub editor like usual. Can you see this one:
Actually, I believe we never perform equality checks on non-processed VCS info... what makes you believe we do?
I should have been more precise in my wording: I saw that the packages are stored in a mutableSet
. I don't have all that much Kotlin experience, but quick research yielded that sets seem to be using the equals
and hash
methods of an object to determine whether or not 2 objects are considered equal (and thus wouldn't be inserted into the set twice). I couldn't find either of these 2 methods overridden in Project
, so I assume that this makes it so that all attributes of a project are considered when determining whether or not to insert it into the set.
This should actually already be possible.
I couldn't find any way to pass --package-curations-dir
to ort analyze
– hence it doesn't seem to pick up such settings and then consequently fails when it detects what it considers duplicate packages (= 2 packages with the same ID).
Can you see this one:
Yes, that works, thanks!
all attributes of a project are considered when determining whether or not to insert it into the set.
You're right. I now get what you mean, thanks for explaining.
I couldn't find any way to pass
--package-curations-dir
toort analyze
The analyzer is the "native" place that originally applies curations. As such, the analyzer takes curations from their default locations, which are all enabled curation providers:
Other command that take a --package-curations-dir
only use that option to override existing curations without running the analyzer again; it's basically a work-around for https://github.com/oss-review-toolkit/ort/issues/6188 not being implemented yet.
So, if you want the analyzer to take curation from a "special" / non-default location, instead of using a --package-curations-dir
CLI option you should configure one of the curation providers with a custom location:
I have a case where I think ORT considers these two packages as two different ones:
Is there a way to configure ORT to consider these two source artifact URLs the same since they have same filename and hash?