oss-review-toolkit / ort

A suite of tools to automate software compliance checks.
https://oss-review-toolkit.org
Apache License 2.0
1.57k stars 308 forks source link

Deduplicate source artifacts to scan based on file name and hash #8127

Open timo-HERE opened 8 months ago

timo-HERE commented 8 months ago

I have a case where I think ORT considers these two packages as two different ones:

 name=js-yaml,
 version=3.13.1,
 purl=pkg:npm/js-yaml@3.13.1,
 sourceArtifact=RemoteArtifact(url=https://registry.npmjs.org/js-yaml/-/js-yaml-3.13.1.tgz,
 hash=Hash(value=aff151b30bfdfa8e49e05da22e7415e9dfa37847,
 url=git+https://github.com/nodeca/js-yaml.git,
 revision=665aadda42349dcae869f12040d9b10ef18d12da,

 name=js-yaml,
 version=3.13.1,
 purl=pkg:npm/js-yaml@3.13.1,
 sourceArtifact=RemoteArtifact(url=https://artifactory.local/artifactory/js-yaml/-/js-yaml-3.13.1.tgz,
 hash=Hash(value=aff151b30bfdfa8e49e05da22e7415e9dfa37847,
 url=https://github.com/nodeca/js-yaml.git,
 revision=665aadda42349dcae869f12040d9b10ef18d12da,

Is there a way to configure ORT to consider these two source artifact URLs the same since they have same filename and hash?

sschuberth commented 8 months ago

In which part of ORT would you like ORT to treat these as equal? Like, do you get these entries from a single analyzer run, but want only one of the entries? Or are you comparing different analyzer runs?

And how exactly should treating them the same work? Like, if consolidating these entries to one, to which one should be consolidated in general terms?

timo-HERE commented 7 months ago

@sschuberth, these entires were for from single analyzer run, but in order to cache results effectively, these entries should be seen as one between runs.

If the filename in sourceArtifact (js-yaml-3.13.1.tgz) and the hash matches, why the download server would matter?

sschuberth commented 7 months ago

why the download server would matter?

So you're proposing to download such artifacts only once for scanning, from an arbitrary location of those locations that provide the same by (by name and hash)?

timo-HERE commented 7 months ago

Yes, correct.

timo-HERE commented 7 months ago

@sschuberth, In addition I just realised that ORT probably requires separate package configurations for each download location. Would be great to fix this at the same time.

id: "NPM::js-yaml:3.13.1"
source_artifact_url: "https://registry.npmjs.org/js-yaml/-/js-yaml-3.13.1.tgz"
path_excludes:
- pattern: "package/*.md"
  reason: "DOCUMENTATION_OF"
  comment: "Documentation within the package"
id: "NPM::js-yaml:3.13.1"
source_artifact_url: "https://artifactory.local/artifactory/js-yaml/-/js-yaml-3.13.1.tgz"
path_excludes:
- pattern: "package/*.md"
  reason: "DOCUMENTATION_OF"
  comment: "Documentation within the package"
sschuberth commented 7 months ago

Would be great to fix this at the same time.

That's a completely different area of code, though. Anyway, contributions are welcome 😉

timo-HERE commented 7 months ago

@sschuberth, I created a new issue for that: #8236

I could also see an alternative, less disruptive solution to the problem.

Most of the source and binary artifacts have a primary public download location. Kind of "ground zero". This whole thing is caused from a fact that local caches are used for those public repositories. ORT could have a global "URL replace function" as part of ORT configuration which is used for all remote artifacts.

ort:
  remoteArtifactURLs:
  # Force use of https
  - replace: "http://"
    with: "https://"
  # For Node packages, use public repository always
  - replace: "https://artifactory.local/npm-virtual-repository/"
    with: "https://registry.npmjs.org/"
  # Java packages should be downloaded from the cache
  - replace: "https://repo.maven.apache.org/maven2/"
    with: "https://artifactory.local/maven-cache/" 

The result would be that ORT would use those local caches or proxies and deduplication happens as a side product.

sschuberth commented 7 months ago

ORT could have a global "URL replace function" as part of ORT configuration which is used for all remote artifacts.

Sounds like https://github.com/oss-review-toolkit/ort/issues/6698, or?

clemens commented 1 month ago

@timo-HERE A workaround that you might be able to take and that's worked for me is to forcefully replace/insert default sources.

For NPM, I'm just plain and simple deleting all .npmrc files, thereby forcing a distinct config just for the CI run that runs ORT:

find . -name ".npmrc" -type f -exec rm {} +
npm config set -g -- @the-scope:registry=https://${CI_SERVER_HOST}/api/v4/packages/npm
etc.

For Java, the approach is a bit more hacky (arguably), where I go into all build.gradle files and insert mavenCentral() before the custom repository (note that I'm using a Ruby script here, because my shell scripting skills aren't strong enough :D):

files = `find . -name "build.gradle" -type f`.lines.map(&:strip)
files.each do |filename|
  content = File.read(filename)

  # As per https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:case-for-maven-local, mavenLocal should be avoided
  new_content = content.gsub(/mavenLocal\(\)\s+/, '')

  # Repository order determines lookup order and Gradle will prefer to load all artifacts from the same repository. So
  # we prefer mavenCentral over our own mirror for public packages.
  # See https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:declaring_multiple_repositories
  new_content.gsub!(/\brepositories {(\s+)maven {/, "repositories {\\1mavenCentral()\\1maven {")

  File.write(filename, new_content)
end

Note that a potential addition to this script would be to first check if mavenCentral() is already included together with a custom repository (so effectively presence of both mavenCentral() and maven {) and if so, we should probably remove mavenCentral() so that it can then be inserted before the custom repository.

Both approaches seem to be working so far.

clemens commented 1 month ago

Despite my previous comment, I want to add something here that at least to me seems related, because we're facing similar issues for 2 use cases where the root problem is not a private mirror but instead 2 other problems that in the end also boil down to what "equality" means for packages.

Problem 1: Same source, different protocol

Git historically supports different protocols and so do many Git servers, including GitHub and GitLab:

All of these are effectively equivalent in terms of source URL, but seem to be treated as different (as far as I've seen that is even the case when the hash matches).

Note that the cleanup happens for vcsProcessed

Problem 2: Same source, different hashing algorithm

This can e.g. occur when using different package managers (or different versions of the same package manager), as is e.g. very much the case in the JavaScript ecosystem. So if a package is required in the exact same version and bundled once with SHA-1 and once with SHA-512 as hashing algorithm, it's treated as a different package.


In both cases, ORT then fails and complains about duplicate package IDs.

An example for both is mimic-fn that's a (transitive) dependency for multiple of our apps:

Screenshot 2024-07-26 at 14 03 28

Some thoughts on this:

Problem 1:

Problem 2:

The first (= producing our own hashes) would probably also solve the problem that Timo has described where the same package could come from different repositories: As long as the hash matches, it should be treated as the same package.

Aside from adding new behavior, I could imagine also leveraging something we already have, namely curations: If we parse curations during the analyze step, we could not only save some performance (e.g. no need to run things like npm view or process its results if we already have certain data) but also fix duplicates by explicitly curating problematic fields. Imagine in my case above being able to do this:

- id: "NPM::mimic-fn:1.2.0"
  curations:
    source_artifact:
      # I can either unset the values or, if I'm diligent, set them properly
      url: ""
      hash:
        value: ""
        algorithm: ""

If this were to be read during analyze and taken into account in parsePackage, then we could at least handle cases for the fields that can be included in curations, i.e. everything listed in https://raw.githubusercontent.com/oss-review-toolkit/ort/main/integrations/schemas/curations-schema.json.

sschuberth commented 1 month ago

For Java, the approach is a bit more hacky (arguably), where I go into all build.gradle files and insert mavenCentral() before the custom repository

I haven't tested this, but at least for Gradle a cleaner approach might be to use a temporary global initialization script that inserts mavenCentral().

sschuberth commented 1 month ago

Problem 1: Same source, different protocol

IMO, this could and should be solved via #6698.

Problem 2: Same source, different hashing algorithm

An example for both is mimic-fn that's a (transitive) dependency for multiple of our apps:

Unfortunately, for some reason I cannot enlarge your screenshots, so they're unreadable for me.

When vcsProcessed is set, vcs could/should be excluded from equality checks

Actually, I believe we never perform equality checks on non-processed VCS info... what makes you believe we do?

Imagine in my case above being able to do this:

This should actually already be possible. And I believe does not matter really whether curations are applied before or after calling parsePackage() during analysis. You can already use a curation to "settle" on a specific hash algorithm and value for a package, so different package metadata becomes equal, and will be de-duplicated before the download / scanning.

clemens commented 1 month ago

Unfortunately, for some reason I cannot enlarge your screenshots, so they're unreadable for me.

Strange, I've just dragged the image into the GitHub editor like usual. Can you see this one:

ort.png

Actually, I believe we never perform equality checks on non-processed VCS info... what makes you believe we do?

I should have been more precise in my wording: I saw that the packages are stored in a mutableSet. I don't have all that much Kotlin experience, but quick research yielded that sets seem to be using the equals and hash methods of an object to determine whether or not 2 objects are considered equal (and thus wouldn't be inserted into the set twice). I couldn't find either of these 2 methods overridden in Project, so I assume that this makes it so that all attributes of a project are considered when determining whether or not to insert it into the set.

This should actually already be possible.

I couldn't find any way to pass --package-curations-dir to ort analyze – hence it doesn't seem to pick up such settings and then consequently fails when it detects what it considers duplicate packages (= 2 packages with the same ID).

sschuberth commented 1 month ago

Can you see this one:

Yes, that works, thanks!

all attributes of a project are considered when determining whether or not to insert it into the set.

You're right. I now get what you mean, thanks for explaining.

I couldn't find any way to pass --package-curations-dir to ort analyze

The analyzer is the "native" place that originally applies curations. As such, the analyzer takes curations from their default locations, which are all enabled curation providers:

https://github.com/oss-review-toolkit/ort/blob/ee941432f35d022c13549f57994fcf643327a414/plugins/commands/analyzer/src/main/kotlin/AnalyzerCommand.kt#L155-L168

Other command that take a --package-curations-dir only use that option to override existing curations without running the analyzer again; it's basically a work-around for https://github.com/oss-review-toolkit/ort/issues/6188 not being implemented yet.

So, if you want the analyzer to take curation from a "special" / non-default location, instead of using a --package-curations-dir CLI option you should configure one of the curation providers with a custom location:

https://github.com/oss-review-toolkit/ort/blob/ee941432f35d022c13549f57994fcf643327a414/model/src/main/resources/reference.yml#L53-L83