oss-review-toolkit / ort

A suite of tools to automate software compliance checks.
https://oss-review-toolkit.org
Apache License 2.0
1.58k stars 308 forks source link

Make all package managers use the new dependency graph format #3825

Open sschuberth opened 3 years ago

sschuberth commented 3 years ago

Currently, only the Gradle analyzer uses the new dependency graph format introduced in https://github.com/oss-review-toolkit/ort/pull/3502. We should take advantage of the new format for more (ultimately all) package manager implementations.

Package managers that still need to be migrated are unchecked in this list:

oheger-bosch commented 3 years ago

To better track the progress, I propose to use this as an umbrella ticket and create separate tickets for specific package managers.

oheger-bosch commented 3 years ago

The rework of the Maven package manager (#3895) is now done (see #3894). Unfortunately, the results from first tests are rather disappointing wrt reduction of file sizes. I don't have concrete numbers yet from a direct comparison (a test build is still running), but the result file in the new format is still quite big.

It seems that the situation for Maven projects is a bit different than for Gradle Android projects. In the latter case, a project typically has high redundancy due to the various scopes with almost identical dependency trees. A Maven module in a multi-module build in contrast has rather disjunct scopes, and packages do not appear multiple times in the dependency tree. As the dependency graph is constructed on a project-base, there is not much benefit from the new format.

There is, however, a lot of redundancy between the dependencies of the single Maven sub projects. So a promising approach could be to construct a global dependency graph for an analyzer run. This would be rather easy with the current DependencyGraphBuilder class; a new builder instance would not be created for each project, but only a single one, and there would need to be a convention to construct unique scope names qualified with project IDs.

A major problem would be that this is quite a big change in the data format of the analyzer and affects other model classes as well. Especially for projects it would be tricky to re-associate them with their part of the dependency graph.

sschuberth commented 3 years ago

There is, however, a lot of redundancy between the dependencies of the single Maven sub projects.

Oh, then I must have misunderstood something about the current new dependency graph format, as I though exactly that is already possible: The re-use of re-occurring subtrees within a single project / scope.

So for example, if we had

project
+- scope
   +- depA
   |  +- depC
   |     +- depD
   |     +- depE
   +- depB
      +- depC
         +- depD
         +- depE

then I though depC (incl. all its transitive dependencies, i.e. the whole subtree) would only be stored once, and referred to from depA and depB. Is this not the case?

oheger-bosch commented 3 years ago

Yes, within a single project, this reuse is possible, and dependencies are shared. What I meant was that there are obviously not so many duplicated elements in the dependency trees of single projects, so the graph format does not have a positive effect.

sschuberth commented 3 years ago

But would it really be such a big change to support this subtree reuse across projects? Wouldn't we basically just need to take care that identifiers for subtrees are unique across projects? And that could easily be archived by not using consecutive indices, but hashes on the subtree instances (maybe hashCode() is good enough), or?

oheger-bosch commented 3 years ago

I can do some experiments in this direction. Seems to be an interesting problem.

sschuberth commented 3 weeks ago

Here's a brief migration guide that I came up with while working on https://github.com/oss-review-toolkit/ort/pull/9073:

  1. Refactor existing code so that all metadata for a dependency is available in a single data class.
  2. Implement a DependencyHandler that returns various ORT model classes for that data class.
  3. In the specific package manager class, create a val graphBuilder = DependencyGraphBuilder(...) member variable that takes an instance of that DependencyHandler.
  4. In resolveDependencies(), call graphBuilder.addDependency(DependencyGraph.qualifyScope(...), ...) on each dependency.
  5. Change creating a Project from setting scopeDependencies to setting scopeNames = graphBuilder.scopesFor(projectId).
  6. Change the returned ProjectAnalyzerResult to take an emptySet() as packages.
  7. Override createPackageManagerResult() with PackageManagerResult(projectResults, graphBuilder.build(), graphBuilder.packages()).