oss-review-toolkit / ort

A suite of tools to automate software compliance checks.
https://oss-review-toolkit.org
Apache License 2.0

Create a command line flag to specify the maximum depth of analysis #5626

Closed (dgutson closed this issue 1 year ago)

dgutson commented 2 years ago

We sometimes need to get the 1st-level (direct) dependencies of a project. I think that setting a --max-depth N could be helpful, N being 1 in the case of direct dependencies. It could also help run ORT faster when resolving issues "iteratively" (e.g. by depth level): first level 1, then level 2, and so on. If the maintainers don't think the ability to set an arbitrary depth is useful, a plain --direct-deps-only flag would also work for us.

dgutson commented 2 years ago

(FWIW we're currently working on this, so feel free to assign this to @qequ)

fviernau commented 2 years ago

@dgutson

Would you mind sharing the use case, e.g. for what purpose is the analysis result needed?

I'm asking because, if the result is used to evaluate license compliance, the definition of depth may be a bit imprecise. For example, if you have a software project and you factor out a component X into a separate library, then the direct dependencies of X, which were first-level before, become second-level dependencies of the project. So whether a package is reported or not would depend somewhat on the internal submodule structure of the project.

dgutson commented 2 years ago

Hi @fviernau, sure. I see what you mean. We have "third party libraries" compliance audits, where we need to fill in a number of details about the libraries we use. Choosing which libraries to use is what is under our direct control (as opposed to modifying those external libraries), so that is our action space. We could do this by post-processing the report, but that requires a script and wastes processing power and time, since ORT runs on libraries we don't need analyzed anyway (we only need the "first level"). Another CLI option could tell ORT where/which the "internal" dependencies are, so the depth count could start outside of that set.

fviernau commented 2 years ago

Hi @dgutson

here's a list of approaches to consider; I tried to make it rather complete, even if I don't really like some of them:

  1. Add logic to each package manager integration to handle the maxDepth. I have doubts about this due to the added complexity / maintenance cost, in particular the cost / benefit ratio, as this would IIUC not lead to a significant speed-up for several (most?) of the package managers.
  2. Keep the package manager integrations as they are, but make the analyzer command filter out the packages after each respective package manager integration has determined the full dependency tree. This is cheap to maintain and reduces the size of the analyzer result, so the succeeding stages (scanner, evaluator, reporter) speed up and the report contains just what you need.
  3. Same as 2., but implement a separate command to filter a given analyzer result by depth, or maybe even make it highly customizable: pass a .kts script which decides whether a package gets filtered out (a rough sketch of this idea follows below).
  4. Add a depth parameter to the scan command: this only saves effort for scanning; the full dependency tree would still be contained in the report.

All in all, what do you think about option 3? Would this address your problem sufficiently?
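
For illustration, here is a rough, self-contained Kotlin sketch of the depth-filtering idea behind option 3. It deliberately does not use ORT's actual API; the package/edge model is a simplified, hypothetical stand-in, and the point is only to show how a minimum-depth cut-off over an already resolved dependency tree could work:

```kotlin
// Hypothetical simplified model: package identifier -> identifiers of its direct dependencies.
typealias DependencyEdges = Map<String, Set<String>>

/**
 * Computes the minimum depth of every package reachable from the project's direct
 * dependencies (depth 1) via breadth-first search and keeps only those packages whose
 * depth does not exceed [maxDepth].
 */
fun filterByDepth(directDependencies: Set<String>, edges: DependencyEdges, maxDepth: Int): Set<String> {
    val depths = mutableMapOf<String, Int>()
    val queue = ArrayDeque(directDependencies.map { it to 1 })

    while (queue.isNotEmpty()) {
        val (pkg, depth) = queue.removeFirst()

        if (depth > maxDepth) continue
        if (depths[pkg]?.let { it <= depth } == true) continue

        depths[pkg] = depth
        edges[pkg].orEmpty().forEach { dependency -> queue.addLast(dependency to depth + 1) }
    }

    return depths.keys
}

fun main() {
    // Tiny example: "A" and "B" are direct dependencies, "C" is only reachable at depth 2.
    val edges = mapOf("A" to setOf("C"), "B" to emptySet<String>(), "C" to emptySet())

    println(filterByDepth(setOf("A", "B"), edges, maxDepth = 1)) // [A, B]
    println(filterByDepth(setOf("A", "B"), edges, maxDepth = 2)) // [A, B, C]
}
```

The same predicate could just as well live in a user-supplied .kts script that receives each package together with its computed depth and returns whether to keep it.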

dgutson commented 2 years ago

Hi @fviernau,

I think that, based on your initial comment, we first need to tackle the separation between "internal" and "external" libs. This is a key underlying problem in our use case: internal deps should not count in the depth calculation. And this should be provided in a code-less way (a list of repos, or of folders in the case of a monorepo, rather than a script). Maybe I should create a separate issue for this?

Only once ORT can tell what is internal vs. external can we start addressing the "direct dependency" (or depth-N) issue.

That said, the direction seems to be towards your third alternative.

tsteenbe commented 2 years ago

If we provide a parameter to reduce the maximum depth of analysis, this would change one of the core assumptions of ORT: that everything is analyzed (but not necessarily scanned). If we change this assumption, we would need to update all ORT results to record the maximum depth of analysis, and we would need to update all reports to communicate this fact, which is not easy to do in SPDX or CycloneDX.

dgutson commented 2 years ago

@tsteenbe my main interest here is direct dependencies (depth 1), because that is what we can control; the generalization to depth N is just that, a generalization, and I don't have such a strong use case for it. This use case is not about generating an SBOM.

tsteenbe commented 2 years ago

@dgutson I understand your use case, and we want to make sure we can support it without breaking existing functionality. For me the simplest way to implement this would be a skipTransitiveDependenciesAfterLevel = [int] setting within the scanner configuration in ort.conf. Everything would still be analyzed, but nothing beyond the given level would be scanned, which means we don't break existing assumptions and don't need to touch analyzer or reporter code.
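
As a minimal sketch of what that proposal could mean in code (skipTransitiveDependenciesAfterLevel does not exist in ORT today, and the types below are hypothetical stand-ins rather than ORT's configuration classes):

```kotlin
// Hypothetical stand-in for the relevant part of the scanner configuration in ort.conf.
data class ScannerDepthConfig(val skipTransitiveDependenciesAfterLevel: Int? = null)

// Hypothetical per-package view: the minimum depth at which the package occurs in the dependency tree.
data class PackageAtDepth(val id: String, val minDepth: Int)

/** Everything stays in the analyzer result; this only decides which packages get scanned. */
fun selectPackagesToScan(packages: List<PackageAtDepth>, config: ScannerDepthConfig): List<PackageAtDepth> {
    val cutoff = config.skipTransitiveDependenciesAfterLevel ?: return packages
    return packages.filter { it.minDepth <= cutoff }
}

fun main() {
    val packages = listOf(PackageAtDepth("A", 1), PackageAtDepth("B", 1), PackageAtDepth("C", 2))

    // With the option unset everything is scanned; with level 1 only direct dependencies are.
    println(selectPackagesToScan(packages, ScannerDepthConfig()).map { it.id })                                         // [A, B, C]
    println(selectPackagesToScan(packages, ScannerDepthConfig(skipTransitiveDependenciesAfterLevel = 1)).map { it.id }) // [A, B]
}
```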

fviernau commented 2 years ago

@dgutson I understand your use case, and we want to make sure we can support it without breaking existing functionality. For me the simplest way to implement this would be a skipTransitiveDependenciesAfterLevel = [int] setting within the scanner configuration in ort.conf. Everything would still be analyzed, but nothing beyond the given level would be scanned, which means we don't break existing assumptions and don't need to touch analyzer or reporter code.

That would be an optimization of the execution time of scans, but the report would still contain entries for all dependencies. I'm not sure if @dgutson wants the "level > X" dependencies to completely disappear. @dgutson can you clarify?

dgutson commented 2 years ago

I actually don't care about the level > X dependencies; however, the "internal vs. external" distinction still holds (internal dependencies should not count towards X). Also, running ORT this way is just one use case, so I think changing the configuration each time I need to use ORT for a different use case is not good UX. This is why I suggested a command line flag.

sschuberth commented 2 years ago

I think that setting a --max-depth N could be helpful, N being 1 in the case of direct dependencies.

While I also understand your use case @dgutson, I agree with @tsteenbe's comment that this goes against one of the fundamental design concepts of ORT: we always try to get all the information we can, and then eventually filter down or consider only the relevant parts of the data in later steps. This way ORT results are much more transparent and reproducible, and you do not need to record which parts of a project you did not analyze / scan in order to avoid confusion when comparing results from analyzing / scanning the same project with different ORT configurations.

So I would be 👎 on introducing a --max-depth N option or one that limits to direct dependencies.

However...

we first need to tackle the separation between "internal" and "external" libs. This is a key underlying problem in our use case: internal deps should not count in the depth calculation. [...] Maybe I should create a separate issue for this?

I agree that we need to make it easy in ORT to distinguish between internal and external dependencies, as different policy rules usually apply to these, and to me this smells mostly like https://github.com/oss-review-toolkit/ort/issues/4892.

But I do not yet get how the depth calculation relates to this. Do you really want to look at different depths of transitive dependencies depending on whether a direct dependency is internal or external? Or would it be enough to just distinguish between internal and external and skip internal dependencies completely, on the assumption that they were already scanned by ORT as root projects at some point in time and thus do not need to be scanned as dependencies at all?

sschuberth commented 2 years ago

So I would be 👎 on introducing a --max-depth N option or one that limits to direct dependencies.

Note to self: meanwhile this actually was implemented specifically for the DotNet / NuGet analyzers to work around an issue with dependency resolution.

sschuberth commented 1 year ago

Note to self: meanwhile this actually was implemented specifically for the DotNet / NuGet analyzers to work around an issue with dependency resolution.

The old DotNet / NuGet analyzers were replaced with a completely new approach that does not have the depth-limitation workaround anymore. In general, I don't see us implementing this; it is simply too much work to do for all package managers individually. Post-processing the result for the level of dependencies one is interested in is the way to go, despite the "[waste of] processing power and time" mentioned here.
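
For reference, a rough post-processing sketch along those lines (it is not part of ORT): it reads a serialized analyzer result with SnakeYAML and prints only the direct dependencies per project scope. The field names assume the classic scope/dependency-tree layout of analyzer results; results using the newer dependency-graph encoding would need a different traversal.

```kotlin
import java.io.File

import org.yaml.snakeyaml.Yaml

fun main(args: Array<String>) {
    // Path to an analyzer result; defaults to the conventional file name.
    val resultFile = File(args.firstOrNull() ?: "analyzer-result.yml")
    val root = Yaml().load<Map<String, Any?>>(resultFile.readText())

    val result = (root["analyzer"] as? Map<*, *>)?.get("result") as? Map<*, *> ?: return
    val projects = result["projects"] as? List<*> ?: return

    projects.filterIsInstance<Map<*, *>>().forEach { project ->
        println("Project: ${project["id"]}")

        (project["scopes"] as? List<*>).orEmpty().filterIsInstance<Map<*, *>>().forEach { scope ->
            // Only the top-level entries of a scope are direct (first-level) dependencies.
            val directDependencies = (scope["dependencies"] as? List<*>).orEmpty()
                .filterIsInstance<Map<*, *>>()
                .map { it["id"] }

            println("  Scope ${scope["name"]}: ${directDependencies.joinToString()}")
        }
    }
}
```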

dgutson commented 1 year ago

That's OK, but just for the record: ORT takes a full night (more than 6 hours) to do the job for part of our project.

I'm curious: how long does it take for other users?

sschuberth commented 1 year ago

I'm curious: how long does it take for other users?

Feel free to start a discussion.