oss-review-toolkit / ort

A suite of tools to automate software compliance checks.
https://oss-review-toolkit.org
Apache License 2.0
1.57k stars 308 forks source link

Improve analysis of maven based projects #7965

Open netomi opened 9 months ago

netomi commented 9 months ago

I tried to use the ORT GH action to analyse a maven project, but it took more than 30 min to run only the analysis inside a GitHub runner (https://github.com/netomi/macos-notarization-service/actions/runs/7006736431).

The project has some additional repository setup and ORT seems to try find out from which repository a dependency is coming from by trying to download the dependency from each configured repository. There seems to be also some network throttling in place when run as a GH action, when running the same analysis locally, it completed in a couple of minutes.

However, there should be a way to speed this up and I worked on improving the resolution of dependencies in maven projects. In my fork at https://github.com/netomi/ort/tree/disable-remote-verification I did some experiments to get the resolved repository from maven itself (it stores that information in the _remote.repositories file in the local cache).

With these changes the run of the analysis on GitHub could be completed in around 2min, see https://github.com/netomi/macos-notarization-service/actions/runs/7021936345).

My approach so far is quick and dirty and more a PoC, but I will be working on a PR to make this as clean as possible.

sschuberth commented 9 months ago

My approach so far is quick and dirty and more a PoC, but I will be working on a PR to make this as clean as possible.

Looking forward to your PR, thanks a lot for looking into improving things!

As a semi-related side, I've been long aware that the Maven code probably is more complex than it needs to be by now, also see https://github.com/oss-review-toolkit/ort/issues/4720.

netomi commented 9 months ago

My idea so far that worked is to let maven resolve all dependencies by itself which it can do best it seems (e.g. running a normal build is also fast, even when the artifacts are not cached). Afterwards try to get the repository that maven used to download the dependency from.

While debugging, it seems that the DefaultArtifactResolver does correctly set the repository, however this information is later lost, the artifacts that you receive from the project building request do not have a repository field being set though internally that information is known to maven.

This feels like a bug in maven, I could workaround that by registering a repositoryListener that gets called whenever an artifact is resolved. There the repository information is still present.

mnonnenmacher commented 9 months ago

I tried to use the ORT GH action to analyse a maven project, but it took more than 30 min to run only the analysis inside a GitHub runner (https://github.com/netomi/macos-notarization-service/actions/runs/7006736431).

Such a slow resolution for Maven projects is unusual and often an indication of an issue with one of the used repositories. For example, sometimes repositories are configured which are not available anymore which leads to lots of time consuming network timeouts. I had a short look at the logs and it seems that the majority of the time is spent waiting for https://repository.jboss.org/. Each request takes several seconds, sometimes more than 10. As an example:

15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection request: [route: {s}->https://repository.jboss.org:443][total available: 4; route allocated: 1 of 50; total allocated: 4 of 100]
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection leased: [id: 44][route: {s}->https://repository.jboss.org:443][total available: 3; route allocated: 1 of 50; total allocated: 4 of 100]
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-44: set socket timeout to 0
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-44: set socket timeout to 1800000
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Executing request GET /nexus/content/groups/public/org/apache/maven/maven/3.9.3/maven-3.9.3.pom HTTP/1.1
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Target auth state: UNCHALLENGED
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Proxy auth state: UNCHALLENGED
15:07:28.112 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Connection can be kept alive indefinitely
15:07:28.113 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-44: Close connection
15:07:28.113 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Connection discarded
15:07:28.113 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection released: [id: 44][route: {s}->https://repository.jboss.org:443][total available: 3; route allocated: 0 of 50; total allocated: 3 of 100]

Requests to other repositories take only a few milliseconds. I wonder if the JBoss repository is always so slow or if this is a temporary issue. What I would investigate is:

The project has some additional repository setup and ORT seems to try find out from which repository a dependency is coming from by trying to download the dependency from each configured repository.

ORT only downloads the POM files for dependencies, for other artifacts only the checksums are downloaded. See: https://github.com/oss-review-toolkit/ort/blob/a464678e9f6f9c3fb298ba2be265dbcb28ca4922/plugins/package-managers/maven/src/main/kotlin/utils/MavenSupport.kt#L862-L893

However, there should be a way to speed this up and I worked on improving the resolution of dependencies in maven projects. In my fork at https://github.com/netomi/ort/tree/disable-remote-verification I did some experiments to get the resolved repository from maven itself (it stores that information in the _remote.repositories file in the local cache).

This could be an issue with setups where the local Maven repository is kept for ORT scans of different Maven projects. If those projects use different repositories for the same artifacts (like a mirror of Maven central) the information from the local cache might be wrong, because Maven did not attempt a new resolution of artifacts that were already cached.

netomi commented 9 months ago

The project only has one additional repository setup: repo.eclipse.org for one dependency. The jboss repository probably comes from another dependency's pom and is not used generally. One negative side-effect of the current approach is that ORT tries to download from the repositories that are defined for the project in the order as it retrieved from the maven API. In this project, the repo.eclipse.org repository is first in the list, thus is always tried first to resolve the artifact, which is a waste ofc.

I understand that the information in the local maven cache might not reflect the real situation, however using ort like it is currently done does not seem to be working as expected when used in a GitHub Action. The existing ORT action does already some caching but it only affects the ORT job so I would not expect any negative side-effect of it. Maybe a special option could be added to enable this optimization when you are sure that the local maven cache will not have side-effects from other projects?

netomi commented 9 months ago

ORT only downloads the POM files for dependencies, for other artifacts only the checksums are downloaded. See:

I dont think that this is true, looking at https://github.com/oss-review-toolkit/ort/blob/a464678e9f6f9c3fb298ba2be265dbcb28ca4922/plugins/package-managers/maven/src/main/kotlin/utils/MavenSupport.kt#L722-L738 and the code in requestRemoteArtifact looks like that the binary and even source archives are tried to be downloaded from all repositories to determine from which repository it comes from.

mnonnenmacher commented 9 months ago

The project only has one additional repository setup: repo.eclipse.org for one dependency. The jboss repository probably comes from another dependency's pom and is not used generally. One negative side-effect of the current approach is that ORT tries to download from the repositories that are defined for the project in the order as it retrieved from the maven API. In this project, the repo.eclipse.org repository is first in the list, thus is always tried first to resolve the artifact, which is a waste ofc.

This is exactly what Maven does during dependency resolution. If the POM file of a dependency defines repositories, they are appended to the list of repositories and used to resolve the transitive dependencies of that dependency (but not for other branches of the dependency tree).

I understand that the information in the local maven cache might not reflect the real situation, however using ort like it is currently done does not seem to be working as expected when used in a GitHub Action. The existing ORT action does already some caching but it only affects the ORT job so I would not expect any negative side-effect of it. Maybe a special option could be added to enable this optimization when you are sure that the local maven cache will not have side-effects from other projects?

Yes, making this optimization optional would be a requirement for some ORT users, but I agree that for the GitHub action it could be a useful option.

mnonnenmacher commented 9 months ago

ORT only downloads the POM files for dependencies, for other artifacts only the checksums are downloaded. See:

I dont think that this is true, looking at

https://github.com/oss-review-toolkit/ort/blob/a464678e9f6f9c3fb298ba2be265dbcb28ca4922/plugins/package-managers/maven/src/main/kotlin/utils/MavenSupport.kt#L722-L738 and the code in requestRemoteArtifact looks like that the binary and even source archives are tried to be downloaded from all repositories to determine from which repository it comes from.

The requestRemoteArtifact function returns on the first hit, see: https://github.com/oss-review-toolkit/ort/blob/a9515337b079c3db7c37e594e0c16b68a1dfe3b8/plugins/package-managers/maven/src/main/kotlin/utils/MavenSupport.kt#L635-L638

It also does not actually download the artifact but only does an existence check, see: https://github.com/oss-review-toolkit/ort/blob/a9515337b079c3db7c37e594e0c16b68a1dfe3b8/plugins/package-managers/maven/src/main/kotlin/utils/MavenSupport.kt#L583-L584

netomi commented 9 months ago

The project only has one additional repository setup: repo.eclipse.org for one dependency. The jboss repository probably comes from another dependency's pom and is not used generally. One negative side-effect of the current approach is that ORT tries to download from the repositories that are defined for the project in the order as it retrieved from the maven API. In this project, the repo.eclipse.org repository is first in the list, thus is always tried first to resolve the artifact, which is a waste ofc.

This is exactly what Maven does during dependency resolution. If the POM file of a dependency defines repositories, they are appended to the list of repositories and used to resolve the transitive dependencies of that dependency (but not for other branches of the dependency tree).

The order how these repos are tried to resolve an artifact matters. If any artifact is first tried on a repo that only hosts a single dependency but mavencentral is tried last, you are wasting a lot of connections while trying to resolve your artifacts, which in the case of GitHub can have all kinds of side-effect (runner might get throttled). It looks like the maven API that ORT uses returns the repos in LIFO order, so mavencentral is defined in the super pom, but in the list of repos its always last, so ORT tries mavencentral as last resort which is kind of counter-intuitive as mavencentral should be the first option. But thats just what I have seen from debugging this stuff for 1-2 days, will need certainly more time to understand what is going on.

Yes, making this optimization optional would be a requirement for some ORT users, but I agree that for the GitHub action it could be a useful option.

Will need to think about it, but my idea is that maven does a very good job to resolve artifacts and has proven this over many years, any custom resolution mechanism on top of it will be most likely be worse in terms of performance, especially in constrained environements like CI systems.

sschuberth commented 9 months ago

ORT tries mavencentral as last resort which is kind of counter-intuitive as mavencentral should be the first option.

As @mnonnenmacher pointed out, ORT does here the same thing as Maven itself would do: Repositories that are defined "closer" to the project / dependency in question are tried first. Otherwise, projects would have no chance of overriding an artifact that also lives in Maven Central.

netomi commented 9 months ago

ok ty, I did some more digging into the logs from ORT and maven for resolving the dependencies and the way the artifacts are resolved is pretty much the same. Maven seems to do it much faster though. Will need to understand what is different. Connections are dropped / released quite often and have to be re-established, which could trigger some protection mechanisms on some repos.