Closed basil closed 1 year ago
Quick report of the actions performed Friday 30 of March by the Jenkins Infra team:
=> there is no "easy and obvious" cause
I heard @lemeurherve mentioning the builds were back to normal despite https://github.com/jenkinsci/bom/pull/1921 but I'm not sure: I'll let him report here monday (as the current main build of the BOM seems to succeed: we can wait for the week-end to pass).
Next steps:
sendfile
mode and tcp_nopush
couldn't have impact with the Maven's HTTP client behavior (see. https://github.com/jenkins-infra/helm-charts/blob/b367f32aab79fd7a9f1d9d3e8d7968b4ced63f0b/charts/artifact-caching-proxy/templates/nginx-default-configmap.yaml#L32-L37).there is no "easy and obvious" cause
Did we consider the possibility of file descriptor exhaustion? Not sure what RLIMIT_NOFILE
is on the nginx(8)
process, but worker_connections
(which is bounded by RLIMIT_NOFILE
) is set to 1024. Maven's aether.connector.basic.threads
defaults to 5, meaning that a proxy with 1024 workers could handle ~200 simultaneous Maven builds. That seems to be in the same order of magnitude as the number of concurrent PCT jobs we run. See https://nginx.org/en/docs/http/ngx_http_stub_status_module.html for how to get this information from nginx(8)
, or simply run ls -1 /proc/${PID}/fd | wc -l
against the nginx(8)
worker (not primary!) process during a busy BOM build for a crude approximation. If file descriptors are exhausted you could increase RLIMIT_NOFILE
and/or worker_connections
, or apply backpressure by passing e.g. -Daether.connector.basic.threads=2
to Maven to halve the number of concurrent requests.
I heard @lemeurherve mentioning the builds were back to normal despite jenkinsci/bom#1921 but I'm not sure: I'll let him report here monday (as the current main build of the BOM seems to succeed: we can wait for the week-end to pass).
I did not see any report today, but I am assuming the reasoning is that the Maven upgrade restored retries and therefore stability. That is possible; it could be confirmed by reverting jenkinsci/bom#1921 and seeing if the problem returns. I have never seen a failure with the old version of Maven and repo.jenkins-ci.org
(Artifactory) though, only with the artifact caching proxy. In any case, the Maven retry behavior is simply a workaround on top of the real problem: it should not be necessary to retry downloads in the first place. They should suceed on the first try. To phrase this another way, the artifact caching proxy is a server built and hosted by us entirely for our own needs (and not accessible to anyone else), so we ought to be able to engineer it to meet our needs perfectly on the first try. This can be achieved using suitable hardware & networking resources combined with OS and userland configuration, either scaling horizontally (e.g., multiple proxies with a load balancer) or vertically (e.g., a single proxy with sufficient resources to meet demand).
Closing as the "unreliable" behavior (which is BOM-only) is tracked in #3481
A theory of file descriptor exhaustion is consistent with the pathology being only observable in BOM, as BOM is a pretty good stress test for the web server.
I'm not sure about the RLIMIT_NOFILE
: the worker_connections is set to 1024
, that is true, but each ACP instance has 4 workers so 4096, and we have 2 ACP replicas per "cloud" with RR load balanced requests.
Also, Nginx logs this error when it happens (faced it before many times) so I'm surprised that we did not see any error about this.
However, I realize we did NOT check for the ingress (which happen to be Nginx) configuration that might be the culprit (we do not hit ACP directly).
Anyway that could be worth it to:
Also, there are fundamentals questions at stake:
Please note that with https://github.com/jenkins-infra/helpdesk/issues/3487, investing too much time in DigitalOcean might be a waste as we are not sure if we'll get more credits.
Metrics on the ACP side confirms that the 30th of March, the cumulated amount of active connections on the whole ACP for DigitalOcean never crossed the threshold of 400 simultaneously.
However, it seems there is a lot of connections "waiting" which goes into the direction of https://github.com/jenkins-infra/helm-charts/pull/471
Is it worth trying an artifactory instance running on each cloud provider as a proxy instead of nginx?
I assume nginx was used as that's what was used previously?
Is it worth trying an artifactory instance running on each cloud provider as a proxy instead of nginx?
I assume nginx was used as that's what was used previously?
Yes, but not only:
I'm really bothered by this issue: there is nothing pointing finger at ACP specifically (otherwise the issue would have been seen in AWS as well since quite some time) so it really looks a weird issue.
The maintenance of Artifactory is a lot of time to spend on. We can't compare to repo.jenkins-ci.org which has engineering team behind and a decade worth of tuning (and a cloud bill through the roof...).
We've ran it in a pod on Kubernetes for about 4 years and can count on one hand the number of times we've had to touch it. What maintenance are you thinking of?
The maintenance of Artifactory is a lot of time to spend on. We can't compare to repo.jenkins-ci.org which has engineering team behind and a decade worth of tuning (and a cloud bill through the roof...).
We've ran it in a pod on Kubernetes for about 4 years and can count on one hand the number of times we've had to touch it. What maintenance are you thinking of?
these are the maintenance I'm thinking of.
(edit)
I tried enabling the artifact caching proxy in ATH and got the same error, even with Maven 3.9.1. So Maven 3.9.1 is not the solution.
[2023-04-05T01:40:59.059Z] + mvn -B -V -e -ntp -s /home/jenkins/workspace/_acceptance-test-harness_PR-1097@tmp/config4855491279108139757tmp test -Dmaven.repo.local=/home/jenkins/workspace/_acceptance-test-harness_PR-1097@tmp/m2repo -Dmaven.test.failure.ignore=true -DforkCount=1
[2023-04-05T01:40:59.314Z] Apache Maven 3.9.1 (2e178502fcdbffc201671fb2537d0cb4b4cc58f8)
[2023-04-05T01:40:59.314Z] Maven home: /opt/maven
[2023-04-05T01:40:59.314Z] Java version: 11.0.18, vendor: Ubuntu, runtime: /usr/lib/jvm/java-11-openjdk-amd64
[2023-04-05T01:40:59.314Z] Default locale: en_US, platform encoding: ANSI_X3.4-1968
[2023-04-05T01:40:59.314Z] OS name: "linux", version: "5.15.0-1035-azure", arch: "amd64", family: "unix"
[2023-04-05T01:41:01.871Z] [ERROR] Error executing Maven.
[2023-04-05T01:41:01.871Z] org.apache.maven.cli.internal.ExtensionResolutionException: Extension io.jenkins.tools.incrementals:git-changelist-maven-extension:1.6 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.apache.httpcomponents:httpclient:jar:4.3.6 (absent): Could not transfer artifact org.apache.httpcomponents:httpclient:jar:4.3.6 from/to azure-proxy (https://repo.azure.jenkins.io/public/): Connection reset
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.resolveExtension (BootstrapCoreExtensionManager.java:182)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.resolveCoreExtensions (BootstrapCoreExtensionManager.java:115)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.loadCoreExtensions (BootstrapCoreExtensionManager.java:99)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.MavenCli.loadCoreExtensions (MavenCli.java:677)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.MavenCli.container (MavenCli.java:548)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:266)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.MavenCli.main (MavenCli.java:195)
[2023-04-05T01:41:01.871Z] at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
[2023-04-05T01:41:01.871Z] at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
[2023-04-05T01:41:01.871Z] at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
[2023-04-05T01:41:01.871Z] at java.lang.reflect.Method.invoke (Method.java:566)
[2023-04-05T01:41:01.871Z] at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
[2023-04-05T01:41:01.871Z] at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
[2023-04-05T01:41:01.871Z] at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
[2023-04-05T01:41:01.871Z] at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[2023-04-05T01:41:01.871Z] Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: The following artifacts could not be resolved: org.apache.httpcomponents:httpclient:jar:4.3.6 (absent): Could not transfer artifact org.apache.httpcomponents:httpclient:jar:4.3.6 from/to azure-proxy (https://repo.azure.jenkins.io/public/): Connection reset
[2023-04-05T01:41:01.871Z] at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve (DefaultArtifactResolver.java:456)
[2023-04-05T01:41:01.871Z] at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts (DefaultArtifactResolver.java:261)
[2023-04-05T01:41:01.871Z] at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies (DefaultRepositorySystem.java:352)
[2023-04-05T01:41:01.871Z] at org.apache.maven.plugin.internal.DefaultPluginDependenciesResolver.resolveInternal (DefaultPluginDependenciesResolver.java:212)
[2023-04-05T01:41:01.871Z] at org.apache.maven.plugin.internal.DefaultPluginDependenciesResolver.resolveCoreExtension (DefaultPluginDependenciesResolver.java:148)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.resolveExtension (BootstrapCoreExtensionManager.java:174)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.resolveCoreExtensions (BootstrapCoreExtensionManager.java:115)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.loadCoreExtensions (BootstrapCoreExtensionManager.java:99)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.MavenCli.loadCoreExtensions (MavenCli.java:677)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.MavenCli.container (MavenCli.java:548)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:266)
[2023-04-05T01:41:01.871Z] at org.apache.maven.cli.MavenCli.main (MavenCli.java:195)
[2023-04-05T01:41:01.871Z] at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
[2023-04-05T01:41:01.871Z] at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
[2023-04-05T01:41:01.871Z] at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
[2023-04-05T01:41:01.871Z] at java.lang.reflect.Method.invoke (Method.java:566)
[2023-04-05T01:41:01.871Z] at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
[2023-04-05T01:41:01.871Z] at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
[2023-04-05T01:41:01.871Z] at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
[2023-04-05T01:41:01.871Z] at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[2023-04-05T01:41:01.871Z] Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact org.apache.httpcomponents:httpclient:jar:4.3.6 from/to azure-proxy (https://repo.azure.jenkins.io/public/): Connection reset
[2023-04-05T01:41:01.871Z] at org.eclipse.aether.connector.basic.ArtifactTransportListener.transferFailed (ArtifactTransportListener.java:44)
[2023-04-05T01:41:01.871Z] at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:417)
[2023-04-05T01:41:01.871Z] at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:66)
[2023-04-05T01:41:01.871Z] at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
[2023-04-05T01:41:01.871Z] at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
[2023-04-05T01:41:01.871Z] at java.lang.Thread.run (Thread.java:829)
[2023-04-05T01:41:01.871Z] Caused by: java.net.SocketException: Connection reset
[2023-04-05T01:41:01.871Z] at java.net.SocketInputStream.read (SocketInputStream.java:186)
[2023-04-05T01:41:01.871Z] at java.net.SocketInputStream.read (SocketInputStream.java:140)
[2023-04-05T01:41:01.871Z] at sun.security.ssl.SSLSocketInputRecord.read (SSLSocketInputRecord.java:484)
[2023-04-05T01:41:01.871Z] at sun.security.ssl.SSLSocketInputRecord.readFully (SSLSocketInputRecord.java:467)
[2023-04-05T01:41:01.871Z] at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord (SSLSocketInputRecord.java:243)
[2023-04-05T01:41:01.871Z] at sun.security.ssl.SSLSocketInputRecord.decode (SSLSocketInputRecord.java:181)
[2023-04-05T01:41:01.871Z] at sun.security.ssl.SSLTransport.decode (SSLTransport.java:111)
[2023-04-05T01:41:01.871Z] at sun.security.ssl.SSLSocketImpl.decode (SSLSocketImpl.java:1510)
[2023-04-05T01:41:01.871Z] at sun.security.ssl.SSLSocketImpl.readApplicationRecord (SSLSocketImpl.java:1477)
[2023-04-05T01:41:01.871Z] at sun.security.ssl.SSLSocketImpl$AppInputStream.read (SSLSocketImpl.java:1066)
[2023-04-05T01:41:01.871Z] at org.apache.http.impl.io.SessionInputBufferImpl.streamRead (SessionInputBufferImpl.java:137)
[2023-04-05T01:41:01.872Z] at org.apache.http.impl.io.SessionInputBufferImpl.read (SessionInputBufferImpl.java:197)
[2023-04-05T01:41:01.872Z] at org.apache.http.impl.io.ContentLengthInputStream.read (ContentLengthInputStream.java:176)
[2023-04-05T01:41:01.872Z] at org.apache.http.conn.EofSensorInputStream.read (EofSensorInputStream.java:135)
[2023-04-05T01:41:01.872Z] at org.apache.http.conn.EofSensorInputStream.read (EofSensorInputStream.java:148)
[2023-04-05T01:41:01.872Z] at org.eclipse.aether.spi.connector.transport.AbstractTransporter.copy (AbstractTransporter.java:163)
[2023-04-05T01:41:01.872Z] at org.eclipse.aether.spi.connector.transport.AbstractTransporter.utilGet (AbstractTransporter.java:94)
[2023-04-05T01:41:01.872Z] at org.eclipse.aether.transport.http.HttpTransporter.access$200 (HttpTransporter.java:96)
[2023-04-05T01:41:01.872Z] at org.eclipse.aether.transport.http.HttpTransporter$EntityGetter.handle (HttpTransporter.java:565)
[2023-04-05T01:41:01.872Z] at org.eclipse.aether.transport.http.HttpTransporter.execute (HttpTransporter.java:374)
[2023-04-05T01:41:01.872Z] at org.eclipse.aether.transport.http.HttpTransporter.implGet (HttpTransporter.java:319)
[2023-04-05T01:41:01.872Z] at org.eclipse.aether.spi.connector.transport.AbstractTransporter.get (AbstractTransporter.java:63)
[2023-04-05T01:41:01.872Z] at org.eclipse.aether.connector.basic.BasicRepositoryConnector$GetTaskRunner.runTask (BasicRepositoryConnector.java:482)
[2023-04-05T01:41:01.872Z] at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:414)
[2023-04-05T01:41:01.872Z] at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:66)
[2023-04-05T01:41:01.872Z] at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
[2023-04-05T01:41:01.872Z] at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
[2023-04-05T01:41:01.872Z] at java.lang.Thread.run (Thread.java:829)
Seeing the same error in ATH even in Azure with Maven 3.9.1 in https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1097/10/ -- giving up on enabling artifact caching proxy for ATH.
[2023-04-05T02:54:23.756Z] + eval export BROWSER_DISPLAY=:42 export DISPLAY=:42
[2023-04-05T02:54:23.756Z] ++ export BROWSER_DISPLAY=:42 export DISPLAY=:42
[2023-04-05T02:54:23.756Z] ++ BROWSER_DISPLAY=:42
[2023-04-05T02:54:23.756Z] ++ DISPLAY=:42
[2023-04-05T02:54:23.756Z] + java -version
[2023-04-05T02:54:23.756Z] openjdk version "11.0.18" 2023-01-17
[2023-04-05T02:54:23.756Z] OpenJDK Runtime Environment (build 11.0.18+10-post-Ubuntu-0ubuntu122.04)
[2023-04-05T02:54:23.756Z] OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)
[2023-04-05T02:54:23.756Z] + run.sh firefox lts -Dmaven.repo.local=/home/jenkins/workspace/_acceptance-test-harness_PR-1097@tmp/m2repo -Dmaven.test.failure.ignore=true -DforkCount=1
[2023-04-05T02:54:23.756Z] + BROWSER=firefox
[2023-04-05T02:54:23.756Z] + JENKINS_WAR=target/jenkins-war.war
[2023-04-05T02:54:23.756Z] + mvn -B -V -e -ntp -s /home/jenkins/workspace/_acceptance-test-harness_PR-1097@tmp/config11205400074458065414tmp test -Plts -Dmaven.repo.local=/home/jenkins/workspace/_acceptance-test-harness_PR-1097@tmp/m2repo -Dmaven.test.failure.ignore=true -DforkCount=1
[2023-04-05T02:54:24.022Z] Apache Maven 3.9.1 (2e178502fcdbffc201671fb2537d0cb4b4cc58f8)
[2023-04-05T02:54:24.022Z] Maven home: /opt/maven
[2023-04-05T02:54:24.022Z] Java version: 11.0.18, vendor: Ubuntu, runtime: /usr/lib/jvm/java-11-openjdk-amd64
[2023-04-05T02:54:24.022Z] Default locale: en_US, platform encoding: ANSI_X3.4-1968
[2023-04-05T02:54:24.022Z] OS name: "linux", version: "5.15.0-1035-azure", arch: "amd64", family: "unix"
[2023-04-05T02:54:26.050Z] [ERROR] Error executing Maven.
[2023-04-05T02:54:26.050Z] org.apache.maven.cli.internal.ExtensionResolutionException: Extension io.jenkins.tools.incrementals:git-changelist-maven-extension:1.6 or one of its dependencies could not be resolved: The following artifacts could not be resolved: com.google.inject:guice:jar:5.1.0 (absent): Could not transfer artifact com.google.inject:guice:jar:5.1.0 from/to azure-proxy (https://repo.azure.jenkins.io/public/): Connection reset
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.resolveExtension (BootstrapCoreExtensionManager.java:182)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.resolveCoreExtensions (BootstrapCoreExtensionManager.java:115)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.loadCoreExtensions (BootstrapCoreExtensionManager.java:99)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.MavenCli.loadCoreExtensions (MavenCli.java:677)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.MavenCli.container (MavenCli.java:548)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:266)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.MavenCli.main (MavenCli.java:195)
[2023-04-05T02:54:26.050Z] at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
[2023-04-05T02:54:26.050Z] at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
[2023-04-05T02:54:26.050Z] at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
[2023-04-05T02:54:26.050Z] at java.lang.reflect.Method.invoke (Method.java:566)
[2023-04-05T02:54:26.050Z] at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
[2023-04-05T02:54:26.050Z] at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
[2023-04-05T02:54:26.050Z] at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
[2023-04-05T02:54:26.050Z] at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[2023-04-05T02:54:26.050Z] Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: The following artifacts could not be resolved: com.google.inject:guice:jar:5.1.0 (absent): Could not transfer artifact com.google.inject:guice:jar:5.1.0 from/to azure-proxy (https://repo.azure.jenkins.io/public/): Connection reset
[2023-04-05T02:54:26.050Z] at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve (DefaultArtifactResolver.java:456)
[2023-04-05T02:54:26.050Z] at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts (DefaultArtifactResolver.java:261)
[2023-04-05T02:54:26.050Z] at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies (DefaultRepositorySystem.java:352)
[2023-04-05T02:54:26.050Z] at org.apache.maven.plugin.internal.DefaultPluginDependenciesResolver.resolveInternal (DefaultPluginDependenciesResolver.java:212)
[2023-04-05T02:54:26.050Z] at org.apache.maven.plugin.internal.DefaultPluginDependenciesResolver.resolveCoreExtension (DefaultPluginDependenciesResolver.java:148)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.resolveExtension (BootstrapCoreExtensionManager.java:174)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.resolveCoreExtensions (BootstrapCoreExtensionManager.java:115)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.internal.BootstrapCoreExtensionManager.loadCoreExtensions (BootstrapCoreExtensionManager.java:99)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.MavenCli.loadCoreExtensions (MavenCli.java:677)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.MavenCli.container (MavenCli.java:548)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:266)
[2023-04-05T02:54:26.050Z] at org.apache.maven.cli.MavenCli.main (MavenCli.java:195)
[2023-04-05T02:54:26.050Z] at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
[2023-04-05T02:54:26.050Z] at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
[2023-04-05T02:54:26.050Z] at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
[2023-04-05T02:54:26.050Z] at java.lang.reflect.Method.invoke (Method.java:566)
[2023-04-05T02:54:26.051Z] at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
[2023-04-05T02:54:26.051Z] at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
[2023-04-05T02:54:26.051Z] at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
[2023-04-05T02:54:26.051Z] at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[2023-04-05T02:54:26.051Z] Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact com.google.inject:guice:jar:5.1.0 from/to azure-proxy (https://repo.azure.jenkins.io/public/): Connection reset
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.connector.basic.ArtifactTransportListener.transferFailed (ArtifactTransportListener.java:44)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:417)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:66)
[2023-04-05T02:54:26.051Z] at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
[2023-04-05T02:54:26.051Z] at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
[2023-04-05T02:54:26.051Z] at java.lang.Thread.run (Thread.java:829)
[2023-04-05T02:54:26.051Z] Caused by: java.net.SocketException: Connection reset
[2023-04-05T02:54:26.051Z] at java.net.SocketInputStream.read (SocketInputStream.java:186)
[2023-04-05T02:54:26.051Z] at java.net.SocketInputStream.read (SocketInputStream.java:140)
[2023-04-05T02:54:26.051Z] at sun.security.ssl.SSLSocketInputRecord.read (SSLSocketInputRecord.java:484)
[2023-04-05T02:54:26.051Z] at sun.security.ssl.SSLSocketInputRecord.readFully (SSLSocketInputRecord.java:467)
[2023-04-05T02:54:26.051Z] at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord (SSLSocketInputRecord.java:243)
[2023-04-05T02:54:26.051Z] at sun.security.ssl.SSLSocketInputRecord.decode (SSLSocketInputRecord.java:181)
[2023-04-05T02:54:26.051Z] at sun.security.ssl.SSLTransport.decode (SSLTransport.java:111)
[2023-04-05T02:54:26.051Z] at sun.security.ssl.SSLSocketImpl.decode (SSLSocketImpl.java:1510)
[2023-04-05T02:54:26.051Z] at sun.security.ssl.SSLSocketImpl.readApplicationRecord (SSLSocketImpl.java:1477)
[2023-04-05T02:54:26.051Z] at sun.security.ssl.SSLSocketImpl$AppInputStream.read (SSLSocketImpl.java:1066)
[2023-04-05T02:54:26.051Z] at org.apache.http.impl.io.SessionInputBufferImpl.streamRead (SessionInputBufferImpl.java:137)
[2023-04-05T02:54:26.051Z] at org.apache.http.impl.io.SessionInputBufferImpl.read (SessionInputBufferImpl.java:197)
[2023-04-05T02:54:26.051Z] at org.apache.http.impl.io.ContentLengthInputStream.read (ContentLengthInputStream.java:176)
[2023-04-05T02:54:26.051Z] at org.apache.http.conn.EofSensorInputStream.read (EofSensorInputStream.java:135)
[2023-04-05T02:54:26.051Z] at org.apache.http.conn.EofSensorInputStream.read (EofSensorInputStream.java:148)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.spi.connector.transport.AbstractTransporter.copy (AbstractTransporter.java:163)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.spi.connector.transport.AbstractTransporter.utilGet (AbstractTransporter.java:94)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.transport.http.HttpTransporter.access$200 (HttpTransporter.java:96)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.transport.http.HttpTransporter$EntityGetter.handle (HttpTransporter.java:565)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.transport.http.HttpTransporter.execute (HttpTransporter.java:374)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.transport.http.HttpTransporter.implGet (HttpTransporter.java:319)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.spi.connector.transport.AbstractTransporter.get (AbstractTransporter.java:63)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.connector.basic.BasicRepositoryConnector$GetTaskRunner.runTask (BasicRepositoryConnector.java:482)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:414)
[2023-04-05T02:54:26.051Z] at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:66)
[2023-04-05T02:54:26.051Z] at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
[2023-04-05T02:54:26.051Z] at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
[2023-04-05T02:54:26.051Z] at java.lang.Thread.run (Thread.java:829)
script returned exit code 1
Closing as the "unreliable" behavior (which is BOM-only) is tracked in https://github.com/jenkins-infra/helpdesk/issues/3481
I am not sure why scare quotes were used. When trying to enable the artifact caching proxy in ATH, I got failures on both DigitalOcean and Azure, but not when I removed the artifact caching proxy (using Maven 3.9.1 in all three cases). The artifact caching proxy is clearly unreliable, and this is clearly not BOM-only.
About the ACP Azure problem encountered with ATH:
Connection Reset
.Nothing obvious at first sight.
Currently checking the metrics of the VMs used as agent by the ATH to see if there has been an error on these machines
Extracted the list of agent names from https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1097/10/ with the command grep -iIr 'running on ubuntu' consoleText | cut -d' ' -f4 | sort -u
, adapted it to the following Datadog filter:
host:ubuntu-20-ba8a00 OR host:ubuntu-20-highmem0d07c0 OR host:ubuntu-20-highmem0d07c1 OR host:ubuntu-20-highmem165be1 OR host:ubuntu-20-highmem1b0380 OR host:ubuntu-20-highmem2b8f80 OR host:ubuntu-20-highmem2b8f81 OR host:ubuntu-20-highmem34e3e1 OR host:ubuntu-20-highmem4a1770 OR host:ubuntu-20-highmem5a9190 OR host:ubuntu-20-highmem5a9191 OR host:ubuntu-20-highmem5a9192 OR host:ubuntu-20-highmem63e5d0 OR host:ubuntu-20-highmem826dc0 OR host:ubuntu-20-highmem97a160 OR host:ubuntu-20-highmembe07f0 OR host:ubuntu-20-highmembf7da0 OR host:ubuntu-20-highmemcff7d0 OR host:ubuntu-20-highmemdc8fe0 OR host:ubuntu-20-highmemfc8d80 OR host:
=> it appears that only one of these machines reported metrics during the past 24h, and only for a few minutes:
while there is a build of the ATH (https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/master/722/) currently being build and its VM agents are reporting metrics:
Azure status does not report any incident though: https://azure.status.microsoft/en-us/status/ and https://azure.status.microsoft/en-us/status/history/.
I'm not sure where to go from here: again, there is nothing pointing at the ACP particularly, but it is hard to diagnose so any idea to help is welcome.
Feels like a network issue, which might be caused by the IP overlap in the legacy networks (ref. https://github.com/jenkins-infra/helpdesk/issues/3351). Proposal: setting the network of the ci.jenkins.io closer to the ACP's to decrease the latency and in a not-overlapped network
Closing as the "unreliable" behavior (which is BOM-only) is tracked in #3481
I am not sure why scare quotes were used. When trying to enable the artifact caching proxy in ATH, I got failures on both DigitalOcean and Azure, but not when I removed the artifact caching proxy (using Maven 3.9.1 in all three cases). The artifact caching proxy is clearly unreliable, and this is clearly not BOM-only.
I can edit my comment if the quotes bother you. There is nothing to understand from this quotes. It was written by reflex by someone who is not a native English speaker.
Yes, you have issues when enabling the ACP. Now that you reported the issue with the ATH we are looking on it.
Scare quotes are used to elicit doubt, so I wanted to be clear that there is a problem. I am happy to step back and let the infrastructure team investigate. I do not have the context and access to dig deeper anyway, and I do not debug problems like this with dashboards and log files — I am more old-school and debug problems like this by doing experiments and running system-level commands to observe the effects. Nevertheless I have never met a debugging problem that I could not eventually solve.
FWIW since the merge of your PR introducing ACP on ATH, there has been 3 builds on master and 3 builds on PRs without any ACP error.
FWIW since your PR introducing ACP on ATH, there has been 3 builds on master and 3 builds on PRs without any ACP error.
That is expected, because I removed the lines introducing the artifact caching proxy from that PR:
giving up on enabling artifact caching proxy for ATH.
I would have kept them, if it were not for the proxy being unreliable as described in this ticket.
That is expected, because I removed the lines introducing the artifact caching proxy from that PR
The infra.runMaven
function is using artifact caching proxy by default.
This is the function now used in the Jenkinsfile on the master
branch since the merge of your PR: https://github.com/jenkinsci/acceptance-test-harness/blob/master/Jenkinsfile#L47
We can see the use of the ACP in the recent master build logs, ex: https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/master/725/consoleText
[2023-04-05T10:37:00.754Z] INFO: using artifact caching proxy from 'do' provider.
[2023-04-05T10:37:05.571Z] + mvn --batch-mode --show-version --errors --no-transfer-progress -s /home/jenkins/agent/workspace/e_acceptance-test-harness_master@tmp/config14181552121585774481tmp -Dset.changelist -DskipTests -U clean install
You seem to be missing the fact that the primary usage of Maven in that job is within run.sh
. When I tried to make that use the proxy yesterday, I saw failures on both Azure and DigitalOcean as described in this ticket.
Starting to work on the network of the ci.jenkins.io's VM agents. Goal: ensure that the subnet where the machines are spawned is close to the ACP and does not suffer from any kind of routing overlap to avoid abrupt cuts in the intermediate networking layers.
Got an error with the AWS ACP, in the context of a BOM build on its master
branch, involving a temporary failure in DNS resolution: https://ci.jenkins.io/blue/organizations/jenkins/Tools%2Fbom/detail/master/1666/pipeline#step-5846-log-248
It's not an issue with ACP itself, but worth mentioning.
Hello 👋 It's worth retrying ATH artifact caching proxy as https://github.com/jenkins-infra/helpdesk/issues/3535#issuecomment-1607874573 was implemented : that should remove a lot of the network errors seen in the past months
I've opened a PR at https://github.com/jenkinsci/acceptance-test-harness/pull/1279
Since the PR merge two weeks ago I didn't notice particular ATH failures directly related to the artifact caching proxy.
It seems to me we can close this issue, WDYT @basil ?
No objection, closing this issue.
Sorry, I missed this notification last week. In any case I have not seen any ACP failures. Thank you very much!
Service(s)
Artifact-caching-proxy
Summary
Despite https://github.com/jenkins-infra/helpdesk/issues/2752#issuecomment-1490215215, https://ci.jenkins.io/job/Tools/job/bom/job/master/1568/ failed after approximately 90 minutes with
Reproduction steps
Run a lot of
jenkinsci/bom
builds without theskip-artifact-caching-proxy
label and you are bound to see it within a few attempts.