gradle / gradle

Adaptable, fast automation for all
https://gradle.org
Apache License 2.0
16.6k stars 4.65k forks source link

Socket connections from non-gradle applications crash gradle executors #27801

Open gharris1727 opened 7 months ago

gharris1727 commented 7 months ago

Current Behavior

A Gradle executor experiences this exception:

Unexpected exception thrown.
org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:49184'.
        at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
        at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
        at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
        at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: 
        at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
        at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
        at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
        at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
        at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
        at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
org.gradle.internal.remote.internal.ConnectException: Could not connect to server [4f40fbf1-8d30-4c4a-9ba1-aaa9b723f4ea port:39287, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
    at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
    at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
    at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
    at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
    at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
    at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
    at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:122)
    at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
    at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
    ... 5 more

And the build fails due to this error:

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':storage:test'.
> Process 'Gradle Test Executor 102' finished with non-zero exit value 1
  This problem might be caused by incorrect test process configuration.
  For more on test execution, please refer to https://docs.gradle.org/8.5/userguide/java_testing.html#sec:test_execution in the Gradle documentation.

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Get more help at https://help.gradle.org.

Expected Behavior

The connection should be accepted, and then disconnected once the received data is shown to be invalid. The Gradle executor should continue running and accepting connections until a valid connection has been made. The Gradle executor could keep running indefinitely without a connection, or given a moderate (1min) timeout before the connection is abandoned completely.

Context (optional)

We use Gradle to build the Apache Kafka project, and received this bug report from one of our contributors.

We have numerous tests which appear to leak clients in this way, and it appears that the particulars of the Gradle protocol and the Kafka protocol happen to cause the Gradle end of the connection to exit ungracefully.

We are also working to reduce the number of leaked clients and network connections in our project to mitigate this failure, but it is very likely that there are downstream users of Kafka clients which could experience the same style of defect in their tests, and have similar build failures. And more generally, any sort of client could happen to connect to the Gradle executor and trigger this failure with a crafted response. This could be happening without Kafka in use at all, theoretically.

This appears similar to the descriptions of #27512, #23602, and #12566 but I thought to open this issue separately with my reproduction case if it turns out to be distinct.

Steps to Reproduce

  1. A test opens a server socket with port 0, and gets assigned a random port N
  2. The test starts a client which will continue to reconnect to port N indefinitely if the connection with the server is lost
  3. The test stops the server, and port N is released to the OS and allowed to be reassigned
  4. The client beings retrying connections to port N
  5. A Gradle executor daemon port is eventually assigned port N by the OS
  6. The client connects to port N, and reaches the Gradle daemon unintentionally
  7. The Gradle daemon abandons that connection with an error and closes the listening socket
  8. The Gradle executor is killed with an exception when it tries to reach the closed socket

Here is a reproducer: https://github.com/gharris1727/gradle-crash-reproducer . For a specific failed build, see https://github.com/gharris1727/gradle-crash-reproducer/actions/runs/7632481625/job/20792782276 .

The reproducer has 4 modules, structured such that the tests for app-2 start after the tests for app-1 are already running. app-1 is continuously searching for ports which it can't bind() the address for, and connecting to them and then immediately closing the connection. This specific reproducer relies on MacOS for sequential port allocation in order to make it more reliable, but I think the same failure is possible on any platform.

Here's the logs from the reproducer, in case the action becomes inaccessible:

2024-01-23T22:13:08.059+0000 [DEBUG] [TestEventLogger] Gradle Test Run :app-2:test STARTED
...
2024-01-23T22:13:08.069+0000 [DEBUG] [org.gradle.internal.remote.internal.inet.TcpIncomingConnector] Listening on [77f73e7b-bbe9-4630-8033-0a25e66a99f3 port:58112, addresses:[localhost/127.0.0.1]].
...
2024-01-23T22:13:08.393+0000 [DEBUG] [TestEventLogger] TestWithLeakedClient > test() STANDARD_OUT
2024-01-23T22:13:08.393+0000 [DEBUG] [TestEventLogger]     [58112, 58113, 54322, 54323, 55555, 54324, 55556]
...
Error: -23T22:13:08.406+0000 [ERROR] [org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection] Unexpected exception thrown.
org.gradle.internal.remote.internal.MessageIOException: Could not write '/127.0.0.1:54325'.
    at org.gradle.internal.remote.internal.inet.SocketConnection.flush(SocketConnection.java:140)
    at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionDispatch.run(MessageHub.java:333)
    at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
    at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Broken pipe
    at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
    at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:58)
    at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:50)
    at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:462)
    at org.gradle.internal.remote.internal.inet.SocketConnection$SocketOutputStream.writeWithNonBlockingRetry(SocketConnection.java:279)
    at org.gradle.internal.remote.internal.inet.SocketConnection$SocketOutputStream.writeBufferToChannel(SocketConnection.java:267)
    at org.gradle.internal.remote.internal.inet.SocketConnection$SocketOutputStream.flush(SocketConnection.java:261)
    at org.gradle.internal.remote.internal.inet.SocketConnection.flush(SocketConnection.java:138)
    ... 6 more
Error: -23T22:13:08.609+0000 [ERROR] [system.err] org.gradle.internal.remote.internal.ConnectException: Could not connect to server [77f73e7b-bbe9-4630-8033-0a25e66a99f3 port:58112, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
Error: -23T22:13:08.610+0000 [ERROR] [system.err]   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
Error: -23T22:13:08.610+0000 [ERROR] [system.err]   at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
Error: -23T22:13:08.610+0000 [ERROR] [system.err]   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
Error: -23T22:13:08.610+0000 [ERROR] [system.err]   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
Error: -23T22:13:08.610+0000 [ERROR] [system.err]   at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
Error: -23T22:13:08.610+0000 [ERROR] [system.err]   at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
Error: -23T22:13:08.611+0000 [ERROR] [system.err] Caused by: java.net.ConnectException: Connection refused
Error: -23T22:13:08.611+0000 [ERROR] [system.err]   at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
Error: -23T22:13:08.611+0000 [ERROR] [system.err]   at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:774)
Error: -23T22:13:08.611+0000 [ERROR] [system.err]   at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:120)
Error: -23T22:13:08.612+0000 [ERROR] [system.err]   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
Error: -23T22:13:08.612+0000 [ERROR] [system.err]   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
Error: -23T22:13:08.612+0000 [ERROR] [system.err]   ... 5 more

Gradle version

8.5

Build scan URL (optional)

No response

Your Environment (optional)

No response

cobexer commented 7 months ago

Thank you for providing a valid report.

The issue is in the backlog of the relevant team and is prioritized by them.


Communication between the Gradle Daemon and a test executor may also avoid using a channel discoverable from outside the Gradle daemon entirely.

gharris1727 commented 5 months ago

Hi @cobexer!

Can you share an estimate for when the team may address this issue? I want to know whether I should solicit an Apache Kafka contributor to contribute a fix, or pursue a fix myself.

Thanks!

cpholt commented 2 months ago

I think we're hitting this same issue without kafka. I just upgraded gradle from 7.4 to 8.5 and now every jenkins job fails with the same mystery IOException. Lots of our test follow the same model where a client will repeatedly connect to a server. I'm assuming that we're seeing the same thing were a client accidentally hits a gradle port.

Has any progress been made on this?

My next step will be to try newer version of gradle after 8.5.

UPDATE:

After A LOT of test runs and narrowing down the failing test, it turns out my problem was caused by a unit test triggering a call to Runtime.halt(). My previous approach was to set a system property in the test task in gradle to block calls to the method in question. For some reason this stopped working in gradle 8.

gradle.taskGraph.whenReady { taskGraph -> taskGraph.allTasks.each { task -> if (task.name == "test") { task.systemProperties = [testingnoexit:"1"] task.ignoreFailures = true task.afterSuite { desc, result -> if (desc.getParent() == null) { ext.testFailures += result.getFailedTestCount() //count failures } } testCheck.dependsOn(task) } } }

My new approach is to try Class.forName() on a JUnit class. if that succeeds, I assume I'm in a unit test and the halt() call is avoided.

gharris1727 commented 2 months ago

@cpholt I think your situation is better captured in #3610 because it involves calling System.exit and Runtime.halt. This issue is specific to the network connection interference, and the gradle worker exiting is a symptom, rather than a cause.

Kafka normally has control over the JVM execution lifetime, and calls System.exit/Runtime.halt. In tests we have to mock out the exit procedures to prevent the situation you're describing.

cpholt commented 2 months ago

@gharris1727 You are absolutely correct. I had worked through the System.exit/Runtime.halt under gradle 7.4 years ago. When I moved to 8.5/8.8, i was seeing a failure which lead me here (incorrectly). Once I realized it was the exit/halt issue, I had it resolved (again) withing a few minutes. I hadn't seen the #3610 issue, but that's it exactly.