google / conscrypt

Conscrypt is a Java Security Provider that implements parts of the Java Cryptography Extension and Java Secure Socket Extension.
Apache License 2.0
1.26k stars 266 forks source link

Indefinite blocks on NativeCrypto.SSL_do_handshake #864

Open juneoh opened 4 years ago

juneoh commented 4 years ago

We have user reports from Google Cloud Dataproc that threads/tasks would intermittently hang on NativeCrypto.SSL_do_handshake. There were some Java and native thread dumps shared in GoogleCloudDataproc/hadoop-connectors/issues/153, but we were unable to produce a solid reproduction so far.

Dataproc currently uses Conscrypt 1.4.2 on java-8-openjdk-amd64, and on the OS side we have heard from both Debian 9 (Debian-provided OpenJDK) and Debian 10 (AdoptOpenJDK). One user shared with GCP Support that they still saw the same hang even after manually updating to Conscrypt 2.4.0.

Below is an example of the thread dump.

"gcsfs-batch-helper-3912" #4130 daemon prio=5 os_prio=0 tid=0x00007f92a4040800 nid=0x3130 runnable [0x00007f9330815000]
   java.lang.Thread.State: RUNNABLE
    at org.conscrypt.NativeCrypto.SSL_do_handshake(Native Method)
    at org.conscrypt.NativeSsl.doHandshake(NativeSsl.java:392)
    at org.conscrypt.ConscryptFileDescriptorSocket.startHandshake(ConscryptFileDescriptorSocket.java:225)
    at org.conscrypt.ConscryptFileDescriptorSocket.waitForHandshake(ConscryptFileDescriptorSocket.java:474)
    at org.conscrypt.ConscryptFileDescriptorSocket.getOutputStream(ConscryptFileDescriptorSocket.java:461)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:465)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    - locked <0x0000000651a1d8a8> (a sun.net.www.protocol.https.HttpsClient)
    at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
    at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:104)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.BatchHelper.execute(BatchHelper.java:175)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.BatchHelper.lambda$queue$0(BatchHelper.java:163)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.BatchHelper$$Lambda$96/631118845.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
prbprbprb commented 4 years ago

Looking at the native stack trace in the referenced bug, I'm a bit surprised that it literally just calls poll, there ought to be some intermediate calls although maybe this code has just changed a lot since 1.4.2. Not important but it would be nice to pin down exactly where it's hanging. The upshot though is that Conscrypt is waiting for data from the server which isn't arriving and is hanging forever.

The fact you're not seeing this with other TLS implementations and people are still seeing it with 2.4.0 does suggest it's a Conscrypt bug. However we're actively migrating away from this implementation of SSLSocket (mainly due to the complexity of the native code) so unless you have a reliable repro, there's not a lot we can do.

It would be great if you guys could test against the new implementation, but that would mean updating to 2.4.0 or later, and the dataproc bug suggests you're blocked by #864. On the plus side, we have someone actively working on that. And for testing purposes you could probably use 2.4.0 with some code to install your own TrustManager or ConscryptHostnameVerifier.

medb commented 4 years ago

May you elaborate on how to test a new socket implementation instead of SSLSocket in Conscrypt 2.4.0?

Maybe the bug here could be as simple as that Conscrypt does not set a read timeout (or some other timeout) on SSLSocket?

prbprbprb commented 4 years ago

Ah sorry, to test the engine-based implementation, call Conscrypt.setUseEngineSocketByDefault(true); before creating any sockets or socket factories or you can set the Java property org.conscrypt.useEngineSocketByDefault to true on the command line. Note that we plan to switch the default in 2.5.0 anyway which was targeted for this week but is currently blocked by an unrelated bug.

By default I believe the socket timeouts are 0 (i.e. indefinite). There's some complexity here between the implementations...

Both implementations support read timeouts using the standard Socket.setSoTimeout() API. If your app is setting this timeout and you're still seeing input hangs then that's a definite bug.

ConscryptFileDescriptorSocket implements a separate write timeout. ConscryptEngineSocket currently does not. There is an unsupported reflexive API for setting it but it's discouraged (and is a no-op on ConscryptEngineSocket).

ConscryptFileDescriptorSocket also has the notion of a handshake timeout which overrides the read and write timeouts for the duration of the TLS handshake. There is, however, no public API for setting this and no plans to add one. If you need a different timeout for the handshake just call setSoTimeout() again once the connection is established.

medb commented 3 years ago

Dataproc was updated to use Conscrypt 2.5.1 by default, I think that we can close this issue until we will have an evidence that it re-occurs on Conscrypt 2.5.1+.

medb commented 3 years ago

We got reoccurrence of the same hanging with Conscrypt 2.5.1 that uses new ConscryptEngineSocket by default:

08      gcsfs-batch-helper-151  RUNNABLE        Monitor(java.lang.Object@1728667131}), Monitor(sun.net.www.protocol.https.HttpsURLConnectionImpl@525072463}), Monitor(sun.net.www.protocol.https.HttpsClient@575051682}), Monitor(sun.net.www.protocol.https.DelegateHttpsURLConnection@389579666}), Lock(java.util.concurrent.ThreadPoolExecutor$Worker@2076466387})
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:171)
java.net.SocketInputStream.read(SocketInputStream.java:141)
org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209) => holding Monitor(java.lang.Object@1728667131})
org.conscrypt.ConscryptEngineSocket.waitForHandshake(ConscryptEngineSocket.java:547)
org.conscrypt.ConscryptEngineSocket.getOutputStream(ConscryptEngineSocket.java:290)
sun.net.www.http.HttpClient.openServer(HttpClient.java:465)
sun.net.www.http.HttpClient.openServer(HttpClient.java:558) => holding Monitor(sun.net.www.protocol.https.HttpsClient@575051682})
sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1340) => holding Monitor(sun.net.www.protocol.https.DelegateHttpsURLConnection@389579666})
sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1315) => holding Monitor(sun.net.www.protocol.https.DelegateHttpsURLConnection@389579666})
sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:264) => holding Monitor(sun.net.www.protocol.https.HttpsURLConnectionImpl@525072463})
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:113)
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1011)
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.BatchHelper.lambda$flushPendingRequests$2(BatchHelper.java:224)
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.BatchHelper$$Lambda$228/726710777.call(Unknown Source)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
samwiise commented 3 years ago

Issue is still there in Conscrypt 2.5.1

"pool-7-thread-4" #33 prio=5 os_prio=0 tid=0x00007f3b383f3800 nid=0x35fe runnable [0x00007f3b3ca4e000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920) at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884) at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706) at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230) at org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)

medb commented 3 years ago

@prbprbprb Maybe this hang is caused by the JDK-8238579 issue?

cnauroth commented 1 year ago

I think I've found the root cause. It requires a careful reading of the sun.net.www JDK classes and how they interact with Conscrypt classes. For reference, here is the Java version used in my tests:

openjdk version "1.8.0_332"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (Temurin)(build 25.332-b09, mixed mode)

All of the following stack traces and code walkthrough will refer to OpenJDK 8u322 and Conscrypt 2.5.2.

Following are stack traces from running a simple test harness that uses HttpsURLConnection with a 20-second read timeout. To simulate a network fault, I'm applying an iptables rule on connections to a GCS IP address and then using /etc/hosts to route all GCS connections to that IP address:

IP=$(nslookup storage.googleapis.com | grep -E 'Address: [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | head -1 | awk '{ print $NF }')

sudo iptables -A OUTPUT -p tcp --source 10.240.1.19 --destination "${IP}" -m state --state ESTABLISHED -j DROP

cnauroth@cnauroth-conscrypt-on-m:~$ cat /etc/hosts
127.0.0.1   localhost
::1     localhost ip6-localhost ip6-loopback
ff02::1     ip6-allnodes
ff02::2     ip6-allrouters

108.177.111.128 storage.googleapis.com

10.240.0.199 cnauroth-conscrypt-on-m.us-central1-c.c.hadoop-cloud-dev.google.com.internal cnauroth-conscrypt-on-m  # Added by Google
169.254.169.254 metadata.google.internal  # Added by Google

The iptables rule allows for the initial syn/ack to get the connection established, but then all subsequent inbound packets are dropped before reaching the application layer. We'd expect to see the first socket read during the TLS handshake to timeout after 20 seconds. Instead, the read hangs indefinitely.

First, here is the test using Conscrypt, and running jstack on the process after it hangs:

"main" #1 prio=5 os_prio=0 tid=0x00007fc3f800a800 nid=0x54ac runnable [0x00007fc3ff14a000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
    at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
    at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
    at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
    at org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)
    - locked <0x0000000772a610a0> (a java.lang.Object)
    at org.conscrypt.ConscryptEngineSocket.waitForHandshake(ConscryptEngineSocket.java:547)
    at org.conscrypt.ConscryptEngineSocket.getOutputStream(ConscryptEngineSocket.java:290)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:465)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    - locked <0x0000000772a3da00> (a sun.net.www.protocol.https.HttpsClient)
    at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
    at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:203)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:189)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1572)
    - locked <0x0000000772a39df8> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
    - locked <0x0000000772a39df8> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
    - locked <0x0000000771eadb40> (a sun.net.www.protocol.https.HttpsURLConnectionImpl)
    at TestHttpsURLConnection.main(TestHttpsURLConnection.java:17)

Second, here is the test using the default security provider, resulting in an exception after the 20-second read timeout:

Exception in thread "main" javax.net.ssl.SSLException: Read timed out
    at sun.security.ssl.Alert.createSSLException(Alert.java:127)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138)
    at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1397)
    at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1305)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:440)
    at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:197)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1572)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
    at TestHttpsURLConnection.main(TestHttpsURLConnection.java:18)
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
    at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:165)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109)
    ... 9 more

It's interesting to note that both processes are performing TLS handshake, but they are doing the handshake from different points of execution in AbstractDelegateHttpsURLConnection#connect. With Conscrypt, we are on line 189, invoking HttpURLConnection#plainConnect. With the default security provider, we are past that and instead down to line 197, calling HttpsClient#afterConnect.

Execution with Conscrypt proceeds into HttpURLConnection#plainConnect0 line 1162. However, note here that the read timeout is not applied until line 1163, after this completes!

                                http = getNewHttpClient(url, p, connectTimeout);
                                http.setReadTimeout(readTimeout);

Thus, it appears that Conscrypt, unlike the default security provider, is performing the TLS handshake at a time when the read timeout has not yet been applied. It seems like the sun.net.www classes have a hidden assumption that TLS handshake will not be performed until HttpsClient#afterConnect. Conscrypt, instead, triggers the TLS handshake earlier in ConscryptEngineSocket#getOutputStream. The default provider's SSLSocketImpl#getOutputStream does not trigger TLS handshake by side effect like this.

To summarize:

It doesn't appear that there is any Java upgrade path that would resolve this issue. Reviewing git diff jdk8u332-ga..master, I don't see any recent changes in sun.net.www that would change this behavior. Similarly, more recent JVMs retain the same behavior, e.g. Java 17:

https://github.com/openjdk/jdk17u/blob/c1e17197222932a03a04d3b8d9c0d7f94be07947/src/java.base/share/classes/sun/net/www/protocol/http/HttpURLConnection.java#L1242

Xavron commented 2 months ago

Am seeing FX File Explorer (network) client use on Android stuck during org.conscrypt.NativeCrypto.SSL_do_handshake(Native Method).

Samsung Files (network) client also fails here.

_For my use case, its showing up with Android to Android and is happening on Android 8 test device. Android 14 test device isn't seeing this (NativeCrypto.SSL_dohandshake(Native Method)) with the exact same code pushed out to it.

If SocketInputStream.socketRead0 is still the same issue then that's what I'm seeing on Android 14. I didn't look to see how that's connected.

Update 1: I randomly noticed today that SocketInputStream.socketRead0 can simply show from handshake timeout being set and happening sslSocket.setSoTimeout(). Where the client side can overcome this as 2 different clients have 2 different results in my case (one works and the other doesn't and so for that one the timeout occurs). For whatever reason, I didn't notice this until now. I don't yet know if this should be definitely be a client problem or not, but in my usage, clients can deal with it.

Update 2: Two clients now working and SocketInputStream.socketRead0 eliminated as a thing from the 2nd. Still have it for some other clients that I now need to look into. I'm definitely so far only seeing this as a client software issue and not as a Conscrypt or Java one. There's going to be a hang if the client causes one :)

Final update: NativeCrypto.SSL_do_handshake(Native Method) also seen fixed by same client problem as in update 2 for the rest of the clients. Should be fixed via client (in my case). Although, I'm using server side workaround to fix the clients. This is not a workaround to post, as the situation I see is different in code and won't be of any help at all. At the same time, both sides need to communicate properly, so if one side expects something and the other expects something and neither give, then this is what I'm seeing cause both of these two to show up. It could be different problem from the issue or the same. Either way, that would be they're the same or there's something else involved and not the title problem.