Open juneoh opened 4 years ago
Looking at the native stack trace in the referenced bug, I'm a bit surprised that it literally just calls poll
, there ought to be some intermediate calls although maybe this code has just changed a lot since 1.4.2. Not important but it would be nice to pin down exactly where it's hanging. The upshot though is that Conscrypt is waiting for data from the server which isn't arriving and is hanging forever.
The fact you're not seeing this with other TLS implementations and people are still seeing it with 2.4.0 does suggest it's a Conscrypt bug. However we're actively migrating away from this implementation of SSLSocket
(mainly due to the complexity of the native code) so unless you have a reliable repro, there's not a lot we can do.
It would be great if you guys could test against the new implementation, but that would mean updating to 2.4.0 or later, and the dataproc bug suggests you're blocked by #864. On the plus side, we have someone actively working on that. And for testing purposes you could probably use 2.4.0 with some code to install your own TrustManager
or ConscryptHostnameVerifier
.
May you elaborate on how to test a new socket implementation instead of SSLSocket
in Conscrypt 2.4.0?
Maybe the bug here could be as simple as that Conscrypt does not set a read timeout (or some other timeout) on SSLSocket
?
Ah sorry, to test the engine-based implementation, call Conscrypt.setUseEngineSocketByDefault(true);
before creating any sockets or socket factories or you can set the Java property org.conscrypt.useEngineSocketByDefault
to true on the command line. Note that we plan to switch the default in 2.5.0 anyway which was targeted for this week but is currently blocked by an unrelated bug.
By default I believe the socket timeouts are 0 (i.e. indefinite). There's some complexity here between the implementations...
Both implementations support read timeouts using the standard Socket.setSoTimeout()
API. If your app is setting this timeout and you're still seeing input hangs then that's a definite bug.
ConscryptFileDescriptorSocket
implements a separate write timeout. ConscryptEngineSocket
currently does not. There is an unsupported reflexive API for setting it but it's discouraged (and is a no-op on ConscryptEngineSocket
).
ConscryptFileDescriptorSocket
also has the notion of a handshake timeout which overrides the read and write timeouts for the duration of the TLS handshake. There is, however, no public API for setting this and no plans to add one. If you need a different timeout for the handshake just call setSoTimeout()
again once the connection is established.
Dataproc was updated to use Conscrypt 2.5.1 by default, I think that we can close this issue until we will have an evidence that it re-occurs on Conscrypt 2.5.1+.
We got reoccurrence of the same hanging with Conscrypt 2.5.1 that uses new ConscryptEngineSocket
by default:
08 gcsfs-batch-helper-151 RUNNABLE Monitor(java.lang.Object@1728667131}), Monitor(sun.net.www.protocol.https.HttpsURLConnectionImpl@525072463}), Monitor(sun.net.www.protocol.https.HttpsClient@575051682}), Monitor(sun.net.www.protocol.https.DelegateHttpsURLConnection@389579666}), Lock(java.util.concurrent.ThreadPoolExecutor$Worker@2076466387})
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:171)
java.net.SocketInputStream.read(SocketInputStream.java:141)
org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209) => holding Monitor(java.lang.Object@1728667131})
org.conscrypt.ConscryptEngineSocket.waitForHandshake(ConscryptEngineSocket.java:547)
org.conscrypt.ConscryptEngineSocket.getOutputStream(ConscryptEngineSocket.java:290)
sun.net.www.http.HttpClient.openServer(HttpClient.java:465)
sun.net.www.http.HttpClient.openServer(HttpClient.java:558) => holding Monitor(sun.net.www.protocol.https.HttpsClient@575051682})
sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1340) => holding Monitor(sun.net.www.protocol.https.DelegateHttpsURLConnection@389579666})
sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1315) => holding Monitor(sun.net.www.protocol.https.DelegateHttpsURLConnection@389579666})
sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:264) => holding Monitor(sun.net.www.protocol.https.HttpsURLConnectionImpl@525072463})
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:113)
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1011)
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.BatchHelper.lambda$flushPendingRequests$2(BatchHelper.java:224)
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.BatchHelper$$Lambda$228/726710777.call(Unknown Source)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
Issue is still there in Conscrypt 2.5.1
"pool-7-thread-4" #33 prio=5 os_prio=0 tid=0x00007f3b383f3800 nid=0x35fe runnable [0x00007f3b3ca4e000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920) at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884) at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706) at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230) at org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)
@prbprbprb Maybe this hang is caused by the JDK-8238579 issue?
I think I've found the root cause. It requires a careful reading of the sun.net.www
JDK classes and how they interact with Conscrypt classes. For reference, here is the Java version used in my tests:
openjdk version "1.8.0_332"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (Temurin)(build 25.332-b09, mixed mode)
All of the following stack traces and code walkthrough will refer to OpenJDK 8u322 and Conscrypt 2.5.2.
Following are stack traces from running a simple test harness that uses HttpsURLConnection
with a 20-second read timeout. To simulate a network fault, I'm applying an iptables
rule on connections to a GCS IP address and then using /etc/hosts
to route all GCS connections to that IP address:
IP=$(nslookup storage.googleapis.com | grep -E 'Address: [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | head -1 | awk '{ print $NF }')
sudo iptables -A OUTPUT -p tcp --source 10.240.1.19 --destination "${IP}" -m state --state ESTABLISHED -j DROP
cnauroth@cnauroth-conscrypt-on-m:~$ cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
108.177.111.128 storage.googleapis.com
10.240.0.199 cnauroth-conscrypt-on-m.us-central1-c.c.hadoop-cloud-dev.google.com.internal cnauroth-conscrypt-on-m # Added by Google
169.254.169.254 metadata.google.internal # Added by Google
The iptables
rule allows for the initial syn/ack to get the connection established, but then all subsequent inbound packets are dropped before reaching the application layer. We'd expect to see the first socket read during the TLS handshake to timeout after 20 seconds. Instead, the read hangs indefinitely.
First, here is the test using Conscrypt, and running jstack
on the process after it hangs:
"main" #1 prio=5 os_prio=0 tid=0x00007fc3f800a800 nid=0x54ac runnable [0x00007fc3ff14a000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
at org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)
- locked <0x0000000772a610a0> (a java.lang.Object)
at org.conscrypt.ConscryptEngineSocket.waitForHandshake(ConscryptEngineSocket.java:547)
at org.conscrypt.ConscryptEngineSocket.getOutputStream(ConscryptEngineSocket.java:290)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:465)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
- locked <0x0000000772a3da00> (a sun.net.www.protocol.https.HttpsClient)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:203)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:189)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1572)
- locked <0x0000000772a39df8> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
- locked <0x0000000772a39df8> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
- locked <0x0000000771eadb40> (a sun.net.www.protocol.https.HttpsURLConnectionImpl)
at TestHttpsURLConnection.main(TestHttpsURLConnection.java:17)
Second, here is the test using the default security provider, resulting in an exception after the 20-second read timeout:
Exception in thread "main" javax.net.ssl.SSLException: Read timed out
at sun.security.ssl.Alert.createSSLException(Alert.java:127)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138)
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1397)
at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1305)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:440)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:197)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1572)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
at TestHttpsURLConnection.main(TestHttpsURLConnection.java:18)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:165)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109)
... 9 more
It's interesting to note that both processes are performing TLS handshake, but they are doing the handshake from different points of execution in AbstractDelegateHttpsURLConnection#connect
. With Conscrypt, we are on line 189, invoking HttpURLConnection#plainConnect
. With the default security provider, we are past that and instead down to line 197, calling HttpsClient#afterConnect
.
Execution with Conscrypt proceeds into HttpURLConnection#plainConnect0
line 1162. However, note here that the read timeout is not applied until line 1163, after this completes!
http = getNewHttpClient(url, p, connectTimeout);
http.setReadTimeout(readTimeout);
Thus, it appears that Conscrypt, unlike the default security provider, is performing the TLS handshake at a time when the read timeout has not yet been applied. It seems like the sun.net.www
classes have a hidden assumption that TLS handshake will not be performed until HttpsClient#afterConnect
. Conscrypt, instead, triggers the TLS handshake earlier in ConscryptEngineSocket#getOutputStream
. The default provider's SSLSocketImpl#getOutputStream
does not trigger TLS handshake by side effect like this.
To summarize:
HttpClient
seems to assume that it can call getOutputStream()
before completing TLS handshake.getOutputStream()
starts the TLS handshake by side effect.HttpURLConnection
does not apply the socket read timeout until after this point of execution.It doesn't appear that there is any Java upgrade path that would resolve this issue. Reviewing git diff jdk8u332-ga..master
, I don't see any recent changes in sun.net.www
that would change this behavior. Similarly, more recent JVMs retain the same behavior, e.g. Java 17:
Am seeing FX File Explorer (network) client use on Android stuck during org.conscrypt.NativeCrypto.SSL_do_handshake(Native Method)
.
Samsung Files (network) client also fails here.
_For my use case, its showing up with Android to Android and is happening on Android 8 test device. Android 14 test device isn't seeing this (NativeCrypto.SSL_dohandshake(Native Method)) with the exact same code pushed out to it.
If SocketInputStream.socketRead0
is still the same issue then that's what I'm seeing on Android 14. I didn't look to see how that's connected.
Update 1: I randomly noticed today that SocketInputStream.socketRead0
can simply show from handshake timeout being set and happening sslSocket.setSoTimeout()
. Where the client side can overcome this as 2 different clients have 2 different results in my case (one works and the other doesn't and so for that one the timeout occurs). For whatever reason, I didn't notice this until now. I don't yet know if this should be definitely be a client problem or not, but in my usage, clients can deal with it.
Update 2: Two clients now working and SocketInputStream.socketRead0
eliminated as a thing from the 2nd. Still have it for some other clients that I now need to look into. I'm definitely so far only seeing this as a client software issue and not as a Conscrypt or Java one. There's going to be a hang if the client causes one :)
Final update: NativeCrypto.SSL_do_handshake(Native Method)
also seen fixed by same client problem as in update 2 for the rest of the clients. Should be fixed via client (in my case). Although, I'm using server side workaround to fix the clients. This is not a workaround to post, as the situation I see is different in code and won't be of any help at all. At the same time, both sides need to communicate properly, so if one side expects something and the other expects something and neither give, then this is what I'm seeing cause both of these two to show up. It could be different problem from the issue or the same. Either way, that would be they're the same or there's something else involved and not the title problem.
We have user reports from Google Cloud Dataproc that threads/tasks would intermittently hang on NativeCrypto.SSL_do_handshake. There were some Java and native thread dumps shared in GoogleCloudDataproc/hadoop-connectors/issues/153, but we were unable to produce a solid reproduction so far.
Dataproc currently uses Conscrypt 1.4.2 on java-8-openjdk-amd64, and on the OS side we have heard from both Debian 9 (Debian-provided OpenJDK) and Debian 10 (AdoptOpenJDK). One user shared with GCP Support that they still saw the same hang even after manually updating to Conscrypt 2.4.0.
Below is an example of the thread dump.