twitter / finagle

A fault tolerant, protocol-agnostic RPC system
https://twitter.github.io/finagle
Apache License 2.0

No TLS Session re-use when using `Netty4ClientEngineFactory` (default) #874

Open spockz opened 4 years ago

spockz commented 4 years ago

No TLS Session re-use when using Netty4ClientEngineFactory (default)

Expected behavior

When establishing a second connection between the same client and server instance, the TLS session should be re-used.

Actual behavior

Each connection gets its own TLS session when the client is configured with `SslClientConfiguration`.

Using the default `SslClientEngineFactory` leads to the `Netty4ClientEngineFactory`, which exhibits this behaviour. Configuring the `SSLContext` directly with `.withTransport.tls(SSLContext)` leads to the `SslContextClientEngineFactory`, which does re-use sessions.
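
For reference, the two configuration paths look roughly like this (a minimal sketch; the `Http.client` setup and context creation are illustrative, not the code from the reproduction):

```scala
import javax.net.ssl.SSLContext
import com.twitter.finagle.Http
import com.twitter.finagle.ssl.client.SslClientConfiguration

// Path 1: SslClientConfiguration -> Netty4ClientEngineFactory (the default).
// Observed behaviour: every connection negotiates a fresh TLS session.
val configuredClient =
  Http.client.withTransport.tls(SslClientConfiguration())

// Path 2: a javax.net.ssl.SSLContext passed directly -> SslContextClientEngineFactory.
// Observed behaviour: the TLS session is resumed on subsequent connections.
val sslContext: SSLContext = SSLContext.getDefault
val contextClient =
  Http.client.withTransport.tls(sslContext)
```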

Steps to reproduce the behavior

See the test cases at https://github.com/spockz/finagle-tls-session-reuse/blob/master/src/test/scala/com/github/spockz/finagle/it/tls/TlsTest.scala; output:

[info] TlsTest:
[info] Tls on server and client
[info] - should support session resumption with ssl context
[info]   + ###
[info]   + ### First call
[info]   + Return(Response("HTTP/1.1 Status(201)"))
[info]   + ###
[info]   + ###Second call
[info]   + Return(Response("HTTP/1.1 Status(201)"))
[info] - should support session resumption with SslClientConfiguration *** FAILED ***
[info]   Set("95-106-2055-9833-123101-3168-65-93-93-24-491537-3922-7-81906-11856-24-46789086113-50", "95-106-2055-11-1116-99-80-323326-2274-11950127-12765-29289-128-28-32-4825-73-101-33-8343") had size 2 instead of expected size 1 (TlsTest.scala:95)
[info]   + ###
[info]   + ### First call
[info]   + Return(Response("HTTP/1.1 Status(201)"))
[info]   + ###
[info]   + ###Second call
[info]   + Return(Response("HTTP/1.1 Status(201)"))
[info] - should support session resumption with SslClientConfiguration read from PEMS *** FAILED ***
[info]   Set("95-106-205563-679748-3372574941-19-1081-10398-709-417954-39-123-11647-4-10-5120", "95-106-2055-110-85446125371-8667-1041063569-5110-124119-184487-54-1118641-43-2792104") had size 2 instead of expected size 1 (TlsTest.scala:95)
[info]   + ###
[info]   + ### First call
[info]   + Return(Response("HTTP/1.1 Status(201)"))
[info]   + ###
[info]   + ###Second call
[info]   + Return(Response("HTTP/1.1 Status(201)"))
[info] - should support session resumption with SslClientConfiguration and a cached Netty4ClientEngineFactory *** FAILED ***
[info]   Throw(Failure(renegotiation unsupported at remote address: localhost/127.0.0.1:62739. Remote Info: Not Available, flags=0x08) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:62739, Downstream label: somelabel, Trace Id: e188359a301a902a.6fad2b7b5ed57202<:9511f135ae05febd with Service -> somelabel) was not an instance of com.twitter.util.Return, but an instance of com.twitter.util.Throw (TlsTest.scala:91)
[info]   + ###
[info]   + ### First call
[info]   + Return(Response("HTTP/1.1 Status(201)"))
[info]   + ###
[info]   + ###Second call
[info]   + Throw(Failure(renegotiation unsupported at remote address: localhost/127.0.0.1:62739. Remote Info: Not Available, flags=0x08) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:62739, Downstream label: somelabel, Trace Id: e188359a301a902a.6fad2b7b5ed57202<:9511f135ae05febd with Service -> somelabel)
[info] - should support session resumption with SslClientConfiguration and a SslContextEngineFactory
[info]   + ###
[info]   + ### First call
[info]   + Return(Response("HTTP/1.1 Status(201)"))
[info]   + ###
[info]   + ###Second call
[info]   + Return(Response("HTTP/1.1 Status(201)"))

Possible fixes

  1. I tried to cache the created `Engine` in https://github.com/spockz/finagle-tls-session-reuse/blob/master/src/test/scala/com/github/spockz/finagle/it/tls/TlsTest.scala#L147, which leads to the error `Failure(renegotiation unsupported...`. This appears to be because the engine has already been destroyed.
  2. Using the `SslContextClientEngineFactory` mitigates the session issue, but it falls back to JSSE for TLS, which hurts performance on JDK 8 and loses H2 support (see the sketch after this list).
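
A sketch of that mitigation, assuming Finagle's `withTransport.tls` overload that takes an explicit engine factory (the `SSLContext` construction here is illustrative):

```scala
import javax.net.ssl.SSLContext
import com.twitter.finagle.Http
import com.twitter.finagle.ssl.client.{SslClientConfiguration, SslContextClientEngineFactory}

// Keep the SslClientConfiguration, but swap in an engine factory backed by a
// single javax SSLContext so its session cache is shared across connections.
val sslContext: SSLContext = SSLContext.getDefault
val client = Http.client.withTransport.tls(
  SslClientConfiguration(),
  new SslContextClientEngineFactory(sslContext))
```
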
ryanoneill commented 4 years ago

So, I don't believe you need to cache the Engine but rather the Context. My understanding is that if the Context is cached, then it will work. We provide the Netty4ClientEngineFactory and Netty4ServerEngineFactory as the defaults because they're the safest in terms of correctness, but they aren't the best in terms of performance.

As I mentioned in Gitter, we mostly use engine factories that are tied to our internal security infrastructure. They create a single context and reuse it, because the parameters are known ahead of time. We also have one that pre-creates (some) contexts and then chooses the appropriate one based on the situation. I believe you would want something similar. I'm talking to the team now about whether we can/should open source that.
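
A rough sketch of that idea, assuming Finagle's `SslClientEngineFactory` contract (hypothetical names; this is not the internal factory being described):

```scala
import io.netty.buffer.ByteBufAllocator
import io.netty.handler.ssl.{SslContext, SslContextBuilder}
import com.twitter.finagle.Address
import com.twitter.finagle.ssl.Engine
import com.twitter.finagle.ssl.client.{SslClientConfiguration, SslClientEngineFactory}

// Build the Netty SslContext once; only the per-connection SSLEngine is new,
// so the context's session cache can be shared across connections.
object CachedContextClientEngineFactory extends SslClientEngineFactory {
  private[this] val cachedContext: SslContext =
    SslContextBuilder.forClient().build()

  def apply(address: Address, config: SslClientConfiguration): Engine =
    address match {
      case Address.Inet(isa, _) =>
        // Peer host/port enable session caching keyed by endpoint.
        Engine(cachedContext.newEngine(ByteBufAllocator.DEFAULT, isa.getHostString, isa.getPort))
      case _ =>
        Engine(cachedContext.newEngine(ByteBufAllocator.DEFAULT))
    }
}
```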

ryanoneill commented 4 years ago

Hey @spockz, yesterday I moved our ExternalClientEngineFactory into OSS Finagle. It's here. I recommend taking a look at it, and perhaps using it in your tests to see if it exhibits the behavior that you're expecting. If so, we can talk about how to write one that completely suits your use case.

spockz commented 4 years ago

Thank you for the example @ryanoneill. Caching the Netty `SslContext` indeed makes more sense, as the context creates the engine. Based on this input I created my own `ExternalContextClientEngineFactory`, which gets a pre-constructed Netty `SslContext` passed in. I also configured a client with the `SslContextClientEngineFactory`, which takes a javax `SSLContext`. The results are astonishing to me. The `SslContextClientEngineFactory` is roughly 2.5x faster than the Netty-based client engine factory. Moreover, when making many connections the `SslContextClientEngineFactory` is 10x faster per connection, indicating session re-use. The `ExternalContextClientEngineFactory` is as slow as the standard Netty4-based one.

[info] Benchmark                                                                   Mode  Cnt         Score         Error   Units
[info] SslConnectionPerformanceBench.externalSslContextClientEngineFactoryClient  thrpt   10        ≈ 10⁻⁷                ops/ns
[info] SslConnectionPerformanceBench.netty4ClientEngineFactoryClient              thrpt   10        ≈ 10⁻⁷                ops/ns
[info] SslConnectionPerformanceBench.sslContextClientEngineFactoryClient          thrpt   10        ≈ 10⁻⁶                ops/ns
[info] SslConnectionPerformanceBench.externalSslContextClientEngineFactoryClient   avgt   10  14935413.067 ±  607647.575   ns/op
[info] SslConnectionPerformanceBench.netty4ClientEngineFactoryClient               avgt   10  14478829.727 ± 1718539.667   ns/op
[info] SslConnectionPerformanceBench.sslContextClientEngineFactoryClient           avgt   10   1259970.414 ±  163441.278   ns/op
[info] SslConnectionPerformanceBench.externalSslContextClientEngineFactoryClient     ss   10  23296255.900 ± 2266746.677   ns/op
[info] SslConnectionPerformanceBench.netty4ClientEngineFactoryClient                 ss   10  23388622.700 ± 1750611.302   ns/op
[info] SslConnectionPerformanceBench.sslContextClientEngineFactoryClient             ss   10   8867196.300 ± 2869120.745   ns/op

Do you have any explanation for this? By the way, I had to copy some Finagle-private classes.

> We provide the Netty4ClientEngineFactory and Netty4ServerEngineFactory as the defaults because they're the safest in terms of correctness, but they aren't the best in terms of performance.

Can you share a bit more about why it is more "correct" and "safe" to use a context per connection instead of re-using one? Is it because settings might have changed on either side since the last connection was established?