google / conscrypt

Conscrypt is a Java Security Provider that implements parts of the Java Cryptography Extension and Java Secure Socket Extension.
Apache License 2.0

Spike in "Trust anchor for certification path not found" errors #881

Closed eygraber closed 3 years ago

eygraber commented 4 years ago

I've been getting a lot of reports from Crashlytics in the past week about the following issue affecting ~10% of my user base. I've never seen this issue myself, and it never happens in our dev server environment.

I ran the ssllabs.com and digicert.com analysis and there were no issues.

We do use cert pinning through Android.

It's happening on Android 7-10 and all the big devices and manufacturers.

network_security_config.xml

<?xml version="1.0" encoding="utf-8"?>
<network-security-config>
  <domain-config>
    <domain includeSubdomains="true">mydomain.com</domain>
    <pin-set>
      <!--
       NOTE recommendation is to have a second certificate to pin just in case the
       main one needs to be invalidated
       -->
      <pin digest="SHA-256">the_hash</pin>
    </pin-set>
  </domain-config>
</network-security-config>

Crash:

Caused by java.security.cert.CertPathValidatorException: Trust anchor for certification path not found.
       at com.android.org.conscrypt.TrustManagerImpl.checkTrustedRecursive(TrustManagerImpl.java:654)
       at com.android.org.conscrypt.TrustManagerImpl.checkTrusted(TrustManagerImpl.java:499)
       at com.android.org.conscrypt.TrustManagerImpl.checkTrusted(TrustManagerImpl.java:422)
       at com.android.org.conscrypt.TrustManagerImpl.getTrustedChainForServer(TrustManagerImpl.java:343)
       at android.security.net.config.NetworkSecurityTrustManager.checkServerTrusted(NetworkSecurityTrustManager.java:94)
       at android.security.net.config.RootTrustManager.checkServerTrusted(RootTrustManager.java:88)
       at com.android.org.conscrypt.Platform.checkServerTrusted(Platform.java:208)
       at com.android.org.conscrypt.ConscryptFileDescriptorSocket.verifyCertificateChain(ConscryptFileDescriptorSocket.java:426)
       at com.android.org.conscrypt.NativeCrypto.SSL_do_handshake(NativeCrypto.java)
       at com.android.org.conscrypt.NativeSsl.doHandshake(NativeSsl.java:383)
       at com.android.org.conscrypt.ConscryptFileDescriptorSocket.startHandshake(ConscryptFileDescriptorSocket.java:231)
       at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:367)
       at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:325)
       at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.java:197)
       at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:197)
       at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.java:249)
       at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.java:108)
       at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.java:76)
       at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.java:245)
       at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:32)
       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:100)
       at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:96)
       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:100)
       at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:83)
       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:100)
       at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:76)
       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:100)
       at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.java:219)
       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:100)
       at co.twenty.api.CustomPropertyInterceptor.intercept(CustomPropertyInterceptor.java:49)
       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:100)
       at co.twenty.api.DefaultParamsInterceptor.intercept(DefaultParamsInterceptor.java:44)
       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:100)
       at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.java:197)
       at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.java:502)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
       at java.lang.Thread.run(Thread.java:764)

yschimke commented 4 years ago

Did this get resolved?

eygraber commented 4 years ago

I'm still seeing it happen pretty frequently. We found a couple of users that it was happening to, and it seems to happen when they are on certain types of Wi-Fi networks (e.g. carrier Wi-Fi, college campus Wi-Fi, etc.).

yschimke commented 4 years ago

Is there a proxy involved?

eygraber commented 4 years ago

From what we can tell there was at least one instance of a proxy being present, but we couldn't get much information about it.

yschimke commented 4 years ago

Question for the Conscrypt team: at what layer does the pin-set get applied? Is it hooked into Android's internal Conscrypt, or the bundled Conscrypt? Reading this https://blog.nviso.eu/2019/04/02/circumventing-ssl-pinning-in-obfuscated-apps-with-okhttp/ suggests pinning with OkHttp is usually done via OkHttp's CertificatePinner.

prbprbprb commented 4 years ago

Apologies for the slow reply!

Certificate pinning in the Android platform (which is what is happening here) is largely external to Conscrypt. A higher-priority security Provider installs its own RootTrustManager, which is mostly a thin wrapper that delegates to Conscrypt. If an app declares a network security config in its manifest (e.g. certificate pinning), that wrapper handles that aspect.

The net result is that when Conscrypt is verifying the certificate chain and calls checkServerTrusted() it ends up in NetworkSecurityTrustManager.checkServerTrusted() which first delegates to Conscrypt to get the certificate chain and only then checks for pinning.

In the stack trace above, the exception is thrown whilst building the certificate chain and before pinning is checked, so it's likely that these failures are unrelated to the pinning.

The fact that it's suddenly started on Android 7 to 10 is also interesting as there have been no updates to Conscrypt or RootTrustManager on those platforms recently. Out of curiosity, have you seen any failures on Android 11?

Also, can you share any of your certificate chain (feel free to DM me), and the date the spike started? I think nothing else has changed so I'm wondering if one of the intermediates is cross-signed with both a recently expired certificate and a valid one. There have definitely been bug fixes to Conscrypt's trust chain building in that area but I thought they landed in Android 10 (or maybe even 9).

It's also possible that a subset of your production servers are misconfigured and not returning the correct chain (e.g. missing the intermediates in the chain). Less likely, because then you'd be seeing 100% failures from those servers which ought to stand out.

prbprbprb commented 4 years ago

Hmm, none of the root certificates in Android 10 have expired recently (most recent was May 30) so the cross-signed intermediate theory isn't looking good.

There are also proxies/middleboxes, but for those to suddenly affect 10% of your population seems weird.

That said, such breakages do happen... I recently investigated an issue where a noticeable proportion of TLS connections to Android OTA servers were failing. It turned out to be a misconfigured traffic-shaping middlebox on a telco's network, which was sending its own self-signed certificate instead of the target server's.

yschimke commented 4 years ago

Thoughts from @swankjesse were that it sounded like a MITM. Without more evidence, it's hard to do anything with this. The 10% could be because of either audience skew or traffic skew (increased load under failing conditions).

Anything like hosts or additional debug would help here.

If you have a user you can pair with, running a debug version of the app to get more information about certificates in the handshake (after setting insecureHost) could help. From OkHttp side I'd like to work out how to make this class of problems easy for app developers to debug without requiring opening a bug report.

eygraber commented 4 years ago

The spike started on August 24th and peaked on Sept 1st, which was also around the time we got ~10k users on our app through an affiliation with a college campus.

The occurrences have tapered off since then to 10s of times a day as opposed to 100s of times a day. Our active users have remained stable.

Our working hypothesis is that it's related to campus wifi, but that doesn't explain the gradual drop off in occurrences. One possible explanation is that the campus IT department may have counseled users to turn off wifi in order to use the app.

There were limited but consistent occurrences of this before Aug 24th; we have no hypothesis as to where those are coming from.

This is the breakdown of which OS versions it's happening on (out of a total of 5k occurrences):

43% Android 10
28% Android 9
19% Android 8
9% Android 7
1% Android 11

This is mostly proportional to the breakdown of OS version for users of the app (other than Android 11 but that's probably because it was released closer to the end of the time range I'm looking at).

I'm fairly confident at this point that it's not an issue with some of our servers being misconfigured, because we've undergone multiple security audits for certified compliance.

I can try to get the certificate chain, and will update if I can.

It's unlikely we'll get that level of interaction with a user that is experiencing the issue. We tried replicating locally in our production and dev environments and we haven't been able to.

yschimke commented 4 years ago

@eygraber If you really want to investigate, the type of solution you'd be looking at is detecting this class of problems, and prompting the user to allow you to collect SSL certificates, connect insecurely so you get a completed handshake and submit the cert chain to a central place. Probably logging to crashlytics?
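One way to detect this class of problem in app code is to walk the exception's cause chain and look for the `CertPathValidatorException` seen in the stack trace above. This is a minimal sketch, not an OkHttp or Conscrypt API: the function name `isCertPathFailure` and the depth limit of 16 are illustrative assumptions.

```kotlin
import java.io.IOException
import java.security.cert.CertPathValidatorException
import java.security.cert.CertificateException
import javax.net.ssl.SSLHandshakeException

/**
 * Returns true if [error] or any of its causes is a CertPathValidatorException.
 * The name and the depth limit are hypothetical choices for this sketch.
 */
fun isCertPathFailure(error: Throwable): Boolean =
    generateSequence(error) { it.cause }
        .take(16) // stop early if a cause chain ever loops back on itself
        .any { it is CertPathValidatorException }

fun main() {
    // Build an exception shaped like the one in the crash report.
    val root = CertPathValidatorException("Trust anchor for certification path not found.")
    val wrapped = SSLHandshakeException("handshake failed").apply {
        initCause(CertificateException(root))
    }
    println(isCertPathFailure(wrapped))                // true
    println(isCertPathFailure(IOException("timeout"))) // false
}
```

A check like this could gate the "collect certificates and report" flow so that it only runs for trust failures, not for ordinary network errors.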

eygraber commented 4 years ago

That's a good idea. I'll try something like that.

Any tips on how to connect insecurely if the network config has pinning setup?

yschimke commented 4 years ago

I think https://github.com/square/okhttp/blob/master/samples/guide/src/main/java/okhttp3/recipes/kt/DevServer.kt#L40 should work. But actually I'm worried it will just return a 0 length list of peerCertificates

  val clientCertificates = HandshakeCertificates.Builder()
      .addPlatformTrustedCertificates()
      .addInsecureHost(server.hostName)
      .build()

  val client = OkHttpClient.Builder()
      .sslSocketFactory(clientCertificates.sslSocketFactory(), clientCertificates.trustManager)
      .build()

If that doesn't work, I can dig into it a bit more, but it won't be at the top of my list. It's something I'd test with to try to improve the default error reporting in these cases, not a short-term workaround for you.

prbprbprb commented 4 years ago

Conscrypt is somewhat quirky here, although we may "fix" this in a far future[2] release, in that it adds the unverified certificates to the handshake SSLSession before verification[1]. So it ought to be possible to automate collection at the okhttp RealConnection level, possibly based on a flag or user setting. I.e. if socket.startHandshake() throws a certificate exception then socket.getHandshakeSession().getPeerCertificates() ought to return the failing certificate chain (untested, sorry). This is only true if the SSLSocket layer is Conscrypt... If the app is using Conscrypt's TrustManager with another transport layer such as Netty (e.g. gRPC does this) then the certificates won't be present. But it's a start.

I have to say I'm leaning more towards the middlebox theory too now that it corresponds with a spike in users, but without a failing certificate chain it's difficult to prove.

[1] I think largely due to the historical use of HttpsURLConnection's HostnameVerifier on Android, where only the SSLSession is passed in for verification. See the commentary on #867 for more gory details.

[2] Won't be any time soon because we can't break compatibility for apps that rely on [1].
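The idea described above could be sketched roughly as follows. This uses only the standard `javax.net.ssl` API. Whether the handshake session still exposes the peer chain after a failed handshake depends on the socket implementation, and, as the comment says, it is untested on Conscrypt. The helper name `handshakeCapturingChain` and its callback are hypothetical.

```kotlin
import java.security.cert.Certificate
import javax.net.ssl.SSLException
import javax.net.ssl.SSLSocket

/**
 * Runs the TLS handshake on [socket]. If it fails with an SSLException, tries to
 * read the (unverified) peer chain from the handshake session, hands it to
 * [onFailure], and rethrows. The chain may be empty on implementations that do
 * not populate the session before verification.
 */
fun handshakeCapturingChain(socket: SSLSocket, onFailure: (List<Certificate>) -> Unit) {
    try {
        socket.startHandshake()
    } catch (e: SSLException) {
        val chain: List<Certificate> = try {
            // getHandshakeSession() may return null or be unsupported.
            socket.handshakeSession?.peerCertificates?.toList().orEmpty()
        } catch (ignored: Exception) {
            emptyList()
        }
        onFailure(chain)
        throw e
    }
}
```

Only `SSLException` is intercepted here so that plain I/O errors such as timeouts pass through without triggering certificate logging.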

yschimke commented 4 years ago

Yep - grabbing the Conscrypt socket is demonstrated in some of our AndroidTests

https://github.com/square/okhttp/blob/master/android-test/src/androidTest/java/okhttp/android/test/OkHttpTest.kt#L167

      client = OkHttpClient.Builder()
          .eventListener(object : EventListener() {
            override fun connectionAcquired(call: Call, connection: Connection) {
              socketClass = connection.socket().javaClass.name
            }
          })

      response.use {
        assertEquals(Protocol.HTTP_2, response.protocol)
        assertEquals(200, response.code)
        // see https://github.com/google/conscrypt/blob/b9463b2f74df42d85c73715a5f19e005dfb7b802/android/src/main/java/org/conscrypt/Platform.java#L613
        when {
            Build.VERSION.SDK_INT >= 24 -> {
              // Conscrypt 2.5+ defaults to SSLEngine-based SSLSocket
              assertEquals("org.conscrypt.Java8EngineSocket", socketClass)
            }
            Build.VERSION.SDK_INT < 22 -> {
              assertEquals("org.conscrypt.KitKatPlatformOpenSSLSocketImplAdapter", socketClass)
            }
            else -> {
              assertEquals("org.conscrypt.ConscryptFileDescriptorSocket", socketClass)
            }
        }
        assertEquals(TlsVersion.TLS_1_3, response.handshake?.tlsVersion)
      }
...

So you can definitely get the Conscrypt socket, and therefore the handshake session.

prbprbprb commented 4 years ago

Fab... Note that if you are going to log the chain for debugging (either okhttp or @eygraber in his app) you'll want at least the Subject, Issuer, dates and any DNS subject alternative names.
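A small helper for logging those fields might look like the sketch below. `X509Certificate.getSubjectAlternativeNames()` returns a collection of two-element lists (an integer type tag and a value), and tag 2 is the dNSName type from RFC 5280. The function names `dnsNames` and `describe` are illustrative and not part of any library.

```kotlin
import java.security.cert.X509Certificate

/** GeneralName type tag for dNSName entries (RFC 5280). */
private const val SAN_TYPE_DNS = 2

/** Picks the DNS names out of the raw value returned by getSubjectAlternativeNames(). */
fun dnsNames(rawSans: Collection<List<*>>?): List<String> =
    rawSans.orEmpty()
        .filter { it.size >= 2 && it[0] == SAN_TYPE_DNS }
        .mapNotNull { it[1] as? String }

/** One-line summary with the subject, issuer, validity dates and DNS SANs. */
fun describe(cert: X509Certificate): String =
    "subject=${cert.subjectX500Principal.name} " +
        "issuer=${cert.issuerX500Principal.name} " +
        "notBefore=${cert.notBefore} notAfter=${cert.notAfter} " +
        "dnsSans=${dnsNames(cert.subjectAlternativeNames)}"
```

Calling `describe` on each element of the captured chain gives one log line per certificate, which is enough to compare against the chain the server is expected to serve.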

yschimke commented 4 years ago

Just confirmed you should be able to add an insecure host then make a request, capture the socket and read the certificates. Works on both Android and JDK.

https://github.com/square/okhttp/compare/master...yschimke:noplatformcerts?expand=1

Key bits are

      val clientCertificates = HandshakeCertificates.Builder()
          .addInsecureHost("httpbin.org")
          .build()

      client = OkHttpClient.Builder()
          // Install the insecure-host trust manager, as in the DevServer sample above.
          .sslSocketFactory(clientCertificates.sslSocketFactory(), clientCertificates.trustManager)
          .eventListenerFactory(clientTestRule.wrap(object : EventListener() {
            override fun connectionAcquired(call: Call, connection: Connection) {
              val socket = connection.socket() as SSLSocket

              // Print the fields needed to identify each certificate in the chain.
              socket.session.peerCertificates.forEach {
                val cert = it as X509Certificate

                println(cert.subjectDN)
                println(cert.issuerDN)
                println(cert.notAfter)
                println(cert.subjectAlternativeNames.orEmpty().toList())
              }
            }
          }))
          .build()

If you keep digging and find the answer, please follow up with the explanation, whether that is confirmation of a MITM on the college network or a badly configured proxy, DNS, etc.

daulet commented 3 years ago

Please reopen if this is still an issue.

eygraber commented 3 years ago

@yschimke I am testing out my solution for this issue, and I'm using the following to simulate an https issue (not the one we're seeing in production, but just something to test the flow). It's failing when I have R8 enabled. By adding the following to my proguard config, the issue goes away. Just wanted to confirm that this would be the correct way to handle it:

-keepclassmembers class * implements javax.net.ssl.X509TrustManager {
  public java.util.List checkServerTrusted(java.security.cert.X509Certificate[], java.lang.String, java.lang.String);
}

yschimke commented 3 years ago

That makes sense, it comes from here

https://github.com/square/okhttp/blob/67f77be6b098efa0a8271b557891130eb7d83f5f/okhttp-tls/src/main/kotlin/okhttp3/tls/internal/InsecureAndroidTrustManager.kt#L37-L52

  /** Android method to clean and sort certificates, called via reflection. */
  @Suppress("unused", "UNCHECKED_CAST")
  fun checkServerTrusted(
    chain: Array<out X509Certificate>,
    authType: String,
    host: String
  ): List<Certificate> {
    if (host in insecureHosts) return listOf()
    try {
      val method = checkServerTrustedMethod
          ?: throw CertificateException("Failed to call checkServerTrusted")
      return method.invoke(delegate, chain, authType, host) as List<Certificate>
    } catch (e: InvocationTargetException) {
      throw e.targetException
    }
  }

eygraber commented 3 years ago

Should that rule be added to OkHttp, or is this not something that's supposed to be used in production code (aside from cases like testing something)?

yschimke commented 3 years ago

100% it's for test/dev only, but I don't think we assumed that proguard was the way to avoid that being used.

https://github.com/square/okhttp/blob/482f88300f78c3419b04379fc26c3683c10d6a9d/samples/guide/src/main/java/okhttp3/recipes/kt/DevServer.kt

  val clientCertificates = HandshakeCertificates.Builder()
      .addPlatformTrustedCertificates()
      .addInsecureHost(server.hostName)
      .build()

  val client = OkHttpClient.Builder()
      .sslSocketFactory(clientCertificates.sslSocketFactory(), clientCertificates.trustManager)
      .build()

    /**
     * Configures this to not authenticate the HTTPS server on to [hostname]. This makes the user
     * vulnerable to man-in-the-middle attacks and should only be used only in private development
     * environments and only to carry test data.
     *
     * The server’s TLS certificate **does not need to be signed** by a trusted certificate
     * authority. Instead, it will trust any well-formed certificate, even if it is self-signed.
     * This is necessary for testing against localhost or in development environments where a
     * certificate authority is not possible.
     *
     * The server’s TLS certificate still must match the requested hostname. For example, if the
     * certificate is issued to `example.com` and the request is to `localhost`, the connection will
     * fail. Use a custom [HostnameVerifier] to ignore such problems.
     *
     * Other TLS features are still used but provide no security benefits in absence of the above
     * gaps. For example, an insecure TLS connection is capable of negotiating HTTP/2 with ALPN and
     * it also has a regular-looking handshake.
     *
     * **This feature is not supported on Android API levels less than 24.** Prior releases lacked
     * a mechanism to trust some hosts and not others.
     *
     * @param hostname the exact hostname from the URL for insecure connections.
     */
    fun addInsecureHost(hostname: String) = apply {
      insecureHosts += hostname
    }

eygraber commented 3 years ago

We recently released a version with this debugging in, and got our first hit!

Non-fatal Exception: javax.net.ssl.SSLPeerUnverifiedException: Hostname <my host> not verified:
    certificate: sha256/<redacted>
    DN: CN=securelogin.<some university>.edu,O=<name of university>,L=<city of university>,ST=<state of university>,C=US
    subjectAltNames: [securelogin.<some university>.edu, www.securelogin.<some university>.edu]

Looks like the university network is MITMing us (I'm assuming it's a configuration thing and not intentional, because we partner with them, unless there's a bad actor somewhere).

I'll keep the debugging in for a few versions and see if anything else interesting pops up.

yschimke commented 3 years ago

@eygraber If you are building something intranet focused, you can probably be extra helpful here. When your app runs into this, pop up a webview. But you shouldn't have to do this.

I suspect there is some Wifi web auth process that is meant to kick in automatically. e.g. pop up a login dialog when the wifi connects but without internet access.

eygraber commented 3 years ago

I think that particular case is a wash for us (we're not building for their intranet).

I'd say it's very likely that this is the main cause of the errors that we're seeing, since it spiked around the time we partnered with that campus.

eygraber commented 3 years ago

Just started getting failure reports for connections that look like they should have worked. A bunch of network requests failed on an LG Stylo 5 running Android 9, not rooted.

subjectDN=OU=Go Daddy Class 2 Certification Authority, O="The Go Daddy Group, Inc.", C=US
issuerDN=OU=Go Daddy Class 2 Certification Authority, O="The Go Daddy Group, Inc.", C=US
subjectAlternativeNames=[]
notBefore=Tue Jun 29 13:06:20 EDT 2004
notAfter=Thu Jun 29 13:06:20 EDT 2034
now=2020-11-29T13:28:36.167-05:00[America/New_York]

eygraber commented 3 years ago

Here is what I get from https://www.digicert.com/help/

Subject: *..co
Valid from 10/Apr/2020 to 06/Apr/2021
Issuer: Go Daddy Secure Certificate Authority - G2

Subject: Go Daddy Secure Certificate Authority - G2
Valid from 03/May/2011 to 03/May/2031
Issuer: Go Daddy Root Certificate Authority - G2

Subject: Go Daddy Root Certificate Authority - G2
Valid from 01/Jan/2014 to 30/May/2031
Issuer: (blank)

Subject: (blank)
Valid from 29/Jun/2004 to 29/Jun/2034
Issuer: (blank)

chadbrubaker commented 3 years ago

Are you seeing that failure only on that specific device? All Android devices share the same CAs, so device-specific issues are pretty rare in my experience.

Just to check, as it's often a common cause of failures: is the server serving the entire certificate chain, including intermediates? While browsers cache intermediates and generally cover for missing certs, the Conscrypt trust manager only caches in your process, so it's much less likely to work with missing intermediates.
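A quick sanity check for a gap in a served chain is to confirm that each certificate's issuer name matches the subject name of the next one. The sketch below compares names only and does not verify signatures, so it is a rough diagnostic rather than real path validation. The `CertNames` type and `firstBrokenLink` function are hypothetical, and the names in `main` are placeholders.

```kotlin
import javax.security.auth.x500.X500Principal

/** Subject and issuer names of one certificate, e.g. taken from an X509Certificate. */
data class CertNames(val subject: X500Principal, val issuer: X500Principal)

/**
 * Returns the index of the first certificate whose issuer is not the subject
 * of the certificate that follows it, or -1 if every link matches by name.
 */
fun firstBrokenLink(chain: List<CertNames>): Int =
    chain.zipWithNext().indexOfFirst { (cert, next) -> cert.issuer != next.subject }

fun main() {
    val leaf = CertNames(X500Principal("CN=leaf.example"), X500Principal("CN=Intermediate CA"))
    val intermediate = CertNames(X500Principal("CN=Intermediate CA"), X500Principal("CN=Root CA"))
    val root = CertNames(X500Principal("CN=Root CA"), X500Principal("CN=Root CA"))

    println(firstBrokenLink(listOf(leaf, intermediate, root))) // -1
    println(firstBrokenLink(listOf(leaf, root)))               // 0 (intermediate missing)
}
```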

eygraber commented 3 years ago

It looks like our wildcard cert was configured differently than our regular one. We're looking into it, but that seems to be the issue.

eygraber commented 3 years ago

Turns out the wildcard cert was configured correctly, and is serving the full chain including intermediates, and this is still an issue.

There are a few requests that fail because they're being MITMed (campus Wi-Fi, corporate firewalls, etc.), but with our logging running for a few months now, it is clear that a large majority of the failures shouldn't have failed. Since November 15th 2020 we've had approximately 4k unique users run into this issue 15k times (we have 20k daily active users).

I tweaked the debug code to also log the certificate itself, and I compared the chain to our cert paths and everything matches. We don't allow rooted devices to use the app, and this is happening across every manufacturer and OS version we support (Android 7 - Android 11).

eygraber commented 3 years ago

@yschimke any ideas? We're seeing this issue at a rate of 1.5k per week, and we've exhausted all ideas on our side.

chadbrubaker commented 3 years ago

You don't see any particular bias toward specific versions/devices? Client-side issues should be tightly grouped to specific versions (e.g. pre-N the chain building is very naive and you need to be careful about the order of your intermediates), but a random smattering is very weird and usually more indicative of a server issue in my experience.

The only possible client-side cause I can think of is the clock on these devices being sufficiently off that your valid chain fails the date check. Are you logging what the device thinks the time is?
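To see how a skewed device clock interacts with the date check, here is a minimal sketch that classifies a device time against a certificate's validity window, in the same order as `X509Certificate.checkValidity(Date)`. The `Validity` enum and `validityAt` function are hypothetical. The dates in `main` are the leaf certificate's validity range and the logged `now` value from earlier in the thread, with midnight UTC assumed for the certificate dates.

```kotlin
import java.time.Instant
import java.util.Date

enum class Validity { NOT_YET_VALID, VALID, EXPIRED }

/** Classifies [deviceNow] against the certificate's validity window (bounds inclusive). */
fun validityAt(notBefore: Date, notAfter: Date, deviceNow: Date): Validity = when {
    deviceNow.before(notBefore) -> Validity.NOT_YET_VALID
    deviceNow.after(notAfter) -> Validity.EXPIRED
    else -> Validity.VALID
}

fun main() {
    val notBefore = Date.from(Instant.parse("2020-04-10T00:00:00Z"))
    val notAfter = Date.from(Instant.parse("2021-04-06T00:00:00Z"))

    val logged = Date.from(Instant.parse("2020-11-29T18:28:36Z")) // 13:28:36 -05:00
    val resetClock = Date.from(Instant.parse("2015-01-01T00:00:00Z"))

    println(validityAt(notBefore, notAfter, logged))     // VALID
    println(validityAt(notBefore, notAfter, resetClock)) // NOT_YET_VALID
}
```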

eygraber commented 3 years ago

As far as I can tell it's completely random:

By manufacturer:
54% samsung
16% Google
11% motorola
9% LGE
10% Other (7)

By OS version:
38% Android 10
26% Android 11
18% Android 9
12% Android 8
6% Other (1)

Those numbers line up with our overall distribution.

Regarding time, I ran some tests locally and it looks like a different exception is thrown. In any case I just looked at a sampling of the issues and all of the device times look correct.

chadbrubaker commented 3 years ago

Very odd. Can you share the URL or the bag of certificates you're serving? You can email it to me at cbrubaker (at) google if you don't want to post them here.

The only two sources of per-client behavior that aren't tied to things like versions or specific device bugs are the date check and users disabling trust in trust anchors, but that second thing has effectively zero usage and wouldn't explain the amounts you're seeing. How are you preventing non-standard devices? It's not uncommon for emulators and ROMs to masquerade to get past basic root checking.

eygraber commented 3 years ago

Sent an email.

We're not doing anything fancy to prevent root, so it's definitely something that can be worked around. I doubt that there's that volume of users doing that though.

larssn commented 2 months ago

I assume this was continued via email. Did you guys ever find out what caused it?