rustls / rustls-platform-verifier

A certificate verification library for rustls that uses the operating system's verifier
Apache License 2.0
57 stars 18 forks source link

MacOS/Android AVD CI failures #68

Closed cpu closed 4 months ago

cpu commented 4 months ago

It looks like the MacOS CI jobs running the Android virtual device tests started failing on main ~yesterday (one, two). Kicking the jobs doesn't resolve the failure, so it's probably not a transient flake.

> Task :rustls-platform-verifier:connectedDebugAndroidTest
Starting 3 tests on test(AVD) - 9

org.rustls.platformverifier.CertificateVerifierTests > runRealWorldTestSuite[test(AVD) - 9] FAILED 
    org.junit.ComparisonFailure: A test failed. Check the logs above for Rust panics. expected:<[success]> but was:<[test failed!]>
    at org.junit.Assert.assertEquals(Assert.java:115)
Tests on test(AVD) - 9 failed: There was 1 failure(s).

Unfortunately the mentioned Rust panic logs are not being passed through (similar to https://github.com/rustls/rustls-platform-verifier/pull/27).

complexspaces commented 4 months ago

This seems like it would be a good time to try and move the Android jobs to Linux runners instead. We would need to enable KVM as well to keep decent performance.

cpu commented 4 months ago

Sounds reasonable :+1:

cpu commented 4 months ago

I was able to unbreak my local dev environment by pinning a specific android-ndk version (23.1.7779620) and when run locally the missing real world test suite panic logs do appear. It's the Let's Encrypt real world verification suite that seems to be failing in the Android VM (if we assume my local env matches what happens in CI). The relevant part was:

03-11 16:21:38.616  6704  6723 I rustls_platform_verif..: -----------------------------------------------------------------------------
03-11 16:21:38.616  6704  6731 I rustls_platform_verif..: verifying ref ID "letsencrypt.org" expected Ok(())
03-11 16:21:38.624  6704  6731 W rustls_platform_verif..: certificate was not trusted: java.security.cert.CertificateException: Chain validation failed
03-11 16:21:38.624  6704  6731 E rustls_platform_verif..: failed to verify TLS certificate: invalid peer certificate: UnknownIssuer
03-11 16:21:38.624  6704  6731 E rustls_platform_verif..: test panic: assertion `left == right` failed
03-11 16:21:38.624  6704  6731 E rustls_platform_verif..:   left: Err(InvalidCertificate(UnknownIssuer))
03-11 16:21:38.624  6704  6731 E rustls_platform_verif..:  right: Ok(())
03-11 16:21:38.625  6704  6723 E rustls_platform_verif..: real world: test failed
03-11 16:21:38.625  6704  6723 I rustls_platform_verif..: -----------------------------------------------------------------------------

The unknown issuer error is curious since (AIUI) the Android trust store shouldn't have changed out from under us and the chain is baked into the test.

On a lark I tried running the test data update script to fetch an updated chain. Only the leaf EE cert changed, and after updating the mock verification time to be after the new EE cert's NotBefore the test suite started passing for me locally. It also fixed the test in CI. Diffing the two EE certs I didn't spot anything obvious changing between them except things you would expect (validity period, SCTs, etc).

I'll put a PR with the updated test data, but something feels fishy here. For one thing, the verification time and chain are both intended to be fixed, so there shouldn't be an expiry issue. Even if the time is not fixed somehow in practice, why is it reporting as an unknown issuer error instead of an expiry error? Lastly, why are logs not showing up in CI when they work locally? I think there's something more worth understanding here... Edit: it's also curious the same tests w/ the same chain are passing on the other platforms.

complexspaces commented 4 months ago

With the linked PRs merged, the only items remaining in this issue are to determine:

cpu commented 4 months ago

Why the mock verification time needed changed for CI to pass

I think I understand that one: the new end-entity certificate has a NotBefore date that was after the previous fixed mock verification time:

[daniel@blanc:~/Code/Rust/rustls-platform-verifier]$ git rev-parse HEAD
d68c2edd4c5fb2aa4cde57f472de059e6966f64c

[daniel@blanc:~/Code/Rust/rustls-platform-verifier]$ openssl x509 -inform der -in rustls-platform-verifier/src/tests/verification_real_world/letsencrypt_org_valid_1.crt -noout -text | grep "Not Before"
            Not Before: Jan 27 21:30:22 2024 GMT

Our mock verification time was set to Wednesday, January 3, 2024 6:03:08 PM prior to the update, and so the new EE leaf was considered as having been issued in the future and not yet valid.

Why it failed with Unknown Issuer

This is indeed still a mystery. I'm going to try and investigate deeper this afternoon.

cpu commented 4 months ago

Why it failed with Unknown Issuer

This is indeed still a mystery. I'm going to try and investigate deeper this afternoon.

It turns out there's a simple explanation for what we've seen.

The fixed verification time is used up front on a call to X509Certificate.checkValidity, but not referenced in the later call to X509TrustManager.checkServerTrusted. The catch here is that checkServerTrusted also enforces certificate validity.

Since checkServerTrusted is using real wallclock time once the vendored Let's Encrypt certificate expired (~Feb 26th) it started returning an exception java.security.cert.CertPathValidatorException for the expired Let's Encrypt end-entity certificate. The first failing cron-invoked CI job was on Feb 27th so that tracks.

We get an unknown issuer error instead of a more sensible error about the expiry because while the earlier checkValidity call maps a caught exception into a StatusCode.Expired result, the catch for the checkServerTrusted call unconditionally converts it to a StatusCode.UnknownCert result:

https://github.com/rustls/rustls-platform-verifier/blob/d68c2edd4c5fb2aa4cde57f472de059e6966f64c/android/rustls-platform-verifier/src/main/java/org/rustls/platformverifier/CertificateVerifier.kt#L256

In this instance e.toString() is the message from the outermost generic CertificateException and just says "Chain validation failed". You'd need to dig in deeper to the nested exceptions to see the underlying certificate expiry message.

The fixed verification time avoids the checkValidity catch that would return StatusCode.Expired and falls into the checkServerTrusted catch that always returns StatusCode.UnknownCert.

That resolves the remaining mysteries but leaves a question about the best fix. It doesn't look like we can set the trust manager's reference time for checkServerTrusted :-/

cpu commented 4 months ago

That resolves the remaining mysteries but leaves a question about the best fix. It doesn't look like we can set the trust manager's reference time for checkServerTrusted :-/

At the very least I think we could at least return a better error so that in combination with the fixed CI logs it should be very obvious what the problem is (test data certificate expired) and what the fix is (run the tooling to update the test data). In practice the error won't be surfaced by this part of the code outside of tests since the earlier call to checkValidity will do the right thing before invoke the trust manager.

cpu commented 4 months ago

At the very least I think we could at least return a better error so that in combination with the fixed CI logs it should be very obvious what the problem is

Here's one attempt at this: https://github.com/rustls/rustls-platform-verifier/pull/75

I personally think it's worth landing this while we figure out a better solution to use a static verification time throughout, but I also acknowledge it's kind of gross :crying_cat_face:

Edit: probably https://github.com/rustls/rustls-platform-verifier/issues/59 or similar to fix the verification time issue is a better long term solution.

cpu commented 4 months ago

I'm going to close this now. I think the open PRs and #59 are probably enough to track any additional work here.

complexspaces commented 4 months ago

Sounds good and thank you for diving into this. If nothing else we've made some great Android testing improvements so far as a result even if we haven't solved everything yet.