Closed cpu closed 4 months ago
This seems like it would be a good time to try and move the Android jobs to Linux runners instead. We would need to enable KVM as well to keep decent performance.
Sounds reasonable :+1:
I was able to unbreak my local dev environment by pinning a specific android-ndk version (23.1.7779620) and when run locally the missing real world test suite panic logs do appear. It's the Let's Encrypt real world verification suite that seems to be failing in the Android VM (if we assume my local env matches what happens in CI). The relevant part was:
03-11 16:21:38.616 6704 6723 I rustls_platform_verif..: -----------------------------------------------------------------------------
03-11 16:21:38.616 6704 6731 I rustls_platform_verif..: verifying ref ID "letsencrypt.org" expected Ok(())
03-11 16:21:38.624 6704 6731 W rustls_platform_verif..: certificate was not trusted: java.security.cert.CertificateException: Chain validation failed
03-11 16:21:38.624 6704 6731 E rustls_platform_verif..: failed to verify TLS certificate: invalid peer certificate: UnknownIssuer
03-11 16:21:38.624 6704 6731 E rustls_platform_verif..: test panic: assertion `left == right` failed
03-11 16:21:38.624 6704 6731 E rustls_platform_verif..: left: Err(InvalidCertificate(UnknownIssuer))
03-11 16:21:38.624 6704 6731 E rustls_platform_verif..: right: Ok(())
03-11 16:21:38.625 6704 6723 E rustls_platform_verif..: real world: test failed
03-11 16:21:38.625 6704 6723 I rustls_platform_verif..: -----------------------------------------------------------------------------
The unknown issuer error is curious since (AIUI) the Android trust store shouldn't have changed out from under us and the chain is baked into the test.
On a lark I tried running the test data update script to fetch an updated chain. Only the leaf EE cert changed, and after updating the mock verification time to be after the new EE cert's NotBefore
the test suite started passing for me locally. It also fixed the test in CI. Diffing the two EE certs I didn't spot anything obvious changing between them except things you would expect (validity period, SCTs, etc).
I'll put a PR with the updated test data, but something feels fishy here. For one thing, the verification time and chain are both intended to be fixed, so there shouldn't be an expiry issue. Even if the time is not fixed somehow in practice, why is it reporting as an unknown issuer error instead of an expiry error? Lastly, why are logs not showing up in CI when they work locally? I think there's something more worth understanding here... Edit: it's also curious the same tests w/ the same chain are passing on the other platforms.
With the linked PRs merged, the only items remaining in this issue are to determine:
Unknown Issuer
Why the mock verification time needed changed for CI to pass
I think I understand that one: the new end-entity certificate has a NotBefore date that was after the previous fixed mock verification time:
[daniel@blanc:~/Code/Rust/rustls-platform-verifier]$ git rev-parse HEAD
d68c2edd4c5fb2aa4cde57f472de059e6966f64c
[daniel@blanc:~/Code/Rust/rustls-platform-verifier]$ openssl x509 -inform der -in rustls-platform-verifier/src/tests/verification_real_world/letsencrypt_org_valid_1.crt -noout -text | grep "Not Before"
Not Before: Jan 27 21:30:22 2024 GMT
Our mock verification time was set to Wednesday, January 3, 2024 6:03:08 PM prior to the update, and so the new EE leaf was considered as having been issued in the future and not yet valid.
Why it failed with Unknown Issuer
This is indeed still a mystery. I'm going to try and investigate deeper this afternoon.
Why it failed with Unknown Issuer
This is indeed still a mystery. I'm going to try and investigate deeper this afternoon.
It turns out there's a simple explanation for what we've seen.
The fixed verification time is used up front on a call to X509Certificate.checkValidity
, but not referenced in the later call to X509TrustManager.checkServerTrusted
. The catch here is that checkServerTrusted
also enforces certificate validity.
Since checkServerTrusted
is using real wallclock time once the vendored Let's Encrypt certificate expired (~Feb 26th) it started returning an exception java.security.cert.CertPathValidatorException
for the expired Let's Encrypt end-entity certificate. The first failing cron-invoked CI job was on Feb 27th so that tracks.
We get an unknown issuer error instead of a more sensible error about the expiry because while the earlier checkValidity
call maps a caught exception into a StatusCode.Expired
result, the catch
for the checkServerTrusted
call unconditionally converts it to a StatusCode.UnknownCert
result:
In this instance e.toString()
is the message from the outermost generic CertificateException
and just says "Chain validation failed". You'd need to dig in deeper to the nested exceptions to see the underlying certificate expiry message.
The fixed verification time avoids the checkValidity
catch that would return StatusCode.Expired
and falls into the checkServerTrusted
catch that always returns StatusCode.UnknownCert
.
That resolves the remaining mysteries but leaves a question about the best fix. It doesn't look like we can set the trust manager's reference time for checkServerTrusted
:-/
That resolves the remaining mysteries but leaves a question about the best fix. It doesn't look like we can set the trust manager's reference time for checkServerTrusted :-/
At the very least I think we could at least return a better error so that in combination with the fixed CI logs it should be very obvious what the problem is (test data certificate expired) and what the fix is (run the tooling to update the test data). In practice the error won't be surfaced by this part of the code outside of tests since the earlier call to checkValidity
will do the right thing before invoke the trust manager.
At the very least I think we could at least return a better error so that in combination with the fixed CI logs it should be very obvious what the problem is
Here's one attempt at this: https://github.com/rustls/rustls-platform-verifier/pull/75
I personally think it's worth landing this while we figure out a better solution to use a static verification time throughout, but I also acknowledge it's kind of gross :crying_cat_face:
Edit: probably https://github.com/rustls/rustls-platform-verifier/issues/59 or similar to fix the verification time issue is a better long term solution.
I'm going to close this now. I think the open PRs and #59 are probably enough to track any additional work here.
Sounds good and thank you for diving into this. If nothing else we've made some great Android testing improvements so far as a result even if we haven't solved everything yet.
It looks like the MacOS CI jobs running the Android virtual device tests started failing on
main
~yesterday (one, two). Kicking the jobs doesn't resolve the failure, so it's probably not a transient flake.Unfortunately the mentioned Rust panic logs are not being passed through (similar to https://github.com/rustls/rustls-platform-verifier/pull/27).