Failure to connect to server with Let's Encrypt certificate

jcoglan commented 2 years ago

Hi there 👋 This might be weird issue, I'm opening it because I use the SSL verification code from this library in faye-websocket and just discovered an issue with it.

The Faye website is https://faye.jcoglan.com/ and its TLS certificates are provided by Let's Encrypt's certbot. I found that the verification code rejected it even though my browsers think the site is fine, so I wanted to run this past someone else familiar with this code to see what you think.

If you run this code, you get a TLS failure:

require 'bundler/setup'
require 'faraday'

connection = Faraday.new('https://faye.jcoglan.com') do |conn|
  conn.adapter(:em_http)
end

connection.get('/')

To debug this, I made some adjustments to my copy of the TLS verification code and ended up with this demo that connects successfully:

require 'bundler/setup'
require 'eventmachine'
require 'openssl'

HOST = 'faye.jcoglan.com'

module Connection
  def connection_completed
    @cert_store  = OpenSSL::X509::Store.new
    @last_cert   = nil
    @last_verify = false

    @cert_store.set_default_paths

    start_tls(sni_hostname: HOST, verify_peer: true)
  end

  def ssl_verify_peer(cert_text)
    certificate = OpenSSL::X509::Certificate.new(cert_text)

    @last_cert   = certificate
    @last_verify = @cert_store.verify(certificate)

    p certificate

    if @last_verify
      store_cert(certificate)
      puts "\e[37;42m VERIFIED \e[0m"
    else
      puts "\e[37;41m UNVERIFIED \e[0m"
    end

    true
  end

  def ssl_handshake_completed
    unless @last_verify and OpenSSL::SSL.verify_certificate_identity(@last_cert, HOST)
      raise OpenSSL::SSL::SSLError,
            %(host "#{HOST}" does not match the server certificate)
    end
  end

  def store_cert(cert)
    @cert_store.add_cert(cert)
  rescue OpenSSL::X509::StoreError => error
    raise error unless error.message == 'cert already in hash table'
  end
end

EM.run { EM.connect(HOST, 443, Connection) }

The main way this differs from the implementation in Faraday is that it's ok for @cert_store.verify(certificate) to fail. If it fails, the certificate is not added to the X509::Store, but we don't raise an exception immediately. Instead we just require that the last certificate in the chain is verified, and that it passed hostname checking.

When I run this script, I see:

#<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name CN=ISRG Root X1,O=Internet Security Research Group,C=US>, issuer=#<OpenSSL::X509::Name CN=DST Root CA X3,O=Digital Signature Trust Co.>, serial=#<OpenSSL::BN:0x000000010f1df190>, not_before=2021-01-20 19:14:03 UTC, not_after=2024-09-30 18:14:03 UTC>
[ UNVERIFIED ]

#<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name CN=R3,O=Let's Encrypt,C=US>, issuer=#<OpenSSL::X509::Name CN=ISRG Root X1,O=Internet Security Research Group,C=US>, serial=#<OpenSSL::BN:0x000000010f1decb8>, not_before=2020-09-04 00:00:00 UTC, not_after=2025-09-15 16:00:00 UTC>
[ VERIFIED ]

#<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name CN=faye.jcoglan.com>, issuer=#<OpenSSL::X509::Name CN=R3,O=Let's Encrypt,C=US>, serial=#<OpenSSL::BN:0x000000010f1de808>, not_before=2021-10-21 15:10:22 UTC, not_after=2022-01-19 15:10:21 UTC>
[ VERIFIED ]

We have 3 certs here:

A: CN=ISRG Root X1,O=Internet Security Research Group,C=US
B: CN=R3,O=Let's Encrypt,C=US
C: CN=faye.jcoglan.com

The first cert, A, is not verified given the default paths for X509::Store. However, cert B is verified, and then cert C is also verified if cert B has been added to the store.

If A is added to the store, this causes cert C to fail verification. Likewise if cert B is not added, cert C fails to verify.

What I'm wondering is: is the implementation above valid? Is it ok for cert A to fail, if cert B passes verification and can then go on to verify C? Does this mean there is some chain of certs from my machine's trusted certs to the site's own cert, and that means it can be trusted? Or should the failure of cert A cause the connection to fail?

jsha commented 2 years ago

(wound up here from Twitter)

Certificate validation is a fairly tricky subject and is ideally left up to the TLS library. In particular I think the issue you're asking about is the difference between the chain (what the server sends you) and the path (what your certificate validator builds based on the leaf certificate, the untrusted contents of the chain, and our root store). Here are some blogs about that:

https://medium.com/@sleevi_/path-building-vs-path-verifying-implementation-showdown-39a9272b2820 https://medium.com/@sleevi_/path-building-vs-path-verifying-the-chain-of-pain-9fbab861d7d6

So I can get some context on the question: are you doing validation by hand as a way to work around verification failure? Or were you already doing validation for another reason, and are trying to debug why this specific case fails?

The first cert, A, is not verified given the default paths for X509::Store

That's because it's a cross-signed root, and the root it's signed by (DST Root CA X3) is expired, and macOS enforces that expiration. Well-behaved validation code should ignore the fact that A doesn't validate, so long as it can build a valid path to a trusted root (even if that path doesn't use any certificates from the chain).

jcoglan commented 2 years ago

So I can get some context on the question: are you doing validation by hand as a way to work around verification failure? Or were you already doing validation for another reason, and are trying to debug why this specific case fails?

@jsha I am using code from this library in faye-websocket to implement certificate verification, because eventmachine doesn't do that itself in its start_tls API. More context is in this blog post: https://blog.jcoglan.com/2020/07/31/missing-tls-verification-in-faye/

I am not an expert in this stuff and my priority is to make sure the library I'm shipping is safe. If it turns out Ruby's openssl bindings are at fault here for not trusting the certificates my site is presenting, so be it.

jcoglan commented 2 years ago

I would note that net/http succeeds in fetching the same URL, i.e. this script succeeds:

require 'uri'
require 'net/http'

uri = URI.parse('https://faye.jcoglan.com/')

puts Net::HTTP.get_response(uri)

I'm not sure what net/http does different and can't figure out how it uses the X509::Store class from browsing the code.

jcoglan commented 2 years ago

@jsha Your article reflects roughly my understanding of what's going on, without being an expert -- I assumed that the client is given a set of certificates, and as long as it can find some path from a cert it trusts (something in the X509 default store) to the server's cert, things are ok.

Does that mean my script is doing something valid when it sees that it can verify cert 2, and use that to verify cert 3, and therefore cert 3 is trustworthy, even though cert 1 is untrusted? Or have I misunderstood?

jsha commented 2 years ago

My Ruby is a bit rusty and in particular I'm not familiar with EventMachine, but I suspect net/http doesn't reference X509::Store because it lets OpenSSL do the verification itself - by making sure verify_mode is VERIFY_PEER. In that situation OpenSSL will use whatever the default trust store is, but you can override that with ca_path / ca_file.

I see you have this call:

start_tls(sni_hostname: HOST, verify_peer: true)

Is that not enough to turn on certificate validation? Why do you also need to implement ssl_verify_peer?

I assumed that the client is given a set of certificates, and as long as it can find some path from a cert it trusts (something in the X509 default store) to the server's cert, things are ok.

This is how things are supposed to work, but per the article don't quite work that way. And the default behavior differs based on which version of OpenSSL. What version are you running?

jcoglan commented 2 years ago

Is that not enough to turn on certificate validation? Why do you also need to implement ssl_verify_peer?

It's an implementation decision by EventMachine not to fully implement certificate verification itself, but to provide a callback for the client to do it, because different protocols that run atop TLS have different requirements here, I believe. Faraday originally spotted this and implemented the verification code I'm referring to in this issue.

What version are you running?

I'm not actually sure which version Ruby is linked against. I'm on macOS 11.6 and installed ruby using sudo ruby-install -r /opt/rubies ruby 2.7.5 no other configuration involved.

jsha commented 2 years ago

  def ssl_verify_peer(cert_text)
    certificate = OpenSSL::X509::Certificate.new(cert_text)

    @last_cert   = certificate
    @last_verify = @cert_store.verify(certificate)

One thing I notice is missing from the call to verify() is the chain (see https://ruby-doc.org/stdlib-2.5.1/libdoc/openssl/rdoc/OpenSSL/X509/Store.html). That will be required for this to work right; but as I said before, if you can avoid overriding ssl_verify_peer, and just rely on verify_peer: true, that is much better.

jcoglan commented 2 years ago

I have just checked this and to confirm: passing verify_peer: true to start_tls is not sufficient to check the hostname. If you skip the validation in the other callbacks and connect to a server with broken certificates, the connection succeeds.

jsha commented 2 years ago

Ah, and now that I'm coming to understand this code a bit better:

  def store_cert(cert)
    @cert_store.add_cert(cert)
  rescue OpenSSL::X509::StoreError => error
    raise error unless error.message == 'cert already in hash table'
  end

I'm pretty sure this is incorrect. This adds the peer's certificate to your store of trusted roots. In general you should not need to modify the Store after you have initialized it with a set of trusted roots.

jcoglan commented 2 years ago

One thing I notice is missing from the call to verify() is the chain

I think the chain is built internally by adding certs to the store via OpenSSL::X509::Store#add_cert. If I don't add cert 2 to the store in this way, then cert 3 fails to verify.

jcoglan commented 2 years ago

I'm pretty sure this is incorrect. This adds the peer's certificate to your store of trusted roots. In general you should not need to modify the Store after you have initialized it with a set of trusted roots.

Ah ok, let me try passing the chain as you suggest and see what happens.

jcoglan commented 2 years ago

This produces the same result: cert 1 is untrusted, the other two are trusted:

require 'bundler/setup'
require 'eventmachine'
require 'openssl'

HOST = 'faye.jcoglan.com'

module Connection
  def connection_completed
    @cert_store  = OpenSSL::X509::Store.new
    @last_cert   = nil
    @last_verify = false

    @cert_store.set_default_paths

    start_tls(sni_hostname: HOST, verify_peer: true)
  end

  def ssl_verify_peer(cert_text)
    certificate = OpenSSL::X509::Certificate.new(cert_text)

    @last_cert   = certificate
    @last_verify = @cert_store.verify(certificate, @cert_store.chain)

    p certificate

    if @last_verify
      puts "\e[37;42m[ VERIFIED ]\e[0m"
    else
      puts "\e[37;41m[ UNVERIFIED ]\e[0m"
    end

    true
  end

  def ssl_handshake_completed
    unless @last_verify and OpenSSL::SSL.verify_certificate_identity(@last_cert, HOST)
      raise OpenSSL::SSL::SSLError,
            %(host "#{HOST}" does not match the server certificate)
    end
  end
end

EM.run { EM.connect(HOST, 443, Connection) }

jsha commented 2 years ago

Looking at the EventMachine code, I start to see the problem. It has a different meaning for verify_peer than OpenSSL (including Ruby's OpenSSL bindings). In OpenSSL when you turn on "verify_peer", it means "verify peer using the default validator and default roots." However, in EventMachine "verify_peer" means "verify peer using a custom callback." This is a bad API shape, since custom certificate validation is very hard to get right, as you're seeing:

https://github.com/eventmachine/eventmachine/blob/8e1d6b11fd8400593af035a7a0d203d24c10c9b0/ext/ssl.cpp#L423-L428

EventMachine makes a callback in C++, which AFAICT calls out to the Ruby callback: https://github.com/eventmachine/eventmachine/blob/8e1d6b11fd8400593af035a7a0d203d24c10c9b0/ext/ssl.cpp#L675-L698

Ideally that Ruby callback should take a cert and a chain. But it seems to only take a cert. Also, if you look at the second chunk of C code, it gets the cert using X509_STORE_CTX_get_current_cert. I'm not that familiar with OpenSSL's API here, but those docs say "X509_STORE_CTX_get_current_cert() returns the certificate which caused the error or NULL if no certificate is relevant to the error."

So it seems like there are two API bugs in EventMachine:

it should support a default certificate validator which uses OpenSSL's validator
for custom validators, it needs to pass the cert and the chain to to callback

And possibly one implementation bug:

it is getting the cert from the wrong call. I think it probably meant to use SSL_get_peer_certificate

From your code:

@last_verify = @cert_store.verify(certificate, @cert_store.chain)

This seems incorrect. The docs for Store::chain say:

The certificate chain constructed by the last call of verify.

In other words, it's not the chain sent by the peer.

It's not immediately clear to me that you can fix this without an upstream fix in EventMachine.

By the way, thanks for working on this! Certificate validation code can be a real slog to work with, but I'm glad you're helping make the ecosystem more secure.

jsha commented 2 years ago

It looks like EventMachine does expose an accessor get_peer_cert: https://www.rubydoc.info/github/eventmachine/eventmachine/EventMachine%2FConnection:get_peer_cert. But it doesn't expose the corresponding get_peer_chain it would need in order to make it feasible to write your own certificate validation. (And again, it's unreasonable for EventMachine to say users must implement their own certificate validation; it should allow them to delegate that thankless task to OpenSSL)

jcoglan commented 2 years ago

Yeah this reminds me of a lot of what you'll find if you dig in the issue threads linked in https://blog.jcoglan.com/2020/07/31/missing-tls-verification-in-faye/ that go back years. Long story short: this isn't getting fixed in EventMachine. It's possible the ideal solution is to migrate off of EM entirely, but that would break everything depending on my libraries. My priority is to make this as safe as reasonably possible given EM is probably not going to change its behaviour.

jcoglan commented 2 years ago

Do you know any way I could make this safer? Is it fundamentally bad to ignore the verification failure on cert 1? Or am I fine letting this connection through because I managed some string of successful verify() calls to end at the site's own certificate?

jcoglan commented 2 years ago

The explanation from Let's Encrypt about what these certs are for seems to imply clients are expected to accept either the first or second cert in the trace above, and so failing on one of them is normal?

jsha commented 2 years ago

I would ignore the Let's Encrypt blog post, and indeed my early speculation on this thread. The problem isn't really about the cross-signed root expiration.

The fundamental problem is this: in ssl_verify_peer you have a leaf certificate (aka end-entity certificate). In the default root store you have a set of roots. In between there is a whole universe of intermediates that are trusted only by virtue of being signed by a root.

In order to validate a peer certificate, you need to build a path from the leaf certificate, through some number of intermediates, to some root certificate. But where do you get those intermediates? Normally they are in the cert chain provided during the TLS handshake, but EventMachine neglected to allow you access to those.

In theory you could bundle a list of common intermediates and give those to OpenSSL as untrusted inputs to the verification process. But that's very fragile - you'd need to update the intermediates all the time. CAs change their intermediates regularly and without advance notice, and expect clients to still be able to verify the new intermediate (via the TLS chain mechanism).

You could provide a way for your users to input a list of regularly-updated intermediates. Still fragile but at least provides a way to fix things when they break.

You could offer your users an interface that says "verify the peer certificate by checking its bytes match this hardcoded certificate we expect." But that's also fragile - it would break every time the peer rotated its certificate, which should be often.

Here's a really hacky idea: you could make your own connection to the peer, collect the chain, and provide that as untrusted input to the certificate verification.

jcoglan commented 2 years ago

Normally they are in the cert chain provided during the TLS handshake, but EventMachine neglected to allow you access to those.

Isn't this what EM is providing when it repeatedly invokes ssl_verify_peer on the connection? For this one connection, that method is called three times, and one of those certs (2) is trusted by my default root certs. Then it signs cert 3 (the leaf) and so the site is trusted.

Have I misunderstood something here?

In the case of my library there is also an escape hatch in the form of an option letting the caller provide their own root CA file, and we fall back to set_default_paths if that's not used.

jsha commented 2 years ago

Isn't this what EM is providing when it repeatedly invokes ssl_verify_peer on the connection? For this one connection, that method is called three times, and one of those certs (2) is trusted by my default root certs. Then it signs cert 3 (the leaf) and so the site is trusted.

Aha, that makes sense. I was misunderstanding the API. So, it may be possible to do something reasonable here. I can't say exactly what the right thing is at the moment, but it should be possible. I can take a closer look later this week.

jcoglan commented 2 years ago

Thanks @jsha, this was really helpful. Absolutely no obligation to dig into this any further :)

jsha commented 2 years ago

The rough idea I'm thinking of is: accumulate the chain as you receive each subsequent call to ssl_verify_peer, and then verify the whole thing in ssl_handshake_completed. But note that I'm pretty unfamiliar with all these APIs, so please confirm with the documentation and so on. :-)

mackuba commented 6 months ago

Hey @jcoglan, hello from 2024 - have you managed to figure this one out? 😅

I'm using EventMachine, faye-websocket and now also em-http-request in my project, and I've come across an issue that I suspect might be the same thing as you've discussed here…

The specific URL I'm trying to load is this one: https://genco.me/.well-known/did.json

open-uri / Net::HTTP loads it just fine. But the em-http-request code (copied from faraday apparently) throws: OpenSSL::SSL::SSLError unable to verify the server certificate for "genco.me". I thought this was maybe a matter of not setting the certificate store file properly, but it seems to be using the same kind of store as Net::HTTP in the same way.

I don't know that much about SSL certificate verification and to be honest I'd prefer not to, since I think this is something that's best not to do unless you're sure you're doing it right - I don't understand why EventMachine can't just delegate to the same verification code that Net::HTTP is using… or maybe it can be told to?

jcoglan commented 5 months ago

@mackuba It looks like that domain uses Let's Encrypt certificates which was also the reason that faye-websocket had to change its SSL verifier. We originally based ours on the Faraday one, which wanted to check that every certificate received was trusted. This is not always the case; since not all clients agree on their default set of trusted CAs, some servers will send you multiple copies of their certificate signed by different root CAs, hoping you'll trust one of them, and then you can form a chain from there to your domain's certificate.

This is the change we made in https://github.com/faye/faye-websocket-ruby/commit/d9428fae72240bfa8b98627a0414524814b92f22: rather than verify every certificate, we check that there is some chain of certificates from the client's trusted root CAs to the domain certificate, and that the final certificate matches the hostname for the request. The current implementation of our verifier is here: https://github.com/faye/faye-websocket-ruby/blob/0.11.3/lib/faye/websocket/ssl_verifier.rb

I'm not sure why this is not baked into EventMachine but there's not much we can do about this. This does something very similar to what's inside Net::HTTP, particularly the use of OpenSSL::X509::Certificate, OpenSSL::X509::Store#{set_default_paths,verify,add_cert}, and OpenSSL::SSL.verify_certificate_identity.

mackuba commented 5 months ago

@jcoglan ok, thanks! Tbh, I've since then rewritten this code again and moved back to sync HTTP with Net::HTTP, because I've found a way to do it this way and it makes the code simpler 😉 But this should be helpful if I decide to use em-http again 👍

lostisland / faraday-em_http

Failure to connect to server with Let's Encrypt certificate #5