randombit / botan

Cryptography Toolkit
https://botan.randombit.net
BSD 2-Clause "Simplified" License
2.58k stars 566 forks source link

Asynchronous Certificate Validation #2368

Open hrantzsch opened 4 years ago

hrantzsch commented 4 years ago

In our TLS client, we encountered an issue with the synchronous nature of Callbacks::tls_verify_cert_chain. We were wondering how this could be addressed.

Let me briefly describe our use case: We override the default tls_verify_cert_chain mostly in order to use our own network stack. The function fetches certificates and makes OCSP requests, so it can be long-running. Unfortunately, we cannot profit from our asynchronous networking implementation and need to block a thread, because tls_verify_cert_chain needs to return synchronously.

We saw that issue #1856 asks for a similar enhancement and refers to the BoringSSL API. It turns out that BoringSSL provides and API for asynchronous verify callbacks as well:

To verify a certificate asynchronously, the callback may return ssl_verify_retry. The handshake will then pause with SSL_get_error returning SSL_ERROR_WANT_CERTIFICATE_VERIFY.

In Botan this would probably mean that TLS::Channels would throw/forward a specific exception, indicating it wants to be called again, if tls_verify_cert_chain says so. I'm not sure where exactly we'd need to pick up though, as the whole process is triggered by received_data.

Do you think it's worthwhile to provide such an interface and do you have any ideas regarding how we could best implement it?

As a side note, we considered less C-style ways to get our asynchronous callback as well. For instance, we could split received_data into two methods, one to provide the data and one method step, which users would call repeatedly while providing them with a way to add hooks. Or we could add a continuation_callback(verification_result) parameter to tls_verify_cert_chain, and split the process_*_msg methods that call tls_verify_cert_chain. But both ways seem less elegant though and would in fact be much harder to integrate with our asio stream.

randombit commented 4 years ago

This is definitely worth working towards - the number of TLS/DTLS users I know of in embedded contexts (eg, no threads and IP over CAN bus) is high and I'm sure all would appreciate improvements here.

My first thought would be to add a new callback which returns a future<void> representing the result (for discussion lets call it tls_async_verify_cert_chain). The default implementation of tls_async_verify_cert_chain would simply invoke tls_verify_cert_chain and then return the future. The default impl could even do this (when threads are available) in a sub-thread with std::async, which would remove OCSP retrieval from the critical path in normal use. Then we stash the future in the handshake state and finalize it just before considering the session to be established.

This would change somewhat the visible behavior of the stack in that a bad certificate might not cause an alert until later in the handshake. But since a bad cert is causing a fatal alert in any case, it probably does not matter too much as long as it happens before other callbacks (such as tls_session_established) are invoked.

Splitting up received_data into multiple functions doesn't seem viable since that would require changing all callers.

A continuation would be fine, though it would require a new callback so as to not break existing code. TBH probably the whole state machine would be simpler and easier to understand with a continuation based approach, though I don't have the stomach for that kind of refactoring right now.

Writing this up though, a simpler idea occurs to me. We don't actually require that tls_verify_cert_chain do anything at all. So what you can do is in your callback implementation, take the certificates and any relevant policy information, and queue it up for verification. In a later callback (say tls_session_established) you can check the result and at that point throw an exception out of the callback which will be translated into an alert. You would probably need to block at that point, if the verification hadn't completed, since you don't want to send application data to the wrong server. But that actually seems to match the characteristics of my first future proposal, but without having to change the library itself at all. Do you think this would work?

Ideally, we would not invoke tls_session_established until both streams of data (the ones associated with OCSP/cert retrieval + the main handshake flow) are completed. This would seem to require the library have logic for fully async cert verification, which would be useful in its own right.

hrantzsch commented 4 years ago

Thanks for the quick response and the suggestions.

Parallelizing OCSP/cert retrieval and the main handshake flow would already be a nice optimization, but at least for us it is actually more important to have tls_verify_cert_chain non-blocking. Both returning a future and delaying the decision until tls_session_established would be (potentially) blocking at some point.

We'd be interested in trying out if we can adapt the code without breaking the interface. We're just a little scared of the complexity of the TLS state machine. On that note, how do you assess the test coverage here? Are we good if tests are green?

I'll outline what we have in mind (for the client side).

tls_verify_cert_chain would use a specific exception to communicate that the handshake was paused for cert validation and the user wants to pick it up later again manually. So the user's code might look like this:

class CustomCallbacks : public TLS::Callbacks
   {
   void tls_verify_cert_chain(validationData) override
      {
      if (certCache.contains(validationData))
         {
         verify_or_throw(validationData);
         }
      else
         {
         certCache.prepare_async(validationData)  // fetch stuff asynchronously
             .then(client.continue_handshake());  // schedule a non-blocking continuation

         throw Botan::retry_handshake_exception(verify_certs_first);
         }
      }
   };

In tls::Client / tls::Channel we'd need to provide the continue_handshake interface (how to pick up state TBD). Furthermore, process_handshake_msg would need to be aware of the specific retry_handshake_exception, not causing a TLS alert but simply exit (possibly store some state?).

process_handshake_msg would then need to be able to pick up at a certain point in the state machine. It would either need to find out where to continue on its own, or continue_handshake would then need a way of telling it. Conceptually, we would manually make process_handshake_msg a "co-routine" that can yield control flow to the using code (i.e. don't block a thread with a synchronous callback) and resume where it left off after some user-defined action is performed.

This approach should be generic enough for us to add additional breaks later on, e.g. for tls_sign_message.

As for the interface of received_data, there shouldn't be any changes to existing code. As long as the using code doesn't throw the specific new exception in the callback, the method will simply continue in its current, synchronous fashion.

randombit commented 4 years ago

On that note, how do you assess the test coverage here? Are we good if tests are green?

To be honest the builtin tests (botan-test tls) are basically just smoke tests, can we complete a handshake with ourselves in varying circumstances. Something could be very wrong and those tests might not catch it. The shim-based tests using BoringSSL's hacked test stack are pretty good though. I'd like to get TLS-Attacker and/or tlsfuzzer into CI as well. And/or build my own test TLS stack, but I definitely do not have time for that right now.

hrantzsch commented 4 years ago

After looking into this a bit more, we didn't come up with a robust solution that doesn't require a larger refactoring. I'll just document our findings for the future.

The goal was to allow using code to interrupt the TLS handshake in order to verify certificates asynchronously. In the approach we tried, interrupting the handshake would be implemented by providing a custom exception, say verify_retry_exception, that using code could throw in it's implementation of the tls_verify_cert_chain callback.

In case of the TLS client, Botan calls tls_verify_cert_chain during the TLS handshake in Client::process_handshake_msg. In order to interrupt the handshake, this method would handle the verify_retry_exception by making all preparations needed to pick up the handshake again later and return control to the using code. Allowing the handshake to be picked up at this point is the tricky part here, as we must not only track the state of processing the handshake message(s), but also the state of processing the data in the user-provided buffer passed to received_data.

We got it kind-of-working by simply having the client remember how much data from the user-provided buffer it has already consumed whenever the interrupting exception comes up. However, this seems quite brittle. We'd have to make a lot of assumptions, such as that the user has to call received_data again with the exact same data buffer, when they want to continue the handshake.

We decided to leave it at that for now. Regardless of this specific issue, let us know when you have any concrete tasks we could address to help with the planned refactoring of the TLS internals.