k8ssandra / management-api-for-apache-cassandra

RESTful / Secure Management Sidecar for Apache Cassandra
Apache License 2.0
71 stars 51 forks source link

SSL Context Hot Reloading doesn't work as expected #507

Closed lincolnlog4 closed 2 months ago

lincolnlog4 commented 3 months ago

SSLContext reload does not seem to be updated when cert files are changed. It seems that even though a log message is displayed that changes have been detected in the files, the actual SSLContext for the server remains unchanged.

Steps to reproduce.

  1. Create a short lived cert that will expire in a few minutes to use as a cert for the server
  2. Update cert to a longer lived cert
  3. Verify the Detected change in the SSL/TLS certificates, reloading log occurs
  4. Execute a local curl command e.g. curl --cert tls.crt --key tls.key --cacert ca.crt https://localhost:8080/api/v0/lifecycle/pid with updated certs
  5. Output of above command will be SSL certificate problem: certificate has expired even though the certs should have been updated to the long-lived cert

Restarting the management api process (and therefore restarting Cassandra) will result in the process picking up the new cert, however, it would be great if the restart was not necessary

lincolnlog4 commented 3 months ago

I think the io.grpc.util.AdvancedTlsX509KeyManager could also be used and built directly into the SslContext rather than the manual watcher that is set up now. Something like:

AdvancedTlsX509KeyManager keyManager = new AdvancedTlsX509KeyManager();
keyManager.updateIdentityCredentialsFromFile(tlsCert, tlsKey, 10, MINUTES, executor);
SslContextBuilder.forServer(keyManager)
            .trustManager(tlsCaCert)
            .clientAuth(ClientAuth.REQUIRE)
            .ciphers(null, IdentityCipherSuiteFilter.INSTANCE)
            .build();
lincolnlog4 commented 3 months ago

I see @burmanm implemented the original hot reloading, any thoughts?

burmanm commented 3 months ago

Hey, I haven't had time to investigate this yet. I think the message "SSL certificate problem: certificate has expired" is displayed by the curl even if the server would have correct certificates - if the client certificates have expired (or the image's system certificates), so alone it doesn't tell us much sadly which certificates are old.

The gRPC's same code wouldn't make any difference, because it would simply end up with the same situation (it's essentially doing the same thing). If there's a bug, it would be in not using the new SSLContext, since clearly it would still somehow load the old ones. Also, the construction of the SSLContext should be correct since you got it to work the first time as it's the same code. Thus, the possible bug would have to be in the Netty pipeline using the old SSLContext object. This is something that could be logged and verified though.

lincolnlog4 commented 3 months ago

In regards to the curl command, the provided client certs were not expired as they were updated to the long-lived certs and this was verified using openssl. The server reports an error in the SSLHandshake (no mention of expired certs) as the client connection initiates a close of the connection when it has determined the server cert is expired.

I believe the gRPC code and the existing logic actually have slightly different behaviors. The gRPC code woud use the KeyManager of the original SSLContext to monitor/update the server's certificates so that when they are updated it will be reflected in the original SSLContext that was used to start the server. The existing hot reload code attempts to create a new SSLContext and replace the existing one, but it is not possible to set a new SSLContext unless the server is restarted.

Here's a relevant stack overflow for reloading SSLContexts: https://stackoverflow.com/questions/51411594/reload-netty-servers-ssl-context-for-grpc

burmanm commented 3 months ago

I don't think that's needed in this case, it seems overkill. Since we use initChannel, it should reload on every new connection. Thus, it seems the only bug is that the context to the SSLContext was incorrect one and we need to keep updating it instead of simply refreshing the SSLContext.