hashicorp / vault-plugin-auth-kubernetes

Vault authentication plugin for Kubernetes Service Accounts
https://www.vaultproject.io/docs/auth/kubernetes.html
Mozilla Public License 2.0
206 stars 62 forks source link

Ensure a consistent TLS configuration #173

Closed benashz closed 1 year ago

benashz commented 1 year ago

Previously, it was possible for the http.Client's Transport to be missing the necessary root CAs to ensure that all TLS connections between the auth engine and the Kubernetes API were validated against a configured set of CA certificates.

This fix ensures that the http.Client's Transport is always consistent with the configured CA cert chain, by introducing a periodic TLS configuration checker that is started as part of the backend's initialization.

Other fixes:

Closes #169 #170

anthonyralston commented 1 year ago

Hey @benashz, do you have an ETA for when this PR will make it out in a Vault release? As mentioned in https://github.com/hashicorp/vault-plugin-auth-kubernetes/issues/169, we're seeing this issue in our production Vault cluster.

benashz commented 1 year ago

Hi @anthonyralston , we are hoping to have this PR merged this week. I can't say for sure when the Vault release will be out. We should have more information about that soon.

tsaarni commented 1 year ago

I'd be curious to learn why is timer based polling needed?

In the scenario where the CA certificate is read from a local file, the logic in this PR might have some unintentional interactions with the previous logic in cachingFileReader.ReadFile(). The reader was introduced to periodically reload the token (and CA) from disk (#122). The caller always calls ReadFile() and in case when the file was already loaded, then "fast path" implementation returns a copy from memory. When configured caching period has passed, then it proceeds with "slow path" and re-reads it from the disk.

If I understood correctly, the go routine in this PR would result in a call ReadFile() once every 30 seconds? Since caching for CA cert file is set up for 1 hour, wouldn't it mean that actual disk read still happens only every 120th tick i.e. once every hour? https://github.com/hashicorp/vault-plugin-auth-kubernetes/blob/eabe60240605c4cc3c2d73038931c2fbf47ff6aa/backend.go#L44-L46

In general, I suppose CA certificate is not updated too often. It is likely very long-lived certificate. Long(ish) 1 hour period made sense if having reload at all. I guess the whole reload logic only became relevant now due to the error in storage backend - causing it not to be loaded at all.

I guess the TLSClientConfig could simply be initialized each time getHTTPClient() is called (either from caching reader or config) without too much impact on performance?

benashz commented 1 year ago

If I understood correctly, the go routine in this PR would result in a call ReadFile() once every 30 seconds? Since caching for CA cert file is set up for 1 hour, wouldn't it mean that actual disk read still happens only every 120th tick i.e. once every hour?

In the case where the CA source is a local file, then yes. Ideally the read would only happen whenever the file has changed, not based on an arbitrary TTL. There would only ever be a 30 second delay between CA certificate rotation in that case.

I guess the TLSClientConfig could simply be initialized each time getHTTPClient() is called (either from caching reader or config) without too much impact on performance?

Yes, that is definitely one approach, but we would prefer to have the setup/update of the tlsConfig be decoupled from the HTTP client's use.

tsaarni commented 1 year ago

In the case where the CA source is a local file, then yes. Ideally the read would only happen whenever the file has changed, not based on an arbitrary TTL. There would only ever be a 30 second delay between CA certificate rotation in that case.

In the other use case, where CA is in config: does it make sense to poll storage every 30 seconds, since storage cannot change without us knowing about it?

tsaarni commented 1 year ago

In the case where the CA source is a local file, then yes. Ideally the read would only happen whenever the file has changed, not based on an arbitrary TTL. There would only ever be a 30 second delay between CA certificate rotation in that case.

In the other use case, where CA is in config: does it make sense to poll storage every 30 seconds, since storage cannot change without us knowing about it?

Or is it so that it will poll storage every 30 seconds, regardless of CA being on disk or in config? Is that OK in performance wise?

benashz commented 1 year ago

In the case where the CA source is a local file, then yes. Ideally the read would only happen whenever the file has changed, not based on an arbitrary TTL. There would only ever be a 30 second delay between CA certificate rotation in that case.

In the other use case, where CA is in config: does it make sense to poll storage every 30 seconds, since storage cannot change without us knowing about it?

It is possible that a config update might fail to update the TLS config, in this case the poller will ensure that the issue is mitigated without the need for user intervention. We feel that the cost of an extra storage request and cert parsing is worth it here.

tsaarni commented 1 year ago

Ok, fair enough! The issue could be avoided e.g. by checking only at getHTTPClient() when the config is anyways fetched - and which is also the only moment when the TLSClientConfig is actually used. I was thinking maybe the original problem is relatively infrequent issue, and therefore not worthy of having a background job that will run constantly regardless of the information being used (or changed).