giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

Alerting on Root CA key access #3224

Open gawertm opened 7 months ago

gawertm commented 7 months ago

Trigger an alert, and possibly also a page, in case Falco logs an unauthorized access to the root CA key files.

In case the root CA key is accessed, the root CA (and all certificates signed by it) needs to be rotated. That is not a trivial process, especially since the CA is also stored in a secret on the MC, so it needs to be rotated both in the WC and the MC.

The best course of action to replace the leaked CA key is yet to be determined. It is possible that a new WC will have to be created and all data and workloads will need to be migrated from the old WC to the new one.

vvondruska commented 3 months ago

There are several ways to trigger an alert based on messages logged by Falco. The most straightforward one is to expose the Falco logs as metrics in Prometheus. This can be achieved by installing and running Falco Exporter next to Falco. The main downside of this approach is that Prometheus metrics created from Falco logs sometimes do not contain all details available in the log messages, and therefore it may not be possible to filter out legitimate access to the root CA key based on data available in the Prometheus metrics.

Another way to trigger alerts could be to send the log messages from Falco to Loki, and trigger the alerts from there. It should be possible to get logs with full details from Falco to Loki, so more data should be available to compose rules for the alerts.

It is also possible to filter out the legitimate access directly in Falco. That way Falco would only log a message in case it detects an illegitimate access to the CA file, and therefore all log messages could safely trigger an alert. So, if the filter is defined in the Falco rule, the straightforward way to fire alerts can be used.
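
Whether the filter lives in the Falco rule or is applied to the logs later, the idea is the same: compare the accessing process against an allowlist and alert on everything else. A minimal Python sketch of that check, assuming Falco is configured for JSON output (events then carry `rule` and `output_fields`). The rule name, the allowlisted process and the key path below are placeholders, not values from the actual Falco app configuration:

```python
import json

# Hypothetical values: the rule name and the allowlist are placeholders,
# not taken from the real Falco rule set.
RULE_NAME = "Root CA key access"
ALLOWED_PROCESSES = {"kube-controller-manager"}  # assumed legitimate reader


def should_alert(log_line: str) -> bool:
    """Return True if a Falco JSON log line reports a non-allowlisted access."""
    event = json.loads(log_line)
    if event.get("rule") != RULE_NAME:
        return False  # unrelated Falco rule, not our concern
    fields = event.get("output_fields", {})
    return fields.get("proc.name") not in ALLOWED_PROCESSES


# Example event; the file path is a made-up placeholder.
sample = json.dumps({
    "rule": RULE_NAME,
    "priority": "Critical",
    "output_fields": {"proc.name": "cat", "fd.name": "/path/to/ca.key"},
})
print(should_alert(sample))
```

If the equivalent condition is placed in the Falco rule itself, every remaining log message can trigger an alert and no downstream filtering like this is needed.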

vvondruska commented 3 months ago

Another thing to consider is performance. Exposing metrics to Prometheus requires Falco Exporter, which is an additional workload that consumes resources available in the WC, whereas sending logs to Loki may already work out of the box without the need to install any additional components. However, Falco Exporter is enabled by default in the Falco app configuration, so it is also present in the WC by default.

So, we can take the straightforward approach to alerting and go with Prometheus, and in case we encounter performance issues we can change to Loki.

vvondruska commented 2 months ago

As an initial solution, metrics will be exposed to Prometheus via Falco Exporter (which is enabled by default). Legitimate access to the root CA private key file will be filtered out directly in the Falco rule definition.

Later on we may switch to firing alerts based on logs.

vvondruska commented 2 months ago

Gathering info about processes that require legitimate access to the root CA private key, so that they can be filtered out in the Falco rule.

vvondruska commented 1 month ago

Tried a few ways to force legitimate access to the root CA private key, but none has been logged so far. It may be because the Kubernetes component that signs the certificates reads the key when it starts, caches it in memory and loads it from the cache when it needs to sign a new certificate. If that is the case, then the private key file does not need to be read every time a new certificate is signed, and as a result there will be very few legitimate accesses recorded by Falco, if any. This is still to be verified.
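
To illustrate the caching hypothesis above (it is only a hypothesis, not verified behavior of any Kubernetes component), here is a toy Python sketch. `CachingSigner` is a made-up class, and the HMAC is just a stand-in for real certificate signing. It shows that a signer which loads its key once at startup touches the key file a single time, no matter how many signatures it produces:

```python
import hashlib
import hmac
import os
import tempfile


class CachingSigner:
    """Toy stand-in for a signer that loads its key once at startup."""

    def __init__(self, key_path: str) -> None:
        self.file_reads = 0
        self._key = self._load(key_path)

    def _load(self, key_path: str) -> bytes:
        self.file_reads += 1  # the only file access a tool like Falco could see
        with open(key_path, "rb") as f:
            return f.read()

    def sign(self, payload: bytes) -> bytes:
        # Uses the in-memory key; the key file is not opened again.
        return hmac.new(self._key, payload, hashlib.sha256).digest()


# Write a dummy key to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"dummy-key-material")
    key_path = tmp.name

signer = CachingSigner(key_path)
for i in range(3):
    signer.sign(f"csr-{i}".encode())
os.unlink(key_path)

print(signer.file_reads)
```

Under this assumption, legitimate reads would show up only when the signing component (re)starts.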

vvondruska commented 1 month ago

Tested a few more things and still did not manage to trigger a legitimate access to the root CA key file. So, I think for now we can set up alerts based on the current Falco rule, which logs all access to the key file, and adjust them later in case we get false alarms. The alerting should only be active during business hours for some time to make sure that it works as expected.
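
The business-hours restriction could be expressed as a simple time-window check. A small Python sketch, assuming a Monday to Friday, 09:00 to 17:00 window (the actual hours and time zone are not specified in this thread, so these values are placeholders):

```python
from datetime import datetime

# Assumed window (hypothetical): Monday to Friday, 09:00 to 17:00 local time.
BUSINESS_DAYS = range(0, 5)  # Monday=0 ... Friday=4
START_HOUR, END_HOUR = 9, 17


def in_business_hours(ts: datetime) -> bool:
    """Return True if the timestamp falls inside the assumed alerting window."""
    return ts.weekday() in BUSINESS_DAYS and START_HOUR <= ts.hour < END_HOUR


print(in_business_hours(datetime(2024, 1, 8, 10, 30)))   # a Monday morning
print(in_business_hours(datetime(2024, 1, 13, 10, 30)))  # a Saturday morning
```

In practice this kind of gating is usually configured in the alert routing rather than in custom code; the function only shows the intended logic.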