gawertm opened this issue 1 year ago.
Tested auditing of the content of /etc/kubernetes/pki with Falco 0.36.
It works, although the logs are only printed to the standard output of the Falco pods.
Giant Swarm offers Falco as an installable app, but the app ships Falco 0.35.x, which lacks the convenience selectors for defining rules that deal with the file system. According to the documentation, defining rules that audit file access is much more complicated before v0.36.
Next steps:
Falco app v0.7.0 can be used to audit access to the root CA private key on control planes, and Falco Exporter can expose the audit logs as metrics to Prometheus.
However, to achieve this, both Falco and Falco Exporter would have to be installed on all clusters and configured to run on control plane nodes.
Falco Exporter currently cannot run because of the security constraints defined in the Falco app.
Reached out to Team Shield with some questions related to the above: https://gigantic.slack.com/archives/C02FL4EAADD/p1701337202749149
Feedback from Team Shield: In general, it is OK to run Falco and Falco Exporter on control plane nodes and to use them to audit access to the root CA key. However, a few concerns were raised:
Next steps:
Use a fresh installation of Falco without the unsanitized default rules defined in the Falco GS app.
Performance:
Falco defines resource requests and limits: it requests 0.1 CPU and 512Mi of memory, with limits of 1 CPU and 1Gi of memory.
Falco runs one pod on each node in the cluster, so these requests and limits apply per node.
Falco Exporter does not define any resource requests or limits by default. However, its default values suggest requests of 0.1 CPU and 120Mi of memory.
During tests on testing WCs, the observed resource consumption was lower than both the requests and the limits.
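Because both components run on every node, the cluster-wide footprint scales with the node count. The following Python sketch does that arithmetic with the figures quoted above. It assumes that Falco Exporter, like Falco, runs one pod per node and that its suggested requests are actually applied; the node count in the usage line is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_cores: float
    memory_mib: int

    def times(self, n: int) -> "Resources":
        # Scale per-pod resources to n pods (rounded to avoid float noise).
        return Resources(round(self.cpu_cores * n, 3), self.memory_mib * n)

    def plus(self, other: "Resources") -> "Resources":
        # Sum the resources of two pods running on the same node.
        return Resources(
            round(self.cpu_cores + other.cpu_cores, 3),
            self.memory_mib + other.memory_mib,
        )


# Per-pod values quoted above (1Gi = 1024Mi).
FALCO_REQUEST = Resources(0.1, 512)
FALCO_LIMIT = Resources(1.0, 1024)
# Assumption: the exporter's suggested requests are enabled in the values.
EXPORTER_SUGGESTED_REQUEST = Resources(0.1, 120)


def cluster_requests(node_count: int, with_exporter: bool = True) -> Resources:
    # Assumption: one Falco pod (and one exporter pod) per node.
    per_node = FALCO_REQUEST
    if with_exporter:
        per_node = per_node.plus(EXPORTER_SUGGESTED_REQUEST)
    return per_node.times(node_count)
```

For a hypothetical cluster with 8 nodes, `print(cluster_requests(8))` prints `Resources(cpu_cores=1.6, memory_mib=5056)`, while `FALCO_LIMIT.times(8)` gives an upper bound of 8 CPU and 8192 MiB for Falco alone, since the exporter sets no limits by default.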
Successfully tested the use of Falco Exporter to expose logs to Prometheus.
Log messages from Falco appeared in Prometheus running on the MC and can be used to fire an alert.
However, many details from the log messages were omitted from the exposed Prometheus metrics. These details would be useful because they could be used to filter out legitimate uses of the root CA key and keep the alert from firing in such cases.
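To illustrate why the extra details matter, here is a minimal Python sketch that filters Falco events before alerting. It assumes Falco's JSON output (the `json_output` option), in which each event carries a `rule` name and an `output_fields` map. The rule name and the allowlist of processes are hypothetical placeholders, not the values used in falco-app. Such a filter depends on per-event fields like `proc.name`, which is the kind of detail that was omitted from the exposed metrics.

```python
import json
from typing import Iterable, Iterator

# Hypothetical rule name; the real one is whatever the Falco rule defines.
ROOT_CA_RULE = "Read root CA private key"

# Hypothetical allowlist of processes treated as legitimate readers of the key.
ALLOWED_PROCESSES = frozenset({"kube-controller-manager", "kubeadm"})


def suspicious_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield events from Falco JSON output (one event per line) worth alerting on."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("rule") != ROOT_CA_RULE:
            continue  # not the root CA key rule
        fields = event.get("output_fields") or {}
        if fields.get("proc.name") in ALLOWED_PROCESSES:
            continue  # known legitimate access, do not alert
        yield event
```

The design choice here is to drop known-good accesses as early as possible, so that only unexpected readers of the key reach the alerting stage.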
There may be other ways to expose Falco's logs with full details and to configure an alert based on them, for example the observability bundle and Loki. Team Atlas is currently working on a solution that will allow any app installed in the WCs to send its logs to Loki and make them available in our Grafana. Once this solution is ready, we can evaluate whether it is suitable for our use case.
Finished testing the initial rule definition.
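The matching logic of such a rule can be illustrated with a small Python model. This is not how Falco evaluates rules, only a sketch of a condition's semantics. The key path assumes the kubeadm default layout, and the condition in the comment is an example written in Falco's syntax, not the rule that was actually tested or merged.

```python
from dataclasses import dataclass

# Assumption: kubeadm default location of the root CA private key.
ROOT_CA_KEY = "/etc/kubernetes/pki/ca.key"

# Syscalls that open a file. The check below mirrors an example condition like:
#   evt.type in (open, openat, openat2) and fd.name = /etc/kubernetes/pki/ca.key
OPEN_SYSCALLS = frozenset({"open", "openat", "openat2"})


@dataclass(frozen=True)
class SyscallEvent:
    evt_type: str   # models Falco's evt.type field
    fd_name: str    # models Falco's fd.name field (the opened path)
    proc_name: str  # models Falco's proc.name field (the calling process)


def matches_root_ca_rule(event: SyscallEvent) -> bool:
    """Return True if the event opens the root CA private key."""
    return event.evt_type in OPEN_SYSCALLS and event.fd_name == ROOT_CA_KEY
```

For example, `matches_root_ca_rule(SyscallEvent("openat", ROOT_CA_KEY, "cat"))` is `True`, while opening `/etc/kubernetes/pki/ca.crt` does not match.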
Reached out to Team Shield to determine the best way to add the rule to the Falco app, and concluded that it should be defined in the default values of the app.
Opened a PR in falco-app that adds the rule to the default values: https://github.com/giantswarm/falco-app/pull/261
A new rule that enables Falco to log access to the root CA key file has been merged into the Falco app. Next steps:
As a short-term measure to mitigate the risk of the root CA private key being stored on control plane nodes of the workload clusters, we need to look into auditing: