prometheus / prometheus

The Prometheus monitoring system and time series database.
https://prometheus.io/
Apache License 2.0

Prometheus Security Risk Report #11290

Closed XDTG closed 2 years ago

XDTG commented 2 years ago

The Prometheus metrics mechanism carries potential risks of information leakage and DoS attacks. Prometheus collects metrics from targets by scraping their HTTP metrics endpoints. However, these exposed HTTP endpoints open the targets themselves to information leakage and DoS attacks. Some specific cases are listed below:

  1. Calico: Calico maintains a calico-kube-controller pod on the Kubernetes control plane and exposes its metrics on TCP port 9094 through the Kubernetes Service calico-kube-controllers-metrics. By default, every container in the cluster can reach this Service. This exposes it to both information leakage and DoS threats.

    1. Information leakage: This service leaks sensitive information about the Kubernetes cluster. For example, the metrics returned by the calico-kube-controller include IPAM IP pool data, which contains Kubernetes node names and the number of IPs allocated to pods running on each node. As a result, a malicious container can infer the cluster topology from the node names and the number of pods running on each node.
    2. DoS threat: The calico-kube-controller runs on the Kubernetes control plane and has no Linux control group limits by default. A malicious container can send a large number of requests within several seconds, causing the calico-kube-controller to consume a large share of the control plane’s resources. In our experiment, a malicious container sent a large number of HTTP GET requests with curl, causing the calico-kube-controller to consume 80% of the control plane’s CPU resources.
    3. DoS threat: Even if the control plane uses Linux control groups to limit the physical resource consumption of the calico-kube-controller, a malicious container can fill up the control plane’s nf_conntrack table, causing the control plane node to drop packets randomly. The Linux kernel’s networking stack uses connection tracking to keep track of all logical network connections or flows, and the kernel maintains a table that records the details of each connection. The total number of tracked connections is limited. Even though the containers on the control plane node are in different Linux network namespaces, all of their connections count against init_net.ct.count in the init_net namespace of the control plane node. Therefore, anyone who can generate a large number of TCP connections on the control plane node in a short time can exhaust the init_net.ct.count quota, causing the control plane’s Netfilter to malfunction. In our experiment, a malicious container produced a large number of TCP connections and filled the control plane’s nf_conntrack table within a few minutes, causing the control plane node to drop packets randomly.
  2. CoreDNS: The CoreDNS service exposes its metrics on TCP port 9153, which all containers in the Kubernetes cluster can access. This exposes it to both information leakage and DoS threats.

    1. Information leakage: The CoreDNS metrics return sensitive information. A malicious container can contact the CoreDNS metrics service, obtain sensitive information, and use the leaked information to attack the CoreDNS service itself. For example, the CoreDNS metrics include coredns_forward_max_concurrent_rejects_total. By observing this value, a malicious user can push CoreDNS requests to the maximum of concurrent_queries, which affects the DNS queries of other containers in the cluster and results in DoS.
    2. DoS threat: If the CoreDNS pods run on the control plane node, a malicious container can send many requests to CoreDNS-SERVICE-IP:9153/metrics, causing the CoreDNS pods to consume a considerable share of the control plane’s resources.
  3. Node Exporter: Node Exporter exposes a node’s metrics on TCP port 9100. In a cluster with Prometheus installed, a node_exporter pod runs on each node and is exposed through a Kubernetes headless Service. Containers in the cluster can access the control plane node’s node_exporter. This exposes it to both information leakage and DoS threats.

    1. Information leakage: A malicious container can send a request to the node_exporter on the control plane node and obtain sensitive information about the control plane. For example, the node_exporter leaks the control plane’s nf_conntrack_count and nf_conntrack_max: nf_conntrack_count is the resource counter of the nf_conntrack table, and nf_conntrack_max is the table’s quota. A malicious container can leverage this information to fill up the nf_conntrack table of the control plane node, causing the node to drop packets randomly.
    2. DoS threat: A malicious container can send many requests, causing the node_exporter to consume a significant share of the control plane node’s resources. On our local testbed, a malicious container sending HTTP GET requests caused the node_exporter to consume 80% of the control plane node’s CPU resources.
  4. Grafana: Grafana exposes its metrics on TCP port 3000. This exposes it to a DoS threat.

    1. DoS threat: Similar to the attacks above, a malicious container can send multiple requests to Grafana-SERVICE-IP:3000/metrics and consume 79% of the CPU resources on our local testbed.
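
To make the Node Exporter information-leakage case concrete, here is a minimal Python sketch of how little work it takes to turn a scraped metrics page into a conntrack utilization figure. It rests on assumptions: the metric names `node_nf_conntrack_entries` and `node_nf_conntrack_entries_limit` are taken to be what node_exporter exposes for nf_conntrack_count and nf_conntrack_max, the sample values are made up, and the helper names are hypothetical. The same arithmetic can serve an operator who wants to alert on a filling table.

```python
# Illustrative sample in the Prometheus text exposition format.
# The values are invented; the metric names are an assumption about
# what node_exporter's conntrack collector exposes.
SAMPLE = """\
# HELP node_nf_conntrack_entries Number of currently allocated flow entries.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 1200
# HELP node_nf_conntrack_entries_limit Maximum size of connection tracking table.
# TYPE node_nf_conntrack_entries_limit gauge
node_nf_conntrack_entries_limit 262144
"""


def parse_unlabeled_samples(text):
    """Return {metric_name: value} for samples that carry no labels.

    This is a deliberately small parser, not a full implementation of
    the exposition format: comments, blank lines, labeled samples and
    unparsable values are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return samples


def conntrack_utilization(text):
    """Fraction of the conntrack table in use, between 0.0 and 1.0."""
    samples = parse_unlabeled_samples(text)
    used = samples["node_nf_conntrack_entries"]
    limit = samples["node_nf_conntrack_entries_limit"]
    return used / limit


if __name__ == "__main__":
    print(f"conntrack utilization: {conntrack_utilization(SAMPLE):.2%}")
```

Any client that can reach the endpoint can compute the same ratio, which is why the report treats these two values as sensitive.
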
beorn7 commented 2 years ago

First of all, please follow the Prometheus security policy (also linked from SECURITY.md) when reporting vulnerabilities. That's particularly important if you are reporting actual security issues.

Luckily, nothing you report above is surprising. This is all by design. The real problem here is rather that a number of users aren't aware of the implications of exposing metrics via an HTTP endpoint (and similarly of exposing the query HTTP API of a Prometheus server). I have filed https://github.com/prometheus/docs/issues/2201 about improving the documentation on this.
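
One implication of an openly reachable metrics endpoint, as the report above describes, is that any client can trigger repeated collection work. The following Python sketch shows a token-bucket limiter of the kind an operator might place in front of such an endpoint. It is an illustration under stated assumptions only: the `TokenBucket` class and its parameters are hypothetical, and nothing in this thread indicates that Prometheus or the exporters mentioned implement throttling this way.

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter for a metrics HTTP handler.

    `rate` is the number of requests refilled per second and `capacity`
    is the largest burst allowed. `clock` is injectable so the behavior
    can be checked without sleeping.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self):
        """Return True if one more request may be served right now."""
        now = self._clock()
        elapsed = max(0.0, now - self._last)
        self._last = now
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, capacity=3)
    # Five back-to-back requests: the burst beyond capacity is refused.
    print([bucket.allow() for _ in range(5)])
```

A handler would call `allow()` before rendering metrics and answer with an error status when it returns False. Throttling caps CPU spent on scrapes but does nothing about connection-table exhaustion or information leakage, which call for restricting who can reach the port in the first place.
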