Closed by XDTG 2 years ago
First of all, please follow the Prometheus security policy (also linked from SECURITY.md) when reporting vulnerabilities. That's particularly important if you are reporting actual security issues.
Luckily, nothing you report above is surprising. This is all by design. The real problem is rather that a number of users aren't aware of the implications of exposing metrics via an HTTP endpoint (and similarly of exposing the query HTTP API of a Prometheus server). I have filed https://github.com/prometheus/docs/issues/2201 about improving the documentation here.
The Prometheus metrics mechanism carries risks of information leakage and DoS attacks. Prometheus collects metrics by scraping HTTP endpoints exposed by its targets, and these exposed endpoints bring both risks to the targets themselves. Below I list some specific cases:
Calico: Calico maintains a `calico-kube-controller` pod on the Kubernetes control plane and exposes metrics on TCP port `9094` through the Kubernetes Service `calico-kube-controllers-metrics`. By default, every container in the cluster can reach this Service. It poses both information leakage and DoS threats.

The `calico-kube-controller` metrics include the `IPAM` IP pool messages. These messages include Kubernetes node names and the number of IPs allocated to pods running on each node. As a result, a malicious container can infer the cluster topology from the node names and the number of pods running on each node.

`calico-kube-controller` runs on the Kubernetes control plane and has no Linux control group limits by default. A malicious container can send a large number of requests within several seconds, making `calico-kube-controller` consume a large share of the control plane's resources. In our experiment, a malicious container sent a large number of HTTP `GET` requests with curl, making `calico-kube-controller` consume 80% of the entire control plane's CPU resources.

A malicious container can also fill up the control plane's `nf_conntrack` table, making the control plane node drop packets randomly. The Linux kernel's networking stack uses connection tracking to keep track of all logical network connections or flows, and the kernel maintains a table that records the details of each connection. The total number of connections is limited. Even though the containers on the control plane node are in different Linux network namespaces, all of their connections count against the `init_net.ct.count` of the node's `init_net` namespace. Therefore, if one can generate a large number of TCP connections on the control plane node in a short time, one can exhaust the `init_net.ct.count` quota, causing the control plane's Netfilter to malfunction. In our experiment, a malicious container produced a large number of TCP connections and filled the control plane's `nf_conntrack` table in a few minutes, which caused the control plane node to drop packets randomly.

CoreDNS: The CoreDNS service uses the
`9153` TCP port to expose service metrics, which all containers in the Kubernetes cluster can access. It poses both information leakage and DoS threats.

The metrics include `coredns_forward_max_concurrent_rejects_total`. By observing this information, malicious users can drive CoreDNS requests up to the `concurrent_queries` maximum, which affects the DNS queries of other containers in the cluster and results in DoS.

A malicious container can also send a large number of requests to `CoreDNS-SERVICE-IP:9153/metrics`, making the CoreDNS pods consume a considerable share of the control plane's resources.

Node Exporter: Node Exporter exposes a node's metrics on the
`9100` TCP port. In a cluster with Prometheus installed, a `node_exporter` pod exists on each node and exposes the Node Exporter service through a Kubernetes headless Service. The control plane node's `node_exporter` can be accessed by containers in the cluster. It poses both information leakage and DoS threats.

A malicious container can access `node_exporter` on the control plane node and get the control plane's sensitive information. For example, `node_exporter` leaks the control plane's `nf_conntrack_count` and `nf_conntrack_max`: `nf_conntrack_count` is the resource counter of the `nf_conntrack` table, and `nf_conntrack_max` is the quota of the `nf_conntrack` table. A malicious container can leverage this information to fill up the `nf_conntrack` table of the control plane node, so that the control plane node drops packets randomly.

A malicious container can also make `node_exporter` consume a significant share of the control plane node's resources. On our local testbed, a malicious container sent HTTP GET requests and `node_exporter` consumed 80% of the control plane node's CPU resources.

Grafana: Grafana uses the
`3000` TCP port to expose service metrics. It poses a DoS threat. A malicious container can send a large number of requests to `Grafana-SERVICE-IP:3000/metrics`, which consumed 79% of CPU resources on our local testbed.
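To make the information-leakage point concrete, here is a minimal sketch of how little work it takes to turn a scraped metrics page into conntrack headroom figures. It parses a text-format metrics payload offline and makes no network requests. Several things here are assumptions rather than part of the report: the helper names `parse_unlabeled_samples` and `conntrack_headroom` are hypothetical, the series names `node_nf_conntrack_entries` and `node_nf_conntrack_entries_limit` are assumed to be the Node Exporter series behind what the report calls `nf_conntrack_count` and `nf_conntrack_max`, and the sample values are made up.

```python
from typing import Dict

# Made-up sample payload in the Prometheus text exposition format.
# The two series names are an assumption about what node_exporter exports
# for the kernel's nf_conntrack_count / nf_conntrack_max values.
SAMPLE = """\
# HELP node_nf_conntrack_entries Number of currently allocated flow entries.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 1523
# HELP node_nf_conntrack_entries_limit Maximum size of connection tracking table.
# TYPE node_nf_conntrack_entries_limit gauge
node_nf_conntrack_entries_limit 262144
"""


def parse_unlabeled_samples(text: str) -> Dict[str, float]:
    """Collect `name value` samples, skipping comments and labeled series."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, HELP/TYPE comments, and series with label sets.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # ignore malformed values instead of failing
    return samples


def conntrack_headroom(text: str) -> Dict[str, float]:
    """Derive used/limit/free/utilization for the conntrack table."""
    samples = parse_unlabeled_samples(text)
    used = samples["node_nf_conntrack_entries"]
    limit = samples["node_nf_conntrack_entries_limit"]
    return {
        "used": used,
        "limit": limit,
        "free": limit - used,
        "utilization": used / limit,
    }


if __name__ == "__main__":
    info = conntrack_headroom(SAMPLE)
    print(f"conntrack entries free: {info['free']:.0f} "
          f"({info['utilization']:.2%} used)")
```

The same calculation is what an operator would use to alert on conntrack exhaustion, which is why the series is useful to expose to Prometheus itself and sensitive when any container in the cluster can read it.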