👋 Since the default listen address is `127.0.0.1:12345`, it requires direct access to the machine where the agent is installed to be able to access the API/UIs; it isn't exposed over the network by default. Are you concerned about the case where an attacker has direct access to the machine?
I'm not completely opposed to this idea, but even Prometheus doesn't support basic auth directly, and I'm assuming their security recommendation is to rely on localhost and limit physical access to machines. (Edit: this is wrong; see below.)
So, from my perspective, the answer isn't "no" right now, but I want to understand more about what attack vectors we're trying to prevent, and if there's something else already planned that also takes care of those vectors.
An HTTPS listener for grafana-agent might also be a good solution to add full protection.
This is something available in static mode, but not Flow mode yet (#2715). Before we can add HTTPS support (or Basic Auth), grafana/agent#2984 is required as a prerequisite so we don't break wiring `prometheus.scrape` to any `prometheus.exporter` component.
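For reference, the kind of wiring in question looks roughly like the following; the component names and the remote-write endpoint are placeholders, not a recommendation:

```river
// Built-in exporter; its metrics are currently served through the agent's
// shared HTTP server, which is why the in-memory traffic work matters.
prometheus.exporter.windows "default" { }

// Scrape the exporter's targets and forward the samples downstream.
prometheus.scrape "windows" {
  targets    = prometheus.exporter.windows.default.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.com/api/v1/write"
  }
}
```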
I'm not completely opposed to this idea, but even Prometheus doesn't support basic auth directly, and I'm assuming their security recommendation is to rely on localhost and limit physical access to machines.
Sorry, it looks like I'm wrong about this; Prometheus does support basic auth. It was added in prometheus/prometheus#8316 and is documented on their website.
I'm leaning more towards this being OK functionality to support now, but grafana/agent#2984 is still a prerequisite and basic auth must be done in a way that doesn't prevent the in-memory scraping from working.
However, if TLS/mTLS is good enough, I'd be interested in implementing that first, since Flow needs it for feature parity with the scraping service.
I have some time on a flight so I'll see how much progress I can make on grafana/agent#2984 to unblock the other tasks.
👋 Since the default listen address is `127.0.0.1:12345`, it requires direct access to the machine where the agent is installed to be able to access the API/UIs; it isn't exposed over the network by default. Are you concerned about the case where an attacker has direct access to the machine?
I plan to have Grafana Agent installed on each machine to push metrics from that machine. Unlike Linux systems, Windows systems don't have network namespaces. `127.0.0.1:12345` is not exposed over the network, but it is available on every system.
In some of our setups, we run internet-facing applications on Windows. These systems are not directly internet-facing (load balancers and WAFs are in between), but in theory an attacker who exploits a vulnerability in one of the applications might also gain access to the agent.
On Kubernetes, I can mitigate such scenarios by applying Network Policies.
Prometheus, windows_exporter, node_exporter, and blackbox_exporter include https://github.com/prometheus/exporter-toolkit, which provides basic auth and TLS/mTLS for the exporters and for Prometheus itself (as you mentioned) to mitigate such scenarios. Prometheus supports scraping basic-auth-protected targets, but unlike the scheme or metrics path, there is no way to configure basic auth on a scrape target through relabeling (https://github.com/prometheus/prometheus/issues/2614).
mTLS is sufficient, since client certificates may take over the authentication part here. I have no idea how complicated it would be on the user side, though. Would certificates be read from the file system and passed to all `prometheus.scrape` components automatically?
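For illustration, here is a rough sketch of what the manual (non-automatic) version might look like on the scraping side today; the file paths and component names are made up, and this says nothing about how the agent's own server would present its certificate:

```river
prometheus.scrape "windows" {
  targets    = prometheus.exporter.windows.default.targets
  forward_to = [prometheus.remote_write.default.receiver]

  scheme = "https"

  // Client-side (m)TLS settings; without some form of automation,
  // every prometheus.scrape component would need to repeat this block.
  tls_config {
    ca_file   = "/etc/grafana-agent/ca.crt"
    cert_file = "/etc/grafana-agent/client.crt"
    key_file  = "/etc/grafana-agent/client.key"
  }
}
```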
After reading grafana/agent#2984 and grafana/agent#1509, can I assume that if in-memory scraping is available for Flow, there is no need to open port 12345 anymore? If so, that would also solve this issue.
I'm guessing that the embedded exporters and the API/UI share the same server at the moment?
After reading https://github.com/grafana/agent/issues/2984 and https://github.com/grafana/agent/pull/1509, can I assume that if in-memory scraping is available for Flow, there is no need to open port 12345 anymore? If so, that would also solve this issue.
Right, with grafana/agent#3602 it is hypothetically possible to disable the network stack and still have all of the components work as expected. However, you would lose access to the UI for debugging.
For people who wanted to retain the UI for debugging, we would want to support mTLS or basic auth as proposed here.
Is there something in the UI which is not available in the logs? Is it possible to make the UI available on a dedicated HTTP server? Then basic auth could be applied to the UI much more easily, since it would no longer affect the exporters.
Did you consider supporting a UNIX socket (as a configurable option) as an alternative to a TCP port?
You may not expect this, but Windows supports AF_UNIX and Go implements it, too. It's a Windows 10+ feature, but Go 1.21 requires Windows 10 anyway.
Is there something in the UI which is not available in the logs? Is it possible to make the UI available on a dedicated HTTP server? Then basic auth could be applied to the UI much more easily, since it would no longer affect the exporters.
Logs can definitely take you a long way, but:
Did you consider supporting a UNIX socket (as a configurable option) as an alternative to a TCP port?
Yes, but UNIX sockets still leave an attack vector for people with physical access to the machine. Also, Prometheus doesn't support scraping metrics over UNIX sockets, and they weren't willing to change their mind on that last time I asked.
I saw the issues around UNIX sockets in the Prometheus project. I thought that might no longer be an issue once grafana/agent#3602 is merged.
I understand that grafana/agent#3602 + mTLS might be the closest option to solving this.
Regarding the Flow agent, wouldn't it be desirable to completely disable the server in some configurations (e.g., https://github.com/grafana/agent/pull/3953)? Is there a build option to do this? Even if `builtinassets` is not in `GO_TAGS`, the server is still running.
I see that the server is useful for debugging; however, in production I perceive it as a security risk. The River configuration can include files and secrets that are not readable by everyone on the machine. Having the final configuration served in the UI exposes these in an unexpected manner to unprivileged processes run by arbitrary users.
Further, this does not even account for the fact that the agent runs as a privileged process. If there is a bug in the HTTP server, couldn't this lead to privilege escalation?
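To make the concern concrete, a configuration of roughly this shape (the file name and endpoint are made up) reads a credential that only the agent's user can access, and the evaluated configuration built from it is what the UI serves:

```river
// Credential stored in a file readable only by the agent's service account.
local.file "rw_password" {
  filename  = "/etc/grafana-agent/remote-write-password"
  is_secret = true
}

prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.com/api/v1/write"

    basic_auth {
      username = "agent"
      password = local.file.rw_password.content
    }
  }
}
```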
I see that the server is useful for debugging; however, in production I perceive it as a security risk.
I would agree with this. We are currently evaluating Grafana Agent on Virtual Desktop instances, where 127.0.0.1 is reachable by end users.
Now that `prometheus.exporter` components can properly function without round-tripping network access, I would be open to a PR to add a flag which disables the HTTP server.
However, disabling the HTTP server would also prevent clustering from working; turning off the HTTP server would only be desirable if agents are small.
Another option could be to have a flag to disable the UI and its API endpoints, so agents could still use clustering to distribute work.
Maybe flags for both?
In terms of protecting the server with basic auth (or TLS), I would want these settings to be configurable at runtime in the config file. The nature of Flow makes that a bit more difficult, but I have an incoming proposal in the next few weeks which would hopefully enable that use case.
Since we deploy one agent per machine, clustering is not the use case for us. Each agent runs on its own machine.
Another option could be to have a flag to disable the UI and its API endpoints, so agents could still use clustering to distribute work.
By API endpoints, did you also mean the metrics endpoints of the embedded exporters?
What about having a dedicated listener for clustering, which can be optionally enabled on demand? Looking at Alertmanager or Grafana itself, they always have a dedicated port for gossip discovery.
Use port 12345 for the API and UI only and keep the exporters on the in-memory listener exclusively.
By API endpoints, did you also mean the metrics endpoints of the embedded exporters?
I really just meant the API endpoints used for the UI. Disabling the metrics endpoints and the embedded exporter endpoints sounds more like just disabling the entire HTTP listener.
What about having a dedicated listener for clustering, which can be optionally enabled on demand? Looking at Alertmanager or Grafana itself, they always have a dedicated port for gossip discovery.
This is something we do in static mode, but we wanted to explore just using a single listener for all relevant traffic to reduce the total number of configuration parameters.
Maybe this is something we'll backpedal on in the future, but for now I'd like to consider it a feature that there's only one network address you need to expose for all core functionality (i.e., functionality that's not driven by a component).
Use port 12345 for the API and UI only and keep the exporters on the in-memory listener exclusively.
Hm, interesting. That's something we can explore, but I'm worried that might harm the debugging process, since it's really useful to be able to hit up the metrics endpoint for an exporter component to see what's going on.
Hi,
After working with grafana-agent for some time, I noticed that all exporter modules (at least in Flow mode) expose their metrics endpoints through the internal grafana-agent HTTP server.
I would like to ask if it's possible to protect the APIs (and UI) with basic auth. This would prevent unauthorized access from other contexts on the machine when grafana-agent is installed natively on a system. If I install grafana-agent on each machine, I'm worried that a potential pen test could flag this issue.
Basic auth over HTTP is insecure, but it is more secure than no authentication. An HTTPS listener for grafana-agent might also be a good solution to add full protection.