open-telemetry / opentelemetry-collector

OpenTelemetry Collector

Why is 0.0.0.0 the default host? #8510

Closed · yurishkuro closed this issue 4 months ago

yurishkuro commented 1 year ago

I am running OTLP receivers with default settings:

receivers:
  otlp:
    protocols:
      grpc:
      http:

The startup logs are littered with these warnings:

internal@v0.85.0/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks {"kind": "receiver", "name": "otlp", "data_type": "traces", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}

Why doesn't the default config for these receivers follow the mentioned best practices?

mx-psi commented 1 year ago

Previous discussion can be found in #6151. If I recall correctly, we ultimately decided in a SIG meeting around that time not to change the default, for backwards-compatibility reasons.

mx-psi commented 1 year ago

If we were to change it, what default value would you suggest? localhost? Another problem here is that there is no good default value that works in all cases. Specifically, containerized setups usually depend on the network configuration; we are able to provide a good default in the Helm chart but not in other cases.

yurishkuro commented 1 year ago

If we believe the default value is unsafe, it should be changed (despite backwards compatibility concerns). If it is safe (and appropriate in some scenarios) it should not log the warning.

The current situation is a bad user experience.

atoulme commented 1 year ago

I believe 127.0.0.1 should be the default. There is no good reason for warnings to show up with the default settings. We should have an info message indicating we're using 127.0.0.1. For our distros and Docker image, the default config should have a way to override the default by using an environment variable. That should be documented. The env var should be a well-known name that we standardize on.

mx-psi commented 1 year ago

the default config should have a way to override the default by using an environment variable

We don't support this (maybe we should), see #4384 for a related feature request.

mx-psi commented 1 year ago

Filed open-telemetry/opentelemetry-collector-releases#408 so we can address the Docker case independently from the general default.

A quick search on contrib reveals that at least the following components in that repository are affected:

But there may be others, e.g. the sapm receiver, which sets the endpoint to :7276, also valid syntax for an unspecified address.

Aneurysm9 commented 1 year ago

the default config should have a way to override the default by using an environment variable

We don't support this (maybe we should), see #4384 for a related feature request.

While I think that would be a nice addition to the env confmap provider, I don't think it's necessary (or sufficient) here. All default configuration can be overridden by explicit configuration, and all explicit configuration can be provided through the environment. Allowing fallbacks to be specified for the env provider would still require changes to the input configuration to indicate that those fields should come from that provider, and the fallback would have to be specified at the point of use.

Making some configuration fields take input from the environment in the absence of explicit configuration would be a much more significant change and one that I worry would make it even more difficult to reason about the effective configuration that would result from any given collector invocation.
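
For illustration, explicit configuration can already be fed from the environment through the env confmap provider. A minimal sketch, assuming a hypothetical OTLP_GRPC_ENDPOINT variable set by the operator (e.g. to 127.0.0.1:4317):

receivers:
  otlp:
    protocols:
      grpc:
        # ${env:VAR} is expanded by the env confmap provider at config resolution time;
        # OTLP_GRPC_ENDPOINT is a hypothetical variable name used only for this sketch
        endpoint: ${env:OTLP_GRPC_ENDPOINT}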

yurishkuro commented 1 year ago

I agree, a fallback env var seems like overkill, and a slippery slope. Hopefully most people manage configurations as code / programmatically, so they can always do this themselves by reusing some base var={desired-ip} in their configs.

This issue is about the warning: we can either change the default or remove the warning. I personally would be fine with just removing the warning, because 0.0.0.0 is a reasonable default in many cases (e.g. a :8080 address string in Go means 0.0.0.0:8080, and nobody bats an eye at that).
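
As a minimal standalone Go illustration of that point (not collector code), an address with an empty host binds to every interface:

package main

import (
	"fmt"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	// ":8080" leaves the host empty, so the listener accepts connections on all
	// interfaces (effectively 0.0.0.0:8080 and [::]:8080); "127.0.0.1:8080" would
	// restrict it to loopback.
	if err := http.ListenAndServe(":8080", nil); err != nil {
		fmt.Println(err)
	}
}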

mx-psi commented 1 year ago

For context (see the issue linked in my first message), the main rationale for the warning was that this is a known security weakness (CWE-1327) and that we had a specific vulnerability (https://github.com/advisories/GHSA-69cg-p879-7622) where the current default made the security implications worse than they could have been.

Since this is ultimately a security-related decision I would be interested in what the @open-telemetry/sig-security-maintainers think.

yurishkuro commented 1 year ago

https://github.com/advisories/GHSA-69cg-p879-7622 doesn't look like it's related at all: "HTTP/2 connection can hang during closing if shutdown were preempted by a fatal error"

CWE-1327 - just because it exists doesn't make it a valid one. Starting a server that does not require TLS is also not the best practice, but it's important for dev usability. 0.0.0.0 doesn't seem any different. Do we log a warning when starting servers with Insecure settings?

atoulme commented 1 year ago

CWE-1327 - just because it exists doesn't make it a valid one. Starting a server that does not require TLS is also not the best practice, but it's important for dev usability. 0.0.0.0 doesn't seem any different. Do we log a warning when starting servers with Insecure settings?

We could and should :)

mx-psi commented 1 year ago

https://github.com/advisories/GHSA-69cg-p879-7622 doesn't look like it's related at all: "HTTP/2 connection can hang during closing if shutdown were preempted by a fatal error"

This is described in more detail in #6151 and was discussed at the SIG meeting at the time. Copying from the issue:

While some level of impact was unavoidable on this receiver since exposing an HTTP/2 server is part of its core functionality, the OTLP receiver default endpoints (0.0.0.0:4317 and 0.0.0.0:4318) made it so that this vulnerability could be leveraged more easily than it could have been.

So the relationship is that, because of binding to all network interfaces, the vulnerability was easier to exploit.

CWE-1327 - just because it exists doesn't make it a valid one.

The fact that it is listed as a CWE should count as some evidence that this is important; it's fair to argue against it, but you should provide equivalent evidence against it.

Starting a server that does not require TLS is also not the best practice, but it's important for dev usability. 0.0.0.0 doesn't seem any different. Do we log a warning when starting servers with Insecure settings?

Actually, this is a good case for doing something about 0.0.0.0. We enable TLS by default (see here), so I don't think that's a good argument against nudging users to be secure by default when it comes to choosing which network interfaces to bind to.

The equivalent of the approach we took for TLS would be to change the default to localhost. Here, instead, we chose to warn people about it and to change the defaults in downstream tools (e.g. the Helm chart) because of backwards-compatibility concerns. The only difference is that we got to the TLS case sooner, when we had fewer users and were making breaking changes more frequently.
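
Whatever the default ends up being, the binding can already be pinned explicitly in user config; a minimal sketch (the endpoint values are illustrative):

receivers:
  otlp:
    protocols:
      grpc:
        # bind only to loopback; set 0.0.0.0:4317 instead to keep the all-interfaces behavior
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318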

jpkrohling commented 1 year ago

I think we have warned users for long enough already; we could indeed consider using localhost as the default host to bind the port to.

mx-psi commented 1 year ago

Inspired by https://github.com/open-telemetry/opentelemetry-collector/issues/7769#issuecomment-1601379447, I want to put what we should do here to a vote.

What should we do with this issue?

:rocket: Do nothing, leave the warning as is, don't change the default
:heart: Change default to localhost on all receivers and remove the warning
:tada: Change default to localhost on all receivers but don't remove the warning
:eyes: Remove the warning, make no changes to the defaults

Vote for all options that are acceptable to you

codeboten commented 1 year ago

@mx-psi another option would be to make this a feature-gate to give people a chance to change over more smoothly

mx-psi commented 1 year ago

@mx-psi another option would be to make this a feature-gate to give people a chance to change over more smoothly

I'd say we can use the poll to decide what the end result should be and we can discuss the process to get there once we know that. Does that sound okay?

hughesjj commented 1 year ago

Well, if we're voting, so far "change default to localhost" is winning, leaning towards additionally removing the warning.

mx-psi commented 1 year ago

Well, if we're voting, so far "change default to localhost" is winning, leaning towards additionally removing the warning.

Indeed. I think doing this with a feature gate, as @codeboten suggested, makes sense. There is one slight inconvenience: this feature gate must be shared across the core and contrib modules. One way to work around this is, on contrib, to VisitAll gates in the global registry until we find one with a matching ID(). Another way is to make it part of the public API of core. I will open a couple of PRs with the first approach by the end of the week.
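
A rough sketch of that first approach, assuming contrib looks the gate up by ID in the global registry (findGate is a hypothetical helper written for this sketch, not an existing API):

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/featuregate"
)

// findGate scans the global registry for a gate with the given ID, so a contrib
// module could reuse a gate registered by core without re-registering it.
func findGate(id string) *featuregate.Gate {
	var found *featuregate.Gate
	featuregate.GlobalRegistry().VisitAll(func(g *featuregate.Gate) {
		if g.ID() == id {
			found = g
		}
	})
	return found
}

func main() {
	// Nothing registers the gate in this standalone sketch, so findGate returns nil here.
	if g := findGate("component.UseLocalHostAsDefaultHost"); g != nil {
		fmt.Println(g.ID(), g.IsEnabled())
	}
}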

mx-psi commented 1 year ago

PTAL at #8622 as a first step; I am working on the contrib PR but would like to validate the design here first. I ended up exposing this as part of the public API, since otherwise it is hard to reason about initialization and registration order.

bboreham commented 9 months ago

Hi, just wanted to suggest that you add a warning somewhere in the docs about the case where localhost gets resolved via DNS to some real address and things go very weird. This has happened to me enough times that I reflexively type 127.0.0.1 instead, but I recognize that IPv6 is a thing.

mx-psi commented 9 months ago

@bboreham moved this to #9338, thanks for the suggestion :)

yurishkuro commented 4 months ago

Odd. v0.104.0 release notes say:

otlpreceiver: Switch to localhost as the default for all endpoints

But when I upgrade Jaeger to this version I am still getting warnings:

2024-07-01T19:38:52.041-0400    info    otlpreceiver@v0.104.0/otlp.go:102   Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4317"}
2024-07-01T19:38:52.041-0400    warn    internal@v0.104.0/warning.go:42 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning.    {"kind": "receiver", "name": "otlp", "data_type": "traces", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-07-01T19:38:52.041-0400    info    otlpreceiver@v0.104.0/otlp.go:152   Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4318"}

We use the default config:

receivers:
  otlp:
    protocols:
      grpc:
      http:

TylerHelmuth commented 4 months ago

I haven't been able to reproduce this yet; the otelcorecol build in the repo produces:

❯ ./bin/otelcorecol_darwin_arm64 --config ./bin/config.yaml
2024-07-01T19:58:27.751-0600    info    service@v0.104.0/service.go:115 Setting up own telemetry...
2024-07-01T19:58:27.751-0600    info    service@v0.104.0/telemetry.go:96        Serving metrics {"address": ":8888", "level": "Normal"}
2024-07-01T19:58:27.752-0600    info    exporter@v0.104.0/exporter.go:280       Development component. May change in the future.        {"kind": "exporter", "data_type": "traces", "name": "debug"}
2024-07-01T19:58:27.753-0600    info    service@v0.104.0/service.go:193 Starting otelcorecol... {"Version": "0.104.0-dev", "NumCPU": 10}
2024-07-01T19:58:27.753-0600    info    extensions/extensions.go:34     Starting extensions...
2024-07-01T19:58:27.753-0600    info    otlpreceiver@v0.104.0/otlp.go:102       Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "localhost:4317"}
2024-07-01T19:58:27.757-0600    info    otlpreceiver@v0.104.0/otlp.go:152       Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "localhost:4318"}
2024-07-01T19:58:27.757-0600    info    service@v0.104.0/service.go:219 Everything is ready. Begin running and processing data.

with this config

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  debug:

service:
  pipelines:
    traces:
      receivers:
        - otlp
      exporters:
        - debug

Also tested with otelcontribcol in contrib and the k8s distribution image.

All of those have main.go files that are generated by ocb, and Jaeger's isn't; I wonder if that is related to the issue.

yurishkuro commented 4 months ago

Is the feature gate UseLocalHostAsDefaultHost being set in those generated main.go files? Maybe we could set it manually in Jaeger. I am not too familiar with feature gates - can I override their defaults while still letting the user provide their own overrides?

mx-psi commented 4 months ago

It seems like the issue is specific to how Jaeger uses the Collector APIs, but I can't tell exactly what's going wrong.

@yurishkuro

The default value is set here: https://github.com/open-telemetry/opentelemetry-collector/blob/2c15bfafe191afb5a9f7c087061be546d4506324/internal/localhostgate/featuregate.go#L23

The override is passed here: https://github.com/open-telemetry/opentelemetry-collector/blob/2c15bfafe191afb5a9f7c087061be546d4506324/otelcol/command.go#L22

and eventually Cobra calls the Set method on the flag, which in turn sets the gate via the registry's Set method.

This should work the same on Jaeger (you are calling NewCommand here).

If you want to override the value in Jaeger using Go code, you can do featuregate.GlobalRegistry().Set(id, val). It also seems like the --feature-gates CLI flag should work (I haven't tested that).
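
For example, a minimal sketch of that programmatic override; it assumes the binary already imports the collector components that register the gate, since Set returns an error for unknown IDs:

package main

import (
	"log"

	"go.opentelemetry.io/collector/featuregate"
)

func main() {
	// Flip the gate before building and running the collector command, so that
	// default endpoints resolve to localhost instead of 0.0.0.0.
	if err := featuregate.GlobalRegistry().Set("component.UseLocalHostAsDefaultHost", true); err != nil {
		log.Fatal(err)
	}
	// ... then construct and execute the collector via otelcol.NewCommand as usual.
}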

yurishkuro commented 4 months ago

@mx-psi I figured it out - it turns out I was testing with v0.104 of the collector but still v0.103 of contrib. It also turns out the same feature gate is defined in both core and contrib, with the same ID, and whichever one wins depends on module loading order.

marcalff commented 4 months ago

The release notes for the collector only mention this issue, but not the PR number that changes the endpoint to localhost by default.

A couple of questions:

- could anyone link the associated PR?
- can this issue be closed then?

mx-psi commented 4 months ago

@yurishkuro Ah, right, that makes sense indeed. I would argue we don't support using misaligned versions of core and contrib, but for future feature gates it may make sense to do this differently.

@marcalff

could anyone link the associated PR ?

The PRs that should have closed this are #10352 and open-telemetry/opentelemetry-collector-contrib/pull/33658.

There is also a long tail of older PRs working on this, see #6267, #8622, #6959, all the PRs linked on open-telemetry/opentelemetry-collector-contrib#30702, and the updates to other projects, default configs or docs made by open-telemetry/opentelemetry.io/pull/3847, open-telemetry/opentelemetry-helm-charts/pull/1012 and open-telemetry/opentelemetry-collector-releases/pull/408.

can this issue be closed then ?

Yes! Closing :)