open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

feature gate - The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. #34394

Closed sairamsadanala closed 2 months ago

sairamsadanala commented 3 months ago

Component(s)

extension/healthcheck

What happened?

Description

We have built an abstraction layer with otelcol-contrib. It is an intermediate layer to which all our OTel Collectors send telemetry, and it exports that telemetry to Splunk and Grafana endpoints. The abstraction layer runs on an AWS ECS cluster that is load balanced via an AWS NLB. The setup is automated using an ADO pipeline with a CloudFormation template.

Steps to Reproduce

Attached the config and

Expected Result

Up to v0.100.0, the ECS cluster for our abstraction layer ran healthy and exported telemetry to the exporter endpoints.

Actual Result

With v0.106.1, the AWS NLB target group health checks fail on port 13133, which rolls back the CloudFormation template. It works as expected with v0.100.0.

Collector version

v0.106.1

Environment information

Environment

OS: Amazon Linux

OpenTelemetry Collector configuration

extensions:
  health_check:
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
  awsecscontainermetrics:
    collection_interval: 30s    
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      load:
        cpu_average: true
  hostmetrics/disk:
    collection_interval: 1m
    scrapers:
      disk:
      filesystem:
  splunk_hec/logs:
    endpoint: 0.0.0.0:8088
    access_token_passthrough: true
  splunk_hec/metrics:
    endpoint: 0.0.0.0:8087
    access_token_passthrough: true
processors:
  batch:
    send_batch_size: 12000
    timeout: 10s
    send_batch_max_size: 14000
  resourcedetection/general:
    detectors: [env,ecs,system,docker]
  attributes:
    actions:
      - action: insert
        key: loki.attribute.labels
        value: log.file.name
  resource:
    attributes:
      - action: insert
        key: loki.resource.labels
        value: cloud.account.id,cloud.availability_zone,cloud.platform,cloud.provider,cloud.region,host.id,host.name,host.type

exporters:
  splunk_hec/logs:
    endpoint: "https://XXXXXXXXX:xxxx/services/collector"
    token: "XXXXXX-XXXXXXXXXXXXXXXXXXxx"
    timeout: 30s
    index: devops
    sending_queue:
      enabled: true
      num_consumers: 60
      queue_size: 100000
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 60s
    tls:
      insecure_skip_verify: false
      ca_file: "XXXX.pem"
      cert_file: "XXXX.pem"
      key_file: "XXX.key"
  splunk_hec/metrics:
    endpoint: "https://XXXXX/services/collector/event"
    token: "XXXXXXXXXX-XXXX"
    timeout: 30s
    index: "tooling_metrics"
    sending_queue:
      enabled: true
      num_consumers: 60
      queue_size: 100000
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 60s
    tls:
      insecure_skip_verify: false
      ca_file: "XXX.pem"
      cert_file: "XXX.pem"
      key_file: "/etc/otel/splunk_dec.key"
  loki:
    endpoint: "https://XXXXX/loki/api/v1/push"
    tls:
        insecure: false
        insecure_skip_verify: true
    headers:
        "authorization": "Basic XXXXXXXXXXXXXXX="
  prometheusremotewrite:
    endpoint: https://XXXXX/mimir/api/v1/push
    tls:
        insecure: false
        insecure_skip_verify: true
    headers:
        "authorization": "Basic XXXXXXXXXXXXXXXXXXXXXXXX"
    external_labels:
        source: otalecsprd
    resource_to_telemetry_conversion: 
      enabled: true
  otlphttp:    
    endpoint: "https://XXXXX/tempo/otlp/"
    traces_endpoint: "https://xxxxxx/tempo/otlp/v1/traces"
    tls:      
        insecure: false
        insecure_skip_verify: true   
    headers:
        "authorization": "Basic XXXXXXXXXXXXXXXXXXXXXX"
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [XXX,XXX]
      processors: [batch,resourcedetection/general]
    metrics:
      receivers: [otlp,splunk_hec/incomingmetrics,awsecscontainermetrics]
      exporters: [splunk_hec/metrics,prometheusremotewrite]
      processors: [batch]
    metrics/internal:
      receivers: [hostmetrics,hostmetrics/disk]
      exporters: [splunk_hec/metrics,prometheusremotewrite]
      processors: [batch,resourcedetection/general]
    logs:
      receivers: [otlp]
      exporters: [splunk_hec/logs,loki]
      processors: [batch,attributes,resource]
    logs/splunk_hec:
      receivers: [splunk_hec/incominglogs]
      exporters: [splunk_hec/logs,loki]
      processors: [batch,attributes,resource]

Log output

XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    zpagesextension@v0.106.1/zpagesextension.go:76  Starting zPages extension       {"kind": "extension", "name": "zpages", "config": {"Endpoint":"0.0.0.0:55679","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0}}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    extensions/extensions.go:56     Extension started.      {"kind": "extension", "name": "zpages"}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    extensions/extensions.go:39     Extension is starting...        {"kind": "extension", "name": "pprof"}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    pprofextension@v0.106.1/pprofextension.go:60    Starting net/http/pprof server  {"kind": "extension", "name": "pprof", "config": {"TCPAddr":{"Endpoint":"0.0.0.0:1777","DialerConfig":{"Timeout":0}},"BlockProfileFraction":0,"MutexProfileFraction":0,"SaveToFile":""}}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    extensions/extensions.go:56     Extension started.      {"kind": "extension", "name": "pprof"}
XXXXXXXXXXXXXXXXXXXX:31:16.640Z        info    internal/resourcedetection.go:125       began detecting resource information    {"kind": "processor", "name": "resourcedetection/general", "pipeline": "traces"}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        warn    internal/resourcedetection.go:130       failed to detect resource       {"kind": "processor", "name": "resourcedetection/general", "pipeline": "traces", "error": "failed getting OS type: failed to fetch Docker OS type: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        info    internal/resourcedetection.go:139       detected resource information   {"kind": "processor", "name": "resourcedetection/general", "pipeline": "traces", "resource": {"aws.ecs.cluster.arn":"XXXXXXXXXXXXXXXXXXXX","aws.ecs.launchtype":"ec2","aws.ecs.task.arn":"arn:aws:ecs:XXXXXXXXXXXXXXXXXXXX","aws.ecs.task.family":"splunkhec-otelTDef","aws.ecs.task.id":"ee4c36a87b124aad9848542844239e77","aws.ecs.task.revision":"13","cloud.account.id":"XXXXXXXXXXXXXXXXXXXX","cloud.availability_zone":"eu-west-1b","cloud.platform":"aws_ecs","cloud.provider":"aws","cloud.region":"eu-west-1","host.name":"XXXXXXXXXXXXXXXXXXXX","os.type":"linux"}}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        info    otlpreceiver@v0.106.1/otlp.go:102       Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "localhost:4317"}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        info    otlpreceiver@v0.106.1/otlp.go:152       Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4318"}
XXXXXXXXXXXXXXXXXXXX:31:16.651Z        info    healthcheck/handler.go:132      Health Check state change       {"kind": "extension", "name": "health_check", "status": "ready"}
XXXXXXXXXXXXXXXXXXXX:31:16.651Z        info    service@v0.106.1/service.go:225 Everything is ready. Begin running and processing data.
XXXXXXXXXXXXXXXXXXXX:31:16.651Z        info    localhostgate/featuregate.go:63 The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.  {"feature gate ID":

Additional context

We would like to understand what this change means: "localhostgate/featuregate.go:63 The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default. {"feature gate ID": "component.UseLocalHostAsDefaultHost"}"

How do we disable this change of the default to localhost? Any documentation or steps would be highly appreciated.

crobert-1 commented 3 months ago

Hello @sairamsadanala, thanks for filing this issue. As the message states, if you prefer to keep the default endpoint as 0.0.0.0, you can disable the component.UseLocalHostAsDefaultHost feature gate. Information on how to disable feature gates can be found here.

For more information on the reasoning and context behind changing the default from 0.0.0.0 to localhost, please refer to this issue.

The best option is to update your configuration to work with an endpoint other than 0.0.0.0, as pointed out in the linked issue, since binding to 0.0.0.0 is a potential security risk.
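
As a minimal sketch of that last suggestion, the entries that relied on the old default can be given an explicit, non-wildcard address. In the configuration posted in this issue, `health_check` and the OTLP `grpc` protocol are the server entries without an endpoint. `COLLECTOR_BIND_IP` is a hypothetical environment variable that you would have to set yourself (for example, to the task's private IP); neither the Collector nor ECS provides it under that name.

```yaml
extensions:
  health_check:
    # Explicit endpoint, so the localhost default should not be applied.
    # COLLECTOR_BIND_IP is a hypothetical variable that you set yourself.
    endpoint: "${env:COLLECTOR_BIND_IP}:13133"
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "${env:COLLECTOR_BIND_IP}:4317"
```

The address has to be one that the load balancer's health checks can actually reach, otherwise the target group will keep failing in the same way as with localhost.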

sairamsadanala commented 3 months ago

Thanks Robert,

I am running otelcol-contrib on AWS ECS, building the image with Docker and pushing it to ECR. Can you give me an example of how to disable the localhost default in the Docker build?

ENTRYPOINT [ "/otelcol-contrib", "--config=/etc/otel/config.yaml", "--feature-gates=-<WHAT GATE NAME SHOULD I USE HERE>" ]

On Fri, Aug 2, 2024 at 12:50 PM Curtis Robert wrote:

Hello @sairamsadanala https://github.com/sairamsadanala, thanks for filing this issue. As the message states, if you prefer to keep the default endpoint as 0.0.0.0 you can disable the component.UseLocalHostAsDefaultHost feature gate. Information on how to disable feature gates can be found here https://github.com/open-telemetry/opentelemetry-collector/blob/main/featuregate/README.md.

For more information on the reasoning and context of this change, changing the default to be localhost instead of 0.0.0.0, please refer to this issue https://github.com/open-telemetry/opentelemetry-collector/issues/8510.

The best option is to be able to update your configuration to work with an endpoint other than 0.0.0.0 as pointed out in the linked issue, as it's a potential security risk.

crobert-1 commented 3 months ago

The feature gate name is component.UseLocalHostAsDefaultHost 👍

sairamsadanala commented 3 months ago

Is this what you are referring to?

ENTRYPOINT [ "/otelcol-contrib", "--config=/etc/otel/config.yaml", "--feature-gates=-component.UseLocalHostAsDefaultHost" ]


crobert-1 commented 3 months ago

Right, I believe that should work.
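
Since the deployment in this issue is driven by CloudFormation, the same flag could also be passed from the ECS task definition instead of being baked into the image. The fragment below is only a sketch: `CollectorTaskDefinition` and `otel-collector` are hypothetical names, the image URI is a placeholder, and it assumes the image's ENTRYPOINT is just the collector binary, so that the `Command` entries are appended to it as arguments.

```yaml
Resources:
  CollectorTaskDefinition:              # hypothetical logical ID
    Type: AWS::ECS::TaskDefinition
    Properties:
      ContainerDefinitions:
        - Name: otel-collector          # hypothetical container name
          Image: "<your ECR image URI>" # placeholder
          Command:
            - "--config=/etc/otel/config.yaml"
            # A leading "-" disables the named feature gate.
            - "--feature-gates=-component.UseLocalHostAsDefaultHost"
```

As the log message itself says, disabling the gate only reverts to the previous default temporarily, so this is a stopgap rather than a permanent fix.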

jpkrohling commented 3 months ago

The real solution though is to set your health check extension to use 0.0.0.0 (or NodeIP) instead:

  health_check:
    endpoint: 0.0.0.0:13133

WamBamBoozle commented 3 months ago

thanks @jpkrohling -- that was the answer that was eluding me

Except: the real solution is that that be the default, as I was using the default which was leading to this error

jpkrohling commented 3 months ago

the real solution is that that be the default

We consciously moved from the default "0.0.0.0" to "localhost".

Mathiasdm commented 2 months ago

Having 'localhost' as a default is sensible security-wise.

What I did not expect was that, even if I explicitly specify '0.0.0.0', it's still changed to localhost. I would expect this to only happen if I did not specify anything (hence 'default').

Example config:

receivers:
    otlp:
        protocols:
            grpc:
                endpoint: 0.0.0.0:4317
            http:
                endpoint: 0.0.0.0:4318

Wouldn't it make more sense to only change the endpoint to localhost in case of:

receivers:
    otlp:
        protocols:
            grpc:
            http:

jpkrohling commented 2 months ago

I agree with you, and I just tested on v0.108.0 and it works as expected:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

Logs:

2024-09-06T15:06:21.023+0200    info    otlpreceiver@v0.108.1/otlp.go:153       Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4318"}

Mathiasdm commented 2 months ago

Well, that's surprising. My previous test last week didn't seem to work, but it now works without changing the feature gate. I must have made a mistake last time.

I was also doing my tests on 0.108.0.

Please ignore my previous message.
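
Putting together the two observations in this thread (the v0.106.1 startup log in the original report and the later test with an explicit HTTP endpoint), the behaviour with the feature gate enabled can be sketched as follows. The comments describe only what the logs quoted in this thread show for the OTLP receiver; they are not a guarantee for every component.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        # No endpoint set: the default applies, logged as localhost:4317.
      http:
        # Explicit endpoint: used as written, logged as 0.0.0.0:4318.
        endpoint: 0.0.0.0:4318
```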

jpkrohling commented 2 months ago

I'm closing this issue for now, but please reopen it if we are still missing something.

TRAD-Anthony-CKO commented 2 months ago

@jpkrohling I think this indeed works for the OTLP endpoints, but not for the healthcheck extension (without disabling the feature gate). See the example below, tested on 109.0:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlp:
    endpoint: "${COLLECTOR_GATEWAY_ENDPOINT}"
    tls:
      insecure: true

processors:

extensions:
  health_check:
    endpoint: "0.0.0.0:13133"

service:
  extensions: [health_check]
  telemetry:
    logs:
      level: "debug"
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      exporters: [otlp]
    traces:
      receivers: [otlp]
      exporters: [otlp]

Collector logs:

2024-09-14T08:11:23.913Z        info    healthcheckextension@v0.106.1/healthcheckextension.go:32        Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-09-14T08:11:23.914Z        info    extensions/extensions.go:56     Extension started.      {"kind": "extension", "name": "health_check"}
2024-09-14T08:11:23.914Z        info    zapgrpc/zapgrpc.go:176  [core] [Server #1]Server created        {"grpc_log": true}
2024-09-14T08:11:23.914Z        info    otlpreceiver@v0.106.1/otlp.go:102       Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "logs", "endpoint": "0.0.0.0:55680"}
2024-09-14T08:11:23.914Z        info    otlpreceiver@v0.106.1/otlp.go:152       Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "logs", "endpoint": "0.0.0.0:55681"}
2024-09-14T08:11:23.914Z        info    healthcheck/handler.go:132      Health Check state change       {"kind": "extension", "name": "health_check", "status": "ready"}
2024-09-14T08:11:23.914Z        info    service@v0.106.1/service.go:225 Everything is ready. Begin running and processing data.
2024-09-14T08:11:23.914Z        info    localhostgate/featuregate.go:63 The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.     {"feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-09-14T08:11:23.914Z        info    zapgrpc/zapgrpc.go:176  [core] [Server #1 ListenSocket #2]ListenSocket created  {"grpc_log": true}

Since the feature gate is planned to be removed in future releases, I'm looking for a longer-term solution here. Edit: I raised a new issue in case this behavior is new.