open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

feature gate - The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. #34394

Closed sairamsadanala closed 3 days ago

sairamsadanala commented 1 month ago

Component(s)

extension/healthcheck

What happened?

Description

We have built an abstraction layer with otelcol-contrib: an intermediate layer to which all of our OpenTelemetry Collectors send telemetry, and which then exports it to Splunk and Grafana endpoints. The abstraction layer runs on an AWS ECS cluster that is load balanced by an AWS NLB. The setup is automated with an ADO pipeline and a CloudFormation template.

Steps to Reproduce

Attached the config and

Expected Result

Up until v0.100.0, our ECS cluster for the abstraction layer ran healthy and exported telemetry to the exporter endpoints.

Actual Result

With v0.106.1, the AWS NLB target group health checks fail on port 13133 and the CloudFormation template is rolled back. It works as expected with v0.100.0.

Collector version

v0.106.1

Environment information

Environment

OS: Amazon Linux

OpenTelemetry Collector configuration

extensions:
  health_check:
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
  awsecscontainermetrics:
    collection_interval: 30s    
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      load:
        cpu_average: true
  hostmetrics/disk:
    collection_interval: 1m
    scrapers:
      disk:
      filesystem:
  splunk_hec/logs:
    endpoint: 0.0.0.0:8088
    access_token_passthrough: true
  splunk_hec/metrics:
    endpoint: 0.0.0.0:8087
    access_token_passthrough: true
processors:
  batch:
    send_batch_size: 12000
    timeout: 10s
    send_batch_max_size: 14000
  resourcedetection/general:
    detectors: [env,ecs,system,docker]
  attributes:
    actions:
      - action: insert
        key: loki.attribute.labels
        value: log.file.name
  resource:
    attributes:
      - action: insert
        key: loki.resource.labels
        value: cloud.account.id,cloud.availability_zone,cloud.platform,cloud.provider,cloud.region,host.id,host.name,host.type

exporters:
  splunk_hec/logs:
    endpoint: "https://XXXXXXXXX:xxxx/services/collector"
    token: "XXXXXX-XXXXXXXXXXXXXXXXXXxx"
    timeout: 30s
    index: devops
    sending_queue:
      enabled: true
      num_consumers: 60
      queue_size: 100000
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 60s
    tls:
      insecure_skip_verify: false
      ca_file: "XXXX.pem"
      cert_file: "XXXX.pem"
      key_file: "XXX.key"
  splunk_hec/metrics:
    endpoint: "https://XXXXX/services/collector/event"
    token: "XXXXXXXXXX-XXXX"
    timeout: 30s
    index: "tooling_metrics"
    sending_queue:
      enabled: true
      num_consumers: 60
      queue_size: 100000
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 60s
    tls:
      insecure_skip_verify: false
      ca_file: "XXX.pem"
      cert_file: "XXX.pem"
      key_file: "/etc/otel/splunk_dec.key"
  loki:
    endpoint: "https://XXXXX/loki/api/v1/push"
    tls:
        insecure: false
        insecure_skip_verify: true
    headers:
        "authorization": "Basic XXXXXXXXXXXXXXX="
  prometheusremotewrite:
    endpoint: https://XXXXX/mimir/api/v1/push
    tls:
        insecure: false
        insecure_skip_verify: true
    headers:
        "authorization": "Basic XXXXXXXXXXXXXXXXXXXXXXXX"
    external_labels:
        source: otalecsprd
    resource_to_telemetry_conversion: 
      enabled: true
  otlphttp:    
    endpoint: "https://XXXXX/tempo/otlp/"
    traces_endpoint: "https://xxxxxx/tempo/otlp/v1/traces"
    tls:      
        insecure: false
        insecure_skip_verify: true   
    headers:
        "authorization": "Basic XXXXXXXXXXXXXXXXXXXXXX"
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [XXX,XXX]
      processors: [batch,resourcedetection/general]
    metrics:
      receivers: [otlp,splunk_hec/incomingmetrics,awsecscontainermetrics]
      exporters: [splunk_hec/metrics,prometheusremotewrite]
      processors: [batch]
    metrics/internal:
      receivers: [hostmetrics,hostmetrics/disk]
      exporters: [splunk_hec/metrics,prometheusremotewrite]
      processors: [batch,resourcedetection/general]
    logs:
      receivers: [otlp]
      exporters: [splunk_hec/logs,loki]
      processors: [batch,attributes,resource]
    logs/splunk_hec:
      receivers: [splunk_hec/incominglogs]
      exporters: [splunk_hec/logs,loki]
      processors: [batch,attributes,resource]

Log output

XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    zpagesextension@v0.106.1/zpagesextension.go:76  Starting zPages extension       {"kind": "extension", "name": "zpages", "config": {"Endpoint":"0.0.0.0:55679","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0}}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    extensions/extensions.go:56     Extension started.      {"kind": "extension", "name": "zpages"}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    extensions/extensions.go:39     Extension is starting...        {"kind": "extension", "name": "pprof"}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    pprofextension@v0.106.1/pprofextension.go:60    Starting net/http/pprof server  {"kind": "extension", "name": "pprof", "config": {"TCPAddr":{"Endpoint":"0.0.0.0:1777","DialerConfig":{"Timeout":0}},"BlockProfileFraction":0,"MutexProfileFraction":0,"SaveToFile":""}}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    extensions/extensions.go:56     Extension started.      {"kind": "extension", "name": "pprof"}
XXXXXXXXXXXXXXXXXXXX:31:16.640Z        info    internal/resourcedetection.go:125       began detecting resource information    {"kind": "processor", "name": "resourcedetection/general", "pipeline": "traces"}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        warn    internal/resourcedetection.go:130       failed to detect resource       {"kind": "processor", "name": "resourcedetection/general", "pipeline": "traces", "error": "failed getting OS type: failed to fetch Docker OS type: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        info    internal/resourcedetection.go:139       detected resource information   {"kind": "processor", "name": "resourcedetection/general", "pipeline": "traces", "resource": {"aws.ecs.cluster.arn":"XXXXXXXXXXXXXXXXXXXX","aws.ecs.launchtype":"ec2","aws.ecs.task.arn":"arn:aws:ecs:XXXXXXXXXXXXXXXXXXXX","aws.ecs.task.family":"splunkhec-otelTDef","aws.ecs.task.id":"ee4c36a87b124aad9848542844239e77","aws.ecs.task.revision":"13","cloud.account.id":"XXXXXXXXXXXXXXXXXXXX","cloud.availability_zone":"eu-west-1b","cloud.platform":"aws_ecs","cloud.provider":"aws","cloud.region":"eu-west-1","host.name":"XXXXXXXXXXXXXXXXXXXX","os.type":"linux"}}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        info    otlpreceiver@v0.106.1/otlp.go:102       Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "localhost:4317"}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        info    otlpreceiver@v0.106.1/otlp.go:152       Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4318"}
XXXXXXXXXXXXXXXXXXXX:31:16.651Z        info    healthcheck/handler.go:132      Health Check state change       {"kind": "extension", "name": "health_check", "status": "ready"}
XXXXXXXXXXXXXXXXXXXX:31:16.651Z        info    service@v0.106.1/service.go:225 Everything is ready. Begin running and processing data.
XXXXXXXXXXXXXXXXXXXX:31:16.651Z        info    localhostgate/featuregate.go:63 The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.  {"feature gate ID":

Additional context

We would like to understand what this change means: "localhostgate/featuregate.go:63 The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default. {"feature gate ID": "component.UseLocalHostAsDefaultHost"}"

How do we revert the default so that it no longer uses localhost? Any documentation or steps would be highly appreciated.

github-actions[bot] commented 1 month ago

Pinging code owners:

crobert-1 commented 1 month ago

Hello @sairamsadanala, thanks for filing this issue. As the message states, if you prefer to keep the default endpoint as 0.0.0.0, you can disable the component.UseLocalHostAsDefaultHost feature gate. Information on how to disable feature gates can be found here: https://github.com/open-telemetry/opentelemetry-collector/blob/main/featuregate/README.md

For the reasoning and context behind changing the default from 0.0.0.0 to localhost, please refer to this issue: https://github.com/open-telemetry/opentelemetry-collector/issues/8510

The best option is to update your configuration to work with an endpoint other than 0.0.0.0, as pointed out in the linked issue, since binding to 0.0.0.0 is a potential security risk.
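
One way to follow this advice is to bind the health check extension to a specific interface address rather than to all interfaces. The fragment below is only a sketch: HOST_IP is a hypothetical environment variable (neither the Collector nor ECS is known to set it under that name) that would have to be populated with the address the NLB targets. The `${env:...}` form is the Collector's environment variable expansion syntax.

```yaml
extensions:
  health_check:
    # HOST_IP is a hypothetical variable holding the address the load
    # balancer health check connects to; it must be set by the deployment.
    endpoint: "${env:HOST_IP}:13133"
```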

sairamsadanala commented 1 month ago

Thanks Robert,

I am running otelcol-contrib on AWS ECS, building the image with Docker and pushing it to ECR. Can you give me an example of how to disable the localhost default in the Docker build?

ENTRYPOINT [ "/otelcol-contrib", "--config=/etc/otel/config.yaml", "--feature-gates=-<WHAT GATE NAME SHOULD I USE HERE>" ]

crobert-1 commented 1 month ago

The feature gate name is component.UseLocalHostAsDefaultHost 👍

sairamsadanala commented 1 month ago

Is this what you are referring to?

ENTRYPOINT [ "/otelcol-contrib", "--config=/etc/otel/config.yaml", "--feature-gates=-component.UseLocalHostAsDefaultHost" ]

crobert-1 commented 1 month ago

Right, I believe that should work.
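
For an ECS deployment managed through CloudFormation, the same flag could also be passed from the task definition instead of being baked into the image. This is a rough sketch under assumptions not stated in this thread: the resource name, container name, and image URI are placeholders, other required properties are omitted, and it assumes the image's ENTRYPOINT is only the collector binary so that Command supplies the arguments. The log message describes disabling the gate as a temporary revert, so it is probably not something to rely on long term.

```yaml
Resources:
  CollectorTaskDefinition:              # hypothetical resource name
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: splunkhec-otelTDef        # task family seen in the log output
      ContainerDefinitions:
        - Name: otelcol                 # hypothetical container name
          Image: "<your ECR image URI>" # placeholder, not from the thread
          # Assumes ENTRYPOINT is ["/otelcol-contrib"]; these become its arguments.
          Command:
            - "--config=/etc/otel/config.yaml"
            - "--feature-gates=-component.UseLocalHostAsDefaultHost"
```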

jpkrohling commented 1 month ago

The real solution though is to set your health check extension to use 0.0.0.0 (or NodeIP) instead:

  health_check:
    endpoint: 0.0.0.0:13133

WamBamBoozle commented 2 weeks ago

thanks @jpkrohling -- that was the answer that was eluding me

Except: the real solution is that that be the default, as I was using the default which was leading to this error

jpkrohling commented 2 weeks ago

"the real solution is that that be the default"

We consciously moved from the default "0.0.0.0" to "localhost".
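
In the configuration posted at the top of this issue, two listeners carry no explicit endpoint and therefore pick up the new default: health_check (port 13133) and the OTLP gRPC receiver, which the log output shows starting on localhost:4317. A sketch of that part of the config with both set explicitly, assuming they really do need to be reachable from outside the container:

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133     # was unset, so it fell back to the localhost default
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317  # was unset; the log shows localhost:4317
      http:
        endpoint: 0.0.0.0:4318  # already explicit in the posted config
```

Explicitly configured endpoints are left alone by the feature gate, as the HTTP receiver in the posted log output illustrates.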

Mathiasdm commented 1 week ago

Having 'localhost' as a default is sensible security-wise.

What I did not expect was that, even if I explicitly specify '0.0.0.0', it's still changed to localhost. I would expect this to only happen if I did not specify anything (hence 'default').

Example config:

receivers:
    otlp:
        protocols:
            grpc:
                endpoint: 0.0.0.0:4317
            http:
                endpoint: 0.0.0.0:4318

Wouldn't it make more sense to only change the endpoint to localhost in case of:

receivers:
    otlp:
        protocols:
            grpc:
            http:

jpkrohling commented 6 days ago

I agree with you, and I just tested on v0.108.0 and it works as expected:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

Logs:

2024-09-06T15:06:21.023+0200    info    otlpreceiver@v0.108.1/otlp.go:153       Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4318"}

Mathiasdm commented 3 days ago

Well, that's surprising, my previous test last week didn't seem to work, but it does work without adapting the feature gate now. I must have made a mistake last time.

I was also doing my tests on 0.108.0.

Please ignore my previous message.

jpkrohling commented 3 days ago

I'm closing this issue for now, but please reopen it if we are still missing something.