open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Docker images tagged 0.99.0 and onward don't fully support the usage of a web proxy in an air-gapped environment #11601

Open whysi opened 2 weeks ago

whysi commented 2 weeks ago


We have an OTEL collector running in a Docker Swarm cluster (image: otel/opentelemetry-collector-contrib:0.112.0) on an internal network that does not allow direct connectivity to the public internet (air-gapped environment). The internal DNS used by the Docker host does not resolve external names; for security reasons, public name resolution is delegated to the company web proxy, and all outbound HTTP/HTTPS connections must go through that device. In our OTEL collector configuration we have an exporter that points to a public SaaS provider (Coralogix - eu2.coralogix.com) with authentication via bearer token.

As explained in the documentation, we set the HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables to instruct the collector to use the company web proxy, but this no longer works as expected in versions after 0.98.0. By starting the collector in debug mode (in the configuration) we noticed that the data arrives at the collector correctly, but the collector fails to export. The collector does not use the web proxy to resolve public names, and even if we let the container resolve the name itself, the exporter still fails. After various troubleshooting attempts we found that older versions (the latest that still works is otel/opentelemetry-collector-contrib:0.98.0) handle the proxy and the connection to the SaaS endpoint correctly.
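
For reference, this is the behaviour we expect (a minimal Go sketch, not collector code; the endpoint name is the one from our exporter config, and the assumption is that the collector's clients rely on the standard proxy-from-environment mechanism): the standard library selects the proxy from HTTP_PROXY/HTTPS_PROXY/NO_PROXY, and the proxy then resolves the public name on the client's behalf.

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Build a request to the exporter endpoint and ask the standard library
	// which proxy (if any) the environment variables select for it.
	req, err := http.NewRequest("GET", "https://ingress.eu2.coralogix.com:443", nil)
	if err != nil {
		panic(err)
	}
	proxyURL, err := http.ProxyFromEnvironment(req)
	if err != nil {
		panic(err)
	}
	// A non-nil URL means the connection should go through the proxy,
	// which then resolves the public name instead of the local DNS.
	fmt.Println("proxy selected:", proxyURL)
}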

Steps to reproduce

1) Deploy the collector in an internal network that does not allow outgoing traffic to backends on the public internet.
2) Have an internal web proxy that can connect to the internet.
3) Use a Docker image of the OTEL collector from tag 0.99.0 onwards.
4) Configure the proxy via the HTTP_PROXY, HTTPS_PROXY (and optionally NO_PROXY) environment variables, pointing at the web proxy from point 2), when the container starts.
5) Export the data to an external service on the public network.

What did you see instead?

We tested different situations; these are the outcomes.

Version 0.99.0 or newer, web proxy configured (via the ENV variables), container NOT able to resolve the exporter endpoint (DNS does not resolve public names): "name resolver error: produced zero addresses"

pickfirst/pickfirst.go:122  [pick-first-lb] [pick-first-lb 0xc003a92ab0] Received error from the name resolver: produced zero addresses {"grpc_log": true}
grpc@v1.67.1/clientconn.go:544  [core] [Channel #2]Channel Connectivity change to TRANSIENT_FAILURE {"grpc_log": true}
grpcsync/callback_serializer.go:94  [core] error from balancer.UpdateClientConnState: bad resolver state    {"grpc_log": true}
internal/retry_sender.go:126    Exporting failed. Will retry the request after interval.    {"kind": "exporter", "data_type": "traces", "name": "*****", "error": "rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "interval": "5.121492409s"}
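
For what it's worth, this resolver error matches what a plain in-container lookup returns (a minimal sketch; the assumption here is that the gRPC "dns" resolver performs essentially this kind of local lookup instead of leaving resolution to the proxy):

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// Inside the air-gapped container the internal DNS cannot resolve the
	// public name, so this returns no addresses - the same condition the
	// "produced zero addresses" error above reports.
	addrs, err := net.DefaultResolver.LookupHost(ctx, "ingress.eu2.coralogix.com")
	fmt.Println("addresses:", addrs, "error:", err)
}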

Version 0.99.0 or newer, web proxy configured (via the ENV variables), container ABLE to resolve the exporter endpoint: "authentication handshake failed: EOF"

grpc@v1.67.1/resolver_wrapper.go:200    [core] [Channel #1]Resolver state updated: {
  "Addresses": [
    {
      "Addr": "16.170.111.131:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "16.170.111.131:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} ()    {"grpc_log": true}

...
...
...

grpc@v1.67.1/clientconn.go:544  [core] [Channel #1]Channel Connectivity change to CONNECTING    {"grpc_log": true}
grpc@v1.67.1/clientconn.go:1199 [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING   {"grpc_log": true}
grpc@v1.67.1/clientconn.go:1317 [core] [Channel #1 SubChannel #2]Subchannel picks a new address "16.170.111.131:443" to connect {"grpc_log": true}
pickfirst/pickfirst.go:176  [pick-first-lb] [pick-first-lb 0xc0030a0990] Received SubConn state update: 0xc0030a0a20, {ConnectivityState:CONNECTING ConnectionError:<nil> connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}    {"grpc_log": true}
grpc@v1.67.1/clientconn.go:1319 [core] Creating new client transport to "{Addr: \"16.170.111.131:443\", ServerName: \"ingress.eu2.coralogix.com:443\", }": connection error: desc = "transport: authentication handshake failed: EOF"   {"grpc_log": true}
grpc@v1.67.1/clientconn.go:1379 [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "16.170.111.131:443", ServerName: "ingress.eu2.coralogix.com:443", }. Err: connection error: desc = "transport: authentication handshake failed: EOF"   {"grpc_log": true}
grpc@v1.67.1/clientconn.go:1201 [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: authentication handshake failed: EOF"    {"grpc_log": true}
pickfirst/pickfirst.go:176  [pick-first-lb] [pick-first-lb 0xc0030a0990] Received SubConn state update: 0xc0030a0a20, {ConnectivityState:TRANSIENT_FAILURE ConnectionError:connection error: desc = "transport: authentication handshake failed: EOF" connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}    {"grpc_log": true}
grpc@v1.67.1/clientconn.go:544  [core] [Channel #1]Channel Connectivity change to TRANSIENT_FAILURE {"grpc_log": true}
internal/retry_sender.go:126    Exporting failed. Will retry the request after interval.    {"kind": "exporter", "data_type": "metrics", "name": "coralogix", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: EOF\"", "interval": "6.314044291s"}
grpc@v1.67.1/clientconn.go:1201 [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: authentication handshake failed: EOF" {"grpc_log": true}
pickfirst/pickfirst.go:176  [pick-first-lb] [pick-first-lb 0xc0030a0990] Received SubConn state update: 0xc0030a0a20, {ConnectivityState:IDLE ConnectionError:connection error: desc = "transport: authentication handshake failed: EOF" connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}} {"grpc_log": true}

Watching the container logs, I found two different types of errors.

Version 0.98.0 or older, web proxy configured (via the ENV variables), container NOT able to resolve the exporter endpoint (DNS does not resolve public names): EVERYTHING WORKS AS EXPECTED - "Channel Connectivity change to READY"

2024-11-05T13:27:03.310Z    info    zapgrpc/zapgrpc.go:176  [core] [Channel #1]Resolver state updated: {
  "Addresses": [
    {
      "Addr": "ingress.eu2.coralogix.com:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "ingress.eu2.coralogix.com:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses) {"grpc_log": true}

info    zapgrpc/zapgrpc.go:176  [core] [Channel #1 SubChannel #2]Subchannel created {"grpc_log": true}
info    zapgrpc/zapgrpc.go:176  [core] [Channel #1]Channel Connectivity change to CONNECTING    {"grpc_log": true}
info    zapgrpc/zapgrpc.go:176  [core] [Channel #1]Channel exiting idle mode    {"grpc_log": true}
info    zapgrpc/zapgrpc.go:176  [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING   {"grpc_log": true}
info    zapgrpc/zapgrpc.go:176  [core] [Channel #1 SubChannel #2]Subchannel picks a new address "ingress.eu2.coralogix.com:443" to connect  {"grpc_log": true}
info    zapgrpc/zapgrpc.go:176  [core] [pick-first-lb 0xc002a1fc50] Received SubConn state update: 0xc002a1fce0, {ConnectivityState:CONNECTING ConnectionError:<nil>}   {"grpc_log": true}
info    prometheusreceiver@v0.98.0/metrics_receiver.go:272  Starting discovery manager  {"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
info    prometheusreceiver@v0.98.0/metrics_receiver.go:250  Scrape job added    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "jobName": "docker_hosts_metrics"}
debug   discovery/manager.go:286    Starting provider   {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "provider": "static/0", "subs": "map[docker_hosts_metrics:{}]"}
info    service@v0.98.0/service.go:169  Everything is ready. Begin running and processing data.
debug   discovery/manager.go:320    Discoverer channel closed   {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "provider": "static/0"}
info    prometheusreceiver@v0.98.0/metrics_receiver.go:326  Starting scrape manager {"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
warn    localhostgate/featuregate.go:63 The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.   {"feature gate ID": "component.UseLocalHostAsDefaultHost"}
info    zapgrpc/zapgrpc.go:176  [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to READY    {"grpc_log": true}
info    zapgrpc/zapgrpc.go:176  [core] [pick-first-lb 0xc002a1fc50] Received SubConn state update: 0xc002a1fce0, {ConnectivityState:READY ConnectionError:<nil>}    {"grpc_log": true}
info    zapgrpc/zapgrpc.go:176  [core] [Channel #1]Channel Connectivity change to READY {"grpc_log": true}

What version did you use?

Tags from 0.99.0 onwards (included) generate errors. Tags 0.98.0 and prior work as expected.

What config did you use?

Docker compose yaml content:

version: "3.9"
services:
  otel:
    image: otel/opentelemetry-collector-contrib:0.99.0
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    configs:
      - source: config.yaml
        target: /etc/otelcol-contrib/config.yaml
        mode: 0744         
    networks:
      - private
    environment:      
      HTTP_PROXY: "MY_HTTP_PROXY:PORT"
      HTTPS_PROXY: "MY_HTTPS_PROXY:PORT"      
      NO_PROXY: "my.domain,localhost,127.0.0.1"            
    ports:
      - 4317:4317 
      - 4318:4318 
    deploy:
      mode: replicated
      replicas: 1    
configs:
  config.yaml:
    external: true
networks:
  private:
    driver: overlay
    attachable: true

config.yaml content:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  otlp/http:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:              
        - job_name: cadvisor_metrics
          scrape_interval: 1m
          metrics_path: /metrics
          static_configs:
            - targets:
              - 'mydockerhost.my.domain:8080'
exporters:
  coralogix:
    domain: "eu2.coralogix.com"
    private_key: "My_PRIVATE_KEY"
    application_name: "MY_APP_NAME"
    subsystem_name: "MY_SUBSYSTEM_NAME"
    timeout: 30s
service:
  pipelines:
    metrics:
      receivers: [ prometheus ]
      exporters: [ coralogix ]

Additional context

As you can see, something changed starting with version 0.99.0. First of all, when an HTTP/HTTPS proxy is in use, an HTTP client should delegate name resolution to the web proxy instead of querying the DNS server itself. Older versions do this correctly; newer versions do not.
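
A minimal sketch of what the change might look like (this is an assumption on my part, not a confirmed root cause, and the grpc.NewClient calls below are only illustrative of resolver schemes, not necessarily what the collector does): with the gRPC "passthrough" target scheme the literal host:port is handed to the dialer, so a proxied CONNECT can carry the hostname and the proxy resolves it; with the "dns" scheme the client resolves the name itself before dialing.

package main

import (
	"crypto/tls"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	creds := grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{}))

	// "dns" scheme: the client resolves the name locally before dialing.
	// In an air-gapped network this is where "produced zero addresses" comes from.
	ccDNS, err := grpc.NewClient("dns:///ingress.eu2.coralogix.com:443", creds)
	fmt.Println("dns target created:", ccDNS != nil, "err:", err)

	// "passthrough" scheme: the literal host:port reaches the dialer, so a
	// CONNECT sent to the web proxy carries the hostname and the proxy
	// performs the public name resolution.
	ccPT, err := grpc.NewClient("passthrough:///ingress.eu2.coralogix.com:443", creds)
	fmt.Println("passthrough target created:", ccPT != nil, "err:", err)
}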

Even when using a newer version and making the container able to resolve the public name of the Coralogix endpoint, we can see that it uses a public IP address in the "Addr": section of the logs, instead of the DNS name used by the older versions of the collector. I think we get the "authentication handshake failed: EOF" error because the HTTP client checks the TLS certificate presented by the public server and it does not match the IP address used for the connection; this would not happen if the real endpoint name were used, since that name is certainly among the subject alternative names of the certificate.
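
A minimal sketch to test this theory from a host that can reach the endpoint directly (note that crypto/tls does not use the HTTP(S)_PROXY variables; the IP is the one that appeared in our logs and may change over time):

package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	// Dial the resolved IP but verify the certificate against the hostname,
	// which is how SNI-based verification is supposed to work.
	conf := &tls.Config{ServerName: "ingress.eu2.coralogix.com"}
	conn, err := tls.Dial("tcp", "16.170.111.131:443", conf)
	if err != nil {
		fmt.Println("handshake failed:", err)
		return
	}
	defer conn.Close()
	state := conn.ConnectionState()
	fmt.Println("handshake OK, certificate subject:", state.PeerCertificates[0].Subject)
}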

Searching the issues, I found this: https://github.com/open-telemetry/opentelemetry-collector/issues/10814#issue-2451161832 It is not the same issue, even though it is also related to the TLS connection of the exporter, but they too worked around it by using an older version.

FabioSirugo commented 1 week ago

Is there any news on this issue?