Docker images tagged 0.99.0 onward don't support the use of a web proxy in an air-gapped environment
We have an OTEL collector running in a Docker Swarm cluster (image: otel/opentelemetry-collector-contrib:0.112.0) in an internal network that does not allow direct connectivity to the public internet (air-gapped environment).
The internal DNS used by the Docker host does not resolve external names; for security reasons, public name resolution is delegated to the company web proxy, and we must go through that device to reach the internet for HTTP/HTTPS connections.
In our OTEL collector configuration we have an exporter that points to a public SaaS provider (Coralogix - eu2.coralogix.com) with authentication via bearer token.
As explained in the documentation, we set the HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables to instruct the collector to use the company web proxy, but this does not work as expected in versions after 0.98.0.
With debug logging enabled in the collector configuration, we can see that the data arrives at the collector correctly, but the collector fails to export it.
The collector does not delegate resolution of public names to the web proxy, and even when we let the container resolve the name itself, the exporter still fails.
After various troubleshooting attempts, we found that older versions do support proxying and the connection to the SaaS endpoint (the latest version that still works is otel/opentelemetry-collector-contrib:0.98.0).
Steps to reproduce
1) Deploy the collector in an internal network that does not allow outgoing traffic to backends on the public internet.
2) Have an internal web proxy that can connect to the internet.
3) Use a Docker image of the OTEL collector from tag 0.99.0 onwards.
4) Configure the proxy via the HTTP_PROXY, HTTPS_PROXY (and optionally NO_PROXY) environment variables, pointing at the web proxy from point 2), when the container starts (see the sketch after this list).
5) Export the data to an external network (external service).
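For point 4), the variables are plain environment variables passed to the container at startup. A minimal sketch of that fragment of the stack file (the proxy host and port are placeholders, not our real proxy):

environment:
  # hypothetical internal proxy address - replace with the company web proxy
  - HTTP_PROXY=http://proxy.internal.example:3128
  - HTTPS_PROXY=http://proxy.internal.example:3128
  # keep internal destinations off the proxy
  - NO_PROXY=localhost,127.0.0.1,.internal.example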
What did you see instead?
We tested different situations; these are the outcomes:
Version 0.99.0 or newer, web proxy configured (via ENV variables), and the container is NOT able to resolve the exporter endpoint (DNS does not resolve public names): "name resolver error: produced zero addresses"
pickfirst/pickfirst.go:122 [pick-first-lb] [pick-first-lb 0xc003a92ab0] Received error from the name resolver: produced zero addresses {"grpc_log": true}
grpc@v1.67.1/clientconn.go:544 [core] [Channel #2]Channel Connectivity change to TRANSIENT_FAILURE {"grpc_log": true}
grpcsync/callback_serializer.go:94 [core] error from balancer.UpdateClientConnState: bad resolver state {"grpc_log": true}
internal/retry_sender.go:126 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "traces", "name": "*****", "error": "rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "interval": "5.121492409s"}
Version 0.99.0 or newer, web proxy configured (via ENV variables), and the container IS able to resolve the exporter endpoint: "authentication handshake failed: EOF"
So, watching the container logs, we found two different types of errors with the newer versions.
Version 0.98.0 or older, web proxy configured (via ENV variables), and the container is NOT able to resolve the exporter endpoint (DNS does not resolve public names): EVERYTHING WORKS AS EXPECTED - "Channel Connectivity change to READY"
2024-11-05T13:27:03.310Z info zapgrpc/zapgrpc.go:176 [core] [Channel #1]Resolver state updated: {
"Addresses": [
{
"Addr": "ingress.eu2.coralogix.com:443",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "ingress.eu2.coralogix.com:443",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} (resolver returned new addresses) {"grpc_log": true}
info zapgrpc/zapgrpc.go:176 [core] [Channel #1 SubChannel #2]Subchannel created {"grpc_log": true}
info zapgrpc/zapgrpc.go:176 [core] [Channel #1]Channel Connectivity change to CONNECTING {"grpc_log": true}
info zapgrpc/zapgrpc.go:176 [core] [Channel #1]Channel exiting idle mode {"grpc_log": true}
info zapgrpc/zapgrpc.go:176 [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING {"grpc_log": true}
info zapgrpc/zapgrpc.go:176 [core] [Channel #1 SubChannel #2]Subchannel picks a new address "ingress.eu2.coralogix.com:443" to connect {"grpc_log": true}
info zapgrpc/zapgrpc.go:176 [core] [pick-first-lb 0xc002a1fc50] Received SubConn state update: 0xc002a1fce0, {ConnectivityState:CONNECTING ConnectionError:<nil>} {"grpc_log": true}
info prometheusreceiver@v0.98.0/metrics_receiver.go:272 Starting discovery manager {"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
info prometheusreceiver@v0.98.0/metrics_receiver.go:250 Scrape job added {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "jobName": "docker_hosts_metrics"}
debug discovery/manager.go:286 Starting provider {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "provider": "static/0", "subs": "map[docker_hosts_metrics:{}]"}
info service@v0.98.0/service.go:169 Everything is ready. Begin running and processing data.
debug discovery/manager.go:320 Discoverer channel closed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "provider": "static/0"}
info prometheusreceiver@v0.98.0/metrics_receiver.go:326 Starting scrape manager {"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
warn localhostgate/featuregate.go:63 The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default. {"feature gate ID": "component.UseLocalHostAsDefaultHost"}
info zapgrpc/zapgrpc.go:176 [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to READY {"grpc_log": true}
info zapgrpc/zapgrpc.go:176 [core] [pick-first-lb 0xc002a1fc50] Received SubConn state update: 0xc002a1fce0, {ConnectivityState:READY ConnectionError:<nil>} {"grpc_log": true}
info zapgrpc/zapgrpc.go:176 [core] [Channel #1]Channel Connectivity change to READY {"grpc_log": true}
What version did you use?
TAG: 0.99.0 onwards (included) generates errors.
TAG: 0.98.0 and prior works as expected.
What config did you use?
Docker compose yaml content:
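A reduced sketch of the kind of stack file we deploy (service name, proxy host and file paths are placeholders, not the real values):

version: "3.8"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.112.0
    environment:
      # hypothetical internal proxy address - the company web proxy goes here
      HTTP_PROXY: "http://proxy.internal.example:3128"
      HTTPS_PROXY: "http://proxy.internal.example:3128"
      NO_PROXY: "localhost,127.0.0.1,.internal.example"
    configs:
      - source: otelcol_config
        # default config path of the contrib image
        target: /etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver

configs:
  otelcol_config:
    file: ./config.yaml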
config.yaml content:
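And a reduced sketch of the relevant parts of the collector configuration, assuming the contrib coralogix exporter (the token, application and subsystem names are redacted placeholders; the prometheus scrape job is the one visible in the 0.98.0 logs above):

receivers:
  otlp:
    protocols:
      grpc:
      http:
  prometheus:
    config:
      scrape_configs:
        - job_name: docker_hosts_metrics
          static_configs:
            # internal scrape targets omitted
            - targets: ["<internal-host>:9100"]

exporters:
  coralogix:
    # this is where ingress.eu2.coralogix.com:443 in the logs above comes from
    domain: "eu2.coralogix.com"
    private_key: "<redacted bearer token>"
    application_name: "<application>"
    subsystem_name: "<subsystem>"

service:
  telemetry:
    logs:
      # debug mode mentioned above
      level: debug
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [coralogix]
    metrics:
      receivers: [otlp, prometheus]
      exporters: [coralogix]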
Additional context
As you can see, something changed starting from version 0.99.0. First of all, when an HTTP/HTTPS proxy is configured, an HTTP client should delegate name resolution to the web proxy instead of asking the DNS server directly. Older versions do this correctly; newer versions do not.
Even when we use a newer version and let the container resolve the public name of the Coralogix endpoint, the logs show a public IP address in the "Addr": field instead of the DNS name used by the older versions of the collector. I think we get the "authentication handshake failed: EOF" error because the client checks the TLS certificate presented by the public server against the IP address it is connecting to, and it does not match; this would not happen with the real endpoint name, which is certainly among the subject alternative names of the certificate.
Searching the issues, I found this: https://github.com/open-telemetry/opentelemetry-collector/issues/10814#issue-2451161832 - it is not the same issue, even though it is also related to the TLS connection of the exporter, but they solved it by using an older version.