open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io

TLS config in TA receiver doesn't take effect #33370

Open ItielOlenick opened 1 month ago

ItielOlenick commented 1 month ago

Component(s)

receiver/prometheus

What happened?

Description

After https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/31449 was addressed, the client is expected to use the TLS settings when communicating with the TA.

When TLS config is added to the config.yaml file, the client does not send the TLS certificates in its requests to the Target Allocator.

In practice, the TA server (or a mock of it) complains that the client does not provide a certificate in its requests.

Steps to Reproduce

Set up a TA (or a mock) with mTLS, forcing the client to provide certificates.

Set up the collector with the tls key in the config:

collector.yaml

receivers:
    prometheus:
        config: {}
        target_allocator:
            collector_id: ${POD_NAME}
            endpoint: https://tls-debug-service:443
            interval: 30s
            tls:
                ca_file: /tls/ca.crt
                cert_file: /tls/tls.crt
                key_file: /tls/tls.key

Expected Result

TLS certificates added to the request

Actual Result

Server complains that the client didn't provide a certificate

From the TA (or mock):

http: TLS handshake error from 10.42.0.201:40804: tls: client didn't provide a certificate

From the mock using openssl:

4007B2B0CD7F0000:error:0A0000C7:SSL routines:tls_process_client_certificate:peer did not return a certificate:../ssl/statem/statem_srvr.c:3510:

Collector version

0.101.0

Environment information

Environment

Kubernetes

OpenTelemetry Collector configuration

exporters:
    debug:
        verbosity: detailed
receivers:
    prometheus:
        config: {}
        target_allocator:
            collector_id: ${POD_NAME}
            endpoint: https://tls-debug-service:443
            interval: 30s
            tls:
                ca_file: /tls/ca.crt
                cert_file: /tls/tls.crt
                key_file: /tls/tls.key
service:
    pipelines:
        metrics:
            exporters:
                - debug
            processors: []
            receivers:
                - prometheus

Log output

From the Collector:

2024-06-04T13:18:58.967Z info service@v0.101.0/service.go:102 Setting up own telemetry...
2024-06-04T13:18:58.967Z info service@v0.101.0/telemetry.go:103 Serving metrics {"address": ":8888", "level": "Normal"}
2024-06-04T13:18:58.967Z info exporter@v0.101.0/exporter.go:275 Development component. May change in the future. {"kind": "exporter", "data_type": "metrics", "name": "debug"}
2024-06-04T13:18:58.968Z info service@v0.101.0/service.go:169 Starting otelcol... {"Version": "0.101.0", "NumCPU": 12}
2024-06-04T13:18:58.968Z info extensions/extensions.go:34 Starting extensions...
2024-06-04T13:18:58.969Z info prometheusreceiver@v0.101.0/metrics_receiver.go:275 Starting discovery manager {"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
2024-06-04T13:18:58.970Z info prometheusreceiver@v0.101.0/metrics_receiver.go:121 Starting target allocator discovery {"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
2024-06-04T13:18:58.980Z info prometheusreceiver@v0.101.0/metrics_receiver.go:253 Scrape job added {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "jobName": "serviceMonitor/default/nginx-servicemonitor/0"}
2024-06-04T13:18:58.980Z info service@v0.101.0/service.go:195 Everything is ready. Begin running and processing data.
2024-06-04T13:18:58.980Z warn localhostgate/featuregate.go:63 The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default. {"feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-06-04T13:18:58.980Z info prometheusreceiver@v0.101.0/metrics_receiver.go:340 Starting scrape manager {"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
2024-06-04T13:18:58.995Z error refresh/refresh.go:71 Unable to refresh target groups {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "discovery": "http", "config": "serviceMonitor/default/nginx-servicemonitor/0", "err": "Get \"https://tls-debug-service:443/jobs/serviceMonitor%2Fdefault%2Fnginx-servicemonitor%2F0/targets?collector_id=collector-with-ta-collector-0\": remote error: tls: certificate required"}
github.com/prometheus/prometheus/discovery/refresh.(*Discovery).Run

Additional context

No response

github-actions[bot] commented 1 month ago

Pinging code owners:

swiatekm commented 1 month ago

I honestly can't see how this can be an issue with the prometheus receiver. The use of confighttp in #31452 is incredibly basic, and there's even a test verifying that the TLS config does in fact get read. So it's either a bug in confighttp, or the test methodology is wrong somehow. @ItielOlenick are you using the standalone collector for your tests? Or the Operator?

ItielOlenick commented 1 month ago

@swiatekm-sumo I'm using both.

The TLS cert is in fact read from the config: when testing with a cert that doesn't exist, it complains about the file not existing. I think the setting is somehow ignored afterwards. Could this be due to the fact that it is used for servers, while in this case the TA is the server and the receiver is a client?

I've tested with a standalone TA, an Operator-managed one, and a mock server serving the same endpoints. In all cases I used the same certs (both server and client), and in all cases I am able to get a response using curl or openssl, both from within the cluster and via port forwarding.

I've tried everything I can think of.

swiatekm commented 1 month ago

It looks like it's loaded and added to the transport here: https://github.com/open-telemetry/opentelemetry-collector/blob/15faf6206bedc9d5a188f131fc5891904e29a34b/config/confighttp/confighttp.go#L122. Not sure what the problem could be. I'll try to do a minimal reproduction tomorrow using the otlp exporter.

ItielOlenick commented 1 month ago

It seems like the first call to /scrape_configs is OK:

info prometheusreceiver@v0.101.0/metrics_receiver.go:253 Scrape job added {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "jobName": "serviceMonitor/default/nginx-servicemonitor/0"}

but then the call to the /jobs endpoint is the issue:

error refresh/refresh.go:90 Unable to refresh target groups {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "discovery": "http", "config": "serviceMonitor/default/nginx-servicemonitor/0", "err": "Get \"https://tls-debug-service:443/jobs/serviceMonitor%2Fdefault%2Fnginx-servicemonitor%2F0/targets?collector_id=collector-with-ta-collector-0\": remote error: tls: certificate required"}
github.com/prometheus/prometheus/discovery/refresh.(*Discovery).Run
github.com/prometheus/prometheus@v0.51.2-0.20240405174432-b4a973753c6e/discovery/refresh/refresh.go:90
swiatekm commented 1 month ago

It seems like the first call to /scrape_configs is OK:

info prometheusreceiver@v0.101.0/metrics_receiver.go:253 Scrape job added {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "jobName": "serviceMonitor/default/nginx-servicemonitor/0"}

but then the call to the /jobs endpoint is the issue:

error refresh/refresh.go:90 Unable to refresh target groups {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "discovery": "http", "config": "serviceMonitor/default/nginx-servicemonitor/0", "err": "Get \"https://tls-debug-service:443/jobs/serviceMonitor%2Fdefault%2Fnginx-servicemonitor%2F0/targets?collector_id=collector-with-ta-collector-0\": remote error: tls: certificate required"}
github.com/prometheus/prometheus/discovery/refresh.(*Discovery).Run
github.com/prometheus/prometheus@v0.51.2-0.20240405174432-b4a973753c6e/discovery/refresh/refresh.go:90

That makes sense: the client config only applies to calls to /scrape_configs. The calls to /jobs are done by Prometheus' HTTP Service Discovery, and there we don't configure TLS.

In that case we have two simple solutions: either get targets over plain HTTP even when TLS is enabled, or configure the HTTP SD config to use TLS as well. The latter seems a bit hacky, in that we should probably make an effort to configure it the same way the confighttp client is configured, which may not be so obvious.
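
For illustration, the second option would roughly amount to the sketch below. Apart from the Prometheus packages, the function and parameter names are made up and are not the receiver's actual code.

package sketch

import (
	commonconfig "github.com/prometheus/common/config"
	promHTTP "github.com/prometheus/prometheus/discovery/http"
)

// buildSDConfig is an illustrative helper: it gives the Prometheus HTTP SD
// client the same TLS files that the target allocator client config specifies.
func buildSDConfig(jobsURL, caFile, certFile, keyFile string) *promHTTP.SDConfig {
	sd := promHTTP.DefaultSDConfig
	sd.URL = jobsURL // e.g. https://tls-debug-service:443/jobs/<job>/targets
	// Without this, the SD client keeps the default HTTP client config and never
	// presents a client certificate, which is what the TA rejects.
	sd.HTTPClientConfig.TLSConfig = commonconfig.TLSConfig{
		CAFile:   caFile,
		CertFile: certFile,
		KeyFile:  keyFile,
	}
	return &sd
}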

ItielOlenick commented 1 month ago

The first one sounds easier, but the second sounds more complete. Implementing the first suggestion would mean providing the regular non-secure endpoint as well, I assume. At any rate, now that we know the source of the issue, I'll move forward with https://github.com/open-telemetry/opentelemetry-operator/issues/1669

ItielOlenick commented 1 month ago

@swiatekm-sumo looking over this again, couldn't we add the TA client TLS config we provided to the httpSD?

httpSD.HTTPClientConfig.TLSConfig has a TLS config similar to what was implemented in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/31449

ItielOlenick commented 1 month ago

Tested with

httpSD.HTTPClientConfig.TLSConfig.CAFile = allocConf.TLSSetting.CAFile
httpSD.HTTPClientConfig.TLSConfig.CertFile = allocConf.TLSSetting.CertFile
httpSD.HTTPClientConfig.TLSConfig.KeyFile = allocConf.TLSSetting.KeyFile

Works.

swiatekm commented 1 month ago

Yeah, that does work. My point was more that if we start applying settings from confighttp to httpSD, then we should try to do it as comprehensively as possible, or at least document that only the TLS parts carry over.

ItielOlenick commented 3 weeks ago

Yeah, that does work. My point was more that if we start applying settings from confighttp to httpSD, then we should try to do it as comprehensively as possible, or at least document that only the TLS parts carry over.

I'd like to work on this so we can move forward with https://github.com/open-telemetry/opentelemetry-operator/pull/3015

swiatekm commented 3 weeks ago

@dashpole @Aneurysm9 are you ok with this approach?

dashpole commented 3 weeks ago

Can you just configure http_sd_config's TLS settings from the operator?

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config

ItielOlenick commented 3 weeks ago

@dashpole As in httpSD.HTTPClientConfig.TLSConfig = allocConf.TLSSetting?

If not, I'm not sure how you intend for the collector to read and apply this setting.

dashpole commented 3 weeks ago

Sorry, never mind. I agree with https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/33370#issuecomment-2149299108 that we should ideally translate as much as possible to httpSD.

ItielOlenick commented 3 weeks ago

Ok, makes sense. I'll draft something up.

ItielOlenick commented 3 weeks ago

Just to clarify, we are talking about using allocConf's HTTP-related config in the httpSD.HTTPClientConfig?

dashpole commented 3 weeks ago

yes.

ItielOlenick commented 3 weeks ago

It seems like the only relevant fields to translate from allocConf's config to httpSD are in fact TLSSetting and maybe ProxyURL.

Am I missing something?
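
Something along these lines is what I have in mind, building on the TLS assignment tested above. This is only a sketch; the helper name and the ProxyURL handling are illustrative, not final code.

package sketch

import (
	"net/url"

	commonconfig "github.com/prometheus/common/config"
	promHTTP "github.com/prometheus/prometheus/discovery/http"
)

// applyTAClientConfig is an illustrative helper that copies the two settings
// from the TA client config that seem relevant for HTTP SD: the TLS file
// paths and an optional proxy URL.
func applyTAClientConfig(httpSD *promHTTP.SDConfig, tlsConf commonconfig.TLSConfig, proxyURL string) error {
	httpSD.HTTPClientConfig.TLSConfig = tlsConf
	if proxyURL == "" {
		return nil
	}
	u, err := url.Parse(proxyURL)
	if err != nil {
		return err
	}
	httpSD.HTTPClientConfig.ProxyURL = commonconfig.URL{URL: u}
	return nil
}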