kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Cannot find CA bundle to use for Prometheus scraper with TLS verification enabled. #3259

Open rvasahu-amazon opened 1 day ago

rvasahu-amazon commented 1 day ago

Hi, hope you're well.

I'm trying to set up a Prometheus scraper to access the Kueue metrics endpoint.

---
# create configmap for prometheus scrape config
apiVersion: v1
data:
  # prometheus config
  prometheus.yaml: |
    global:
      scrape_interval: 1m
      scrape_timeout: 10s
    scrape_configs:
    - job_name: 'kueue_metrics'
      scheme: https
      tls_config:
        insecure_skip_verify: false
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
...
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring-ns

Since this needs to be productionised, ideally we'd like insecure_skip_verify under tls_config to be false. I understand that the scraper would need a CA bundle corresponding to the CA that was used to create the self-signed cert presented during the TLS handshake. I can't find much Kueue documentation on this, so I'm having trouble determining how to find and use this cert.
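For illustration, a scrape config with verification enabled might look roughly like the sketch below. It assumes a CA bundle for the metrics endpoint can be obtained at all, which is exactly what this issue is asking about. The mount path and the server_name value are hypothetical; ca_file and server_name are standard Prometheus tls_config fields.

```yaml
scrape_configs:
- job_name: 'kueue_metrics'
  scheme: https
  tls_config:
    insecure_skip_verify: false
    # Hypothetical path: wherever the CA bundle ends up mounted in the Prometheus pod.
    ca_file: /etc/prometheus/kueue-metrics-ca/ca.crt
    # Name checked against the certificate's subjectAltName entries.
    # Assumption: the serving cert actually carries this name.
    server_name: kueue-controller-manager-metrics-service.kueue-system.svc
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
```

A trusted CA only covers chain validation. The name in server_name (or the scrape target's hostname, if server_name is unset) must still match a SAN on the serving certificate.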

I have a couple questions:

  1. Is there a CA bundle somewhere that would contain a cert for the metrics endpoint? I could then presumably mount and use the bundle. Alternatively, how should I get a cert for subject kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local?
  2. Is it possible that using TLS verification for accessing the metrics endpoint is not a supported use-case? For example, I can see this service monitor does not use TLS verification:
...
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    path: /metrics
    port: https
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
...
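By comparison, a ServiceMonitor endpoint with verification turned on could be sketched as follows. The Secret name and key are hypothetical, and the sketch assumes a CA for the metrics certificate is available to store there; ca and serverName are fields of the Prometheus Operator tlsConfig.

```yaml
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    path: /metrics
    port: https
    scheme: https
    tlsConfig:
      insecureSkipVerify: false
      ca:
        secret:
          name: kueue-metrics-ca   # hypothetical Secret holding the CA bundle
          key: ca.crt              # hypothetical key within that Secret
      # Assumption: the serving cert lists this name as a SAN.
      serverName: kueue-controller-manager-metrics-service.kueue-system.svc
```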

I would appreciate your insight. Thanks in advance for your help, much appreciated.

rvasahu-amazon commented 1 day ago

I wanted to add some details about what I've done to look into this so far without cluttering the main body of the issue.

My current understanding is (and please correct me if I'm wrong):

  1. The cluster default certificate authority is not used by Kueue (so ca.crt at the same location as token doesn't work).
  2. Kueue has an internal CA that is used for the webhook service, which can be disabled and replaced by an external one if need be. However, this is not the CA used for the Prometheus metrics server.
  3. Instead, there is a separate one used for kueue-controller-manager, including the pod and metrics service.

I checked the first two points by curling the metrics endpoint from within my cluster. This is relevant because if a cert works when I curl the endpoint manually, the Prometheus scraper should be able to use that same cert.

Skipping verification worked for viewing metrics, which is what's expected:

% curl -i https://kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local:8443/metrics -H "Authorization: Bearer $TOKEN" -k
# metrics outputted

However, the request failed when I supplied a CA cert for verification:

% curl -i https://kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local:8443/metrics -H "Authorization: Bearer $TOKEN" --cacert /path/to/some/cert.crt
curl: (60) SSL certificate problem: self-signed certificate in certificate chain

I tried both the cluster default ca.crt and the webhook service .crt; the TLS handshake failed in both cases.

For point 3, I then tried fetching the full certificate chain from the server:

% openssl s_client -connect kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local:8443 -showcerts

I used the first cert in the chain to try to curl the endpoint. This time the TLS handshake succeeded, but there was a hostname mismatch:

...
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
...
* Server certificate:
*  subject: CN=kueue-controller-manager-684c94f946-wt9gf@1727221200
*  start date: Sep 24 22:39:59 2024 GMT
*  expire date: Sep 24 22:39:59 2025 GMT
*  subjectAltName does not match kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local
* SSL: no alternative certificate subject name matches target host name 'kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local'
* Closing connection
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (OUT), TLS alert, close notify (256):
curl: (60) SSL: no alternative certificate subject name matches target host name 'kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local'
...

I surmise that there must be a CA bundle that was issued by the same CA, and this bundle would have a cert for kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local.

mimowo commented 1 day ago

Thank you for the summary. It will be very useful for investigation.

I'm not yet familiar with this, and it might be challenging given we are two weeks from the planned 0.9 release, but maybe @tenzen-y or @alculquicondor already have some relevant knowledge here.

Also, as a pointer, you may check how Kueue is set up with Prometheus in this project, which is our go-to setup: https://github.com/GoogleCloudPlatform/ai-on-gke, the best-practices section. Maybe it solves the issue you mention, but I'm not sure. cc @mbobrovskyi

sky333999 commented 1 day ago

Looks like even the referenced project uses insecureSkipVerify: true as per this.

I see there's an opt-in for cert-manager with the webhook server, but the visibility server seems to only use self-signed certs with no configuration exposed. Any reason for the difference in approaches?

alculquicondor commented 1 day ago

We have mainly tested self-signed certificates. The reason for this was simplicity of the deployment and lack of user demand. If you manage to get it working, we would be happy to review guides and changes, for example, to support cert-manager in the visibility API.