rvasahu-amazon opened this issue 1 month ago
I wanted to add some details about what I've done to look into this so far without cluttering the main body of the issue.
My current understanding is (and please correct me if I'm wrong):
… `ca.crt` at the same location as `token` doesn't work). … `kueue-controller-manager`, including the pod and metrics service.
On those first two points, I checked this by curling the metrics endpoint from within my cluster. This is relevant because if a cert works when I curl the endpoint manually, the Prometheus scraper should be able to use that same cert.
Skipping verification worked for viewing metrics, which is what's expected:
% curl -i https://kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local:8443/metrics -H "Authorization: Bearer $TOKEN" -k
# metrics outputted
However, verification failed when I supplied a CA certificate instead:
% curl -i https://kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local:8443/metrics -H "Authorization: Bearer $TOKEN" --cacert /path/to/some/cert.crt
curl: (60) SSL certificate problem: self-signed certificate in certificate chain
I tried both the cluster default `ca.crt` and the webhook service `.crt`; the TLS handshake failed in both cases.
For point 3, I then retrieved the full certificate chain from the server:
% openssl s_client -connect kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local:8443 -showcerts
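A minimal shell sketch of this step, assuming a POSIX `awk`: it splits the `-showcerts` output into one PEM file per certificate, so that a single cert (for example the first one, which by TLS convention is the server's own certificate) can be passed to `curl --cacert`. The function name `split_chain` and the `cert-N.pem` file names are illustrative only, not part of Kueue or OpenSSL.

```shell
# split_chain: read `openssl s_client -showcerts` output on stdin and write each
# PEM certificate it contains to cert-0.pem, cert-1.pem, ... in the current
# directory. Text outside BEGIN/END CERTIFICATE markers is ignored.
split_chain() {
  awk '
    /-----BEGIN CERTIFICATE-----/ { inblock = 1; out = sprintf("cert-%d.pem", n++) }
    inblock { print > out }
    /-----END CERTIFICATE-----/ { if (inblock) { inblock = 0; close(out) } }
  '
}

# Example usage (the </dev/null makes s_client exit after the handshake):
#   openssl s_client -connect kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local:8443 \
#     -showcerts </dev/null 2>/dev/null | split_chain
```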
I used the first cert in the chain to curl the endpoint. This time the TLS handshake succeeded, but there was a hostname mismatch:
...
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
...
* Server certificate:
* subject: CN=kueue-controller-manager-684c94f946-wt9gf@1727221200
* start date: Sep 24 22:39:59 2024 GMT
* expire date: Sep 24 22:39:59 2025 GMT
* subjectAltName does not match kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local
* SSL: no alternative certificate subject name matches target host name 'kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local'
* Closing connection
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (OUT), TLS alert, close notify (256):
curl: (60) SSL: no alternative certificate subject name matches target host name 'kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local'
...
I suspect there must be a CA bundle issued by the same CA, and that this bundle would contain a cert for `kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local`.
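The mismatch can be confirmed by listing the server certificate's subject alternative names and checking whether the service DNS name is among them. A minimal sketch, assuming OpenSSL 1.1.1 or later (for `openssl x509 -ext subjectAltName`) and a `cert-0.pem` file holding the server certificate; `san_has_dns` is a hypothetical helper and, as a simplification, does not expand wildcard entries.

```shell
# san_has_dns SAN_LIST HOST
# Succeeds if SAN_LIST (in the "DNS:a, DNS:b, IP Address:10.0.0.1" form that
# `openssl x509 -ext subjectAltName` prints) contains an exact DNS entry for
# HOST. Host names are compared case-insensitively; wildcards are NOT expanded.
san_has_dns() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -qixF "DNS:$2"
}

# Example usage (the first output line of -ext is a header, so skip it):
#   san=$(openssl x509 -in cert-0.pem -noout -ext subjectAltName | tail -n +2)
#   san_has_dns "$san" kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local \
#     || echo "service DNS name is not in the certificate's SANs"
```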
Thank you for the summary. It will be very useful for investigation.
I'm not yet familiar with this, and it might be challenging given we are two weeks from the planned 0.9 release, but maybe @tenzen-y or @alculquicondor already have some relevant knowledge here.
Also, as a pointer, you may want to check how Kueue is set up with Prometheus in the best-practices section of https://github.com/GoogleCloudPlatform/ai-on-gke, which is our go-to setup. It may solve the issue you mention, but I'm not sure. cc @mbobrovskyi
It looks like even the referenced project uses `insecureSkipVerify: true`, as per this.
I see there's an opt-in for cert-manager for the webhook server, but the visibility server seems to use only self-signed certs, with no configuration exposed. Is there any reason for the difference in approaches?
We have mainly tested self-signed certificates. The reason for this was simplicity of the deployment and lack of user demand. If you manage to get it working, we would be happy to review guides and changes, for example, to support cert-manager in the visibility API.
Hi @alculquicondor @mimowo,
Thanks for the information, this is much appreciated. Unfortunately we don't have bandwidth to take this up right now. We can try looking into this later, and we'll make sure to share updates when we do. In the meantime, we will go ahead with self-signed certs.
If it isn't too much trouble, I have one more question. How easy would it be for a third party to spoof a Kueue self-signed cert? Has any security testing been done along these lines?
Thanks once again.
We haven't conducted such testing.
Hi, hope you're well.
I'm trying to set up a Prometheus scraper to access the Kueue metrics endpoint.
Since this needs to be productionised, ideally we'd like `insecure_skip_verify` under `tls_config` to be false. I understand that the scraper would need a CA bundle corresponding to the CA that was used to create the self-signed cert for the TLS handshake. I couldn't find much Kueue documentation on this, so I'm having trouble determining how to find and use this cert.
I have a couple of questions:
… `kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local`?
I would appreciate your insight. Thanks in advance for your help.
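For reference, a Prometheus scrape job with verification left enabled would look roughly like the sketch below, written as a shell helper so the CA path and server name can be templated. `scheme`, `authorization.credentials_file`, `tls_config.ca_file`, `tls_config.server_name` and `tls_config.insecure_skip_verify` are standard Prometheus scrape-config fields; the helper name `write_kueue_scrape_config`, the CA file path and the `server_name` value are hypothetical placeholders. `server_name` sets the name the client verifies against the server certificate, so it would need to be a name that actually appears in that certificate's SANs. Judging by the CN in the curl log (pod name plus a timestamp), the serving cert appears to be regenerated per pod, so a pinned `ca_file` would likely need refreshing after each restart.

```shell
# write_kueue_scrape_config OUT_FILE CA_FILE SERVER_NAME
# Hypothetical helper: writes a Prometheus scrape_configs entry for the Kueue
# metrics service with TLS verification enabled (insecure_skip_verify: false).
# CA_FILE and SERVER_NAME are placeholders the operator must supply.
write_kueue_scrape_config() {
  cat > "$1" <<EOF
scrape_configs:
  - job_name: kueue-controller-manager
    scheme: https
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: $2
      server_name: $3
      insecure_skip_verify: false
    static_configs:
      - targets:
          - kueue-controller-manager-metrics-service.kueue-system.svc.cluster.local:8443
EOF
}
```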