Rotfuks closed this issue 2 months ago
@Rotfuks the cloud controller manager is a component that is deployed very early in the process, probably before the ServiceMonitor CRD is present. Should we deploy it as a new app, similar to what's done with cilium-servicemonitors-app? Or will the CRDs be present so we can deploy it with the app?
I believe someone else from @giantswarm/team-atlas might be of more help when it comes to that question :)
It seems that this app is required in the bootstrap of a cluster, so unless you can create the cluster and have it depend on the prometheus-operator-crds, you should consider making it another app, yes :(
@giantswarm/team-atlas how were these components being monitored up until now? I can't find a metrics endpoint for them.
From what I see, these weren't monitored at all:
No results, whereas:
kube-system aws-cloud-controller-manager-fvrn4 1/1 Running 1 (42d ago) 42d
kube-system aws-cloud-controller-manager-fxrbp 1/1 Running 1 (42d ago) 42d
kube-system aws-cloud-controller-manager-wd8sm 1/1 Running 1 (42d ago) 42d
Could it be because bind-address is not set as an argument, so it can only be scraped from localhost?
These apps are not exposing any metrics port AFAIK. So I'm not sure what needs to be done.
Then I would say that the first thing to do would be to expose the metrics endpoints in those apps.
I don't think they have one, that's what I meant.
I'm confused because the main.go file of the aws cloud controller manager does register metrics 🤔 I can check in 2 weeks :)
Yes, it does but I don't see any http handler to expose those metrics.
I decided to investigate aws-cloud-controller-manager in more detail.
Reading the source code, I could see how the controller collects metrics: https://github.com/search?q=repo%3Akubernetes%2Fcloud-provider-aws+%2Fcloudprovider_aws%2F&type=code
But in the code I couldn't see any HTTP endpoint being exposed for the metrics. I searched GitHub and forums, and asked on the Kubernetes Slack, but I did not get any answers.
I checked the parameters that we currently pass to the aws-cloud-controller-manager container, and one of them is --secure-port=10267. So I port-forwarded using that port and hit the /metrics endpoint.
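The port-forward and request can be sketched like this (the pod name is an example; the self-signed cert handling with -k is an assumption):

```shell
# Pick any one of the aws-cloud-controller-manager pods (name is an example).
kubectl -n kube-system port-forward pod/aws-cloud-controller-manager-fvrn4 10267:10267 &

# The secure port serves HTTPS with a self-signed certificate, hence -k.
curl -k https://localhost:10267/metrics
```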
Internal Server Error: "/metrics": subjectaccessreviews.authorization.k8s.io is forbidden: User "system:serviceaccount:kube-system:aws-cloud-controller-manager" cannot create resource "subjectaccessreviews" in API group "authorization.k8s.io" at the cluster scope
Apparently SubjectAccessReview is a Kubernetes resource used to determine whether a given user or service account has permission to perform a specific action on a resource. So I tried updating the ClusterRole for aws-cloud-controller-manager, adding the SAR resource, and hitting the endpoint again.
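The added rule looked roughly like this (a sketch; the surrounding ClusterRole layout is assumed):

```yaml
- apiGroups: ["authorization.k8s.io"]
  resources: ["subjectaccessreviews"]
  verbs: ["create"]
```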
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
"reason": "Forbidden",
"details": {},
"code": 403
}
I need to pass a ServiceAccount token. So I created a new token for the aws-cloud-controller-manager SA with kubectl create token aws-cloud-controller-manager -n kube-system --duration 10m, used it in my request, and got:
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
"reason": "Forbidden",
"details": {},
"code": 403
}
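The token-based request can be sketched as follows (assumes the earlier port-forward to 10267 is still active):

```shell
# Mint a short-lived token for the controller's ServiceAccount.
TOKEN=$(kubectl -n kube-system create token aws-cloud-controller-manager --duration 10m)

# Pass it as a bearer token; -k skips verification of the self-signed cert.
curl -k -H "Authorization: Bearer ${TOKEN}" https://localhost:10267/metrics
```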
Let's check the logs in the controller
E0808 14:51:04.851806 1 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, tokenreviews.authentication.k8s.io is forbidden: User \"system:serviceaccount:kube-system:aws-cloud-controller-manager\" cannot create resource \"tokenreviews\" in API group \"authentication.k8s.io\" at the cluster scope]"
The tokenreviews API is used by Kubernetes components to verify the validity of bearer tokens, which are often used for authentication. So I added that to the aws-cloud-controller-manager ClusterRole and tried again.
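That rule looked roughly like this (a sketch; the surrounding ClusterRole layout is assumed):

```yaml
- apiGroups: ["authentication.k8s.io"]
  resources: ["tokenreviews"]
  verbs: ["create"]
```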
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "Unauthorized",
"reason": "Unauthorized",
"code": 401
}
I added this new permission to the aws-cloud-controller-manager ClusterRole:
rules:
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
Try hitting the metrics endpoint again:
# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP apiserver_client_certificate_expiration_seconds [ALPHA] Distribution of the remaining lifetime on the certificate used to authenticate a request.
# TYPE apiserver_client_certificate_expiration_seconds histogram
apiserver_client_certificate_expiration_seconds_bucket{le="0"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="1800"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="3600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="7200"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="21600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="43200"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="86400"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="172800"} 0
...
I finally got the metrics.
I guess we need nonResourceURLs: ["/metrics"] permissions on its Role/ClusterRole, plus subjectaccessreviews and tokenreviews permissions for aws-cloud-controller-manager.
That's some really thorough investigation. Kudos to you Jose.
After discussing it internally, we've decided that scraping these metrics would add a lot of new time series to our monitoring platform for little gain. We could monitor the state of the DaemonSet instead of relying on the up metric exposed by the application itself. All the other metrics exposed by the application seem too "low level", and I don't think we would use them for our alerting.
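For instance, a DaemonSet-state alert could be built on kube-state-metrics with an expression along these lines (metric and label names from kube-state-metrics; the exact selector is an assumption):

```promql
# Fires when any aws-cloud-controller-manager pod is unavailable (sketch)
kube_daemonset_status_number_unavailable{namespace="kube-system", daemonset="aws-cloud-controller-manager"} > 0
```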
What do you think?
It's up to you; if you don't think the metrics are relevant, then so be it ;)
It seems that these components are already covered by these rules, so we should get alerted when they are not available. We can close this one.
Motivation
There are still some leftover apps (although not a lot) that still need to be moved to a service monitor. Without this, we will not be able to tear down our Prometheus stack.
To easily find out what is not monitored via service monitors, you can connect to an MC and WC Prometheus using opsctl open -i -a prometheus --workload-cluster= and check out the targets page. We already identified and resolved most of the missing service monitors, but identified one turtles app that's not done yet.
TODO
Service monitors here are only relevant for CAPI
Outcome