giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273

Service monitors for cloud-controller-manager-app #3538

Closed · Rotfuks closed this issue 2 months ago

Rotfuks commented 3 months ago

Motivation

There are still a few leftover apps that need a service monitor. Without those, we will not be able to tear down our Prometheus stack.

To find out what is not yet monitored via service monitors, you can connect to an MC or WC Prometheus using opsctl open -i -a prometheus --workload-cluster= and check the targets page. We already identified and resolved most of the missing service monitors, but found one turtles app that's not done yet.

TODO

Service monitors here are only relevant for CAPI

Outcome

fiunchinho commented 2 months ago

@Rotfuks the cloud controller manager is a component that is deployed very early in the bootstrap process, probably before the ServiceMonitor CRD is present. Should we deploy it as a new app, similar to what's done with cilium-servicemonitors-app? Or will the CRDs be present so we can deploy it with the app?

Rotfuks commented 2 months ago

I believe someone else from @giantswarm/team-atlas might be of more help when it comes to that question :)

QuentinBisson commented 2 months ago

It seems that this app is required in the bootstrap of a cluster, so unless you can make cluster creation depend on the prometheus-operator-crds, you should consider making it another app, yes :(
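
For illustration, such a standalone app would essentially just ship a manifest along these lines (a sketch only: the name, label selector, and port are assumptions, and the controller would also need a Service exposing its metrics port):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-cloud-controller-manager          # hypothetical name
  namespace: kube-system
spec:
  endpoints:
  - port: metrics                              # assumed port name on a matching Service
    scheme: https
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: aws-cloud-controller-manager    # assumed label

Shipping this in its own chart, like cilium-servicemonitors-app, means it can be installed after the prometheus-operator CRDs exist, independent of the bootstrap ordering.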

fiunchinho commented 2 months ago

@giantswarm/team-atlas how were these components being monitored up until now? I can't find a metrics endpoint for them.

QuantumEnigmaa commented 2 months ago

From what I see, these weren't monitored at all: [screenshot: no matching targets in Prometheus]

No results, whereas:

kube-system           aws-cloud-controller-manager-fvrn4                                  1/1     Running     1 (42d ago)     42d                                                                           
kube-system           aws-cloud-controller-manager-fxrbp                                  1/1     Running     1 (42d ago)     42d
kube-system           aws-cloud-controller-manager-wd8sm                                  1/1     Running     1 (42d ago)     42d

QuentinBisson commented 2 months ago

Could it be because bind-address is not set as an argument, so it can only be scraped via localhost?
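
For context, upstream cloud-controller-manager binaries take a --bind-address flag for the secure serving endpoint; a pod spec fragment like the following (hypothetical values, shown only to illustrate the flag) would make it reachable beyond localhost:

containers:
- name: aws-cloud-controller-manager
  args:
  - --cloud-provider=aws           # illustrative; other args omitted
  - --bind-address=0.0.0.0         # listen on all interfaces instead of 127.0.0.1
  - --secure-port=10267            # assumed to match our current deployment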

fiunchinho commented 2 months ago

These apps are not exposing any metrics port AFAIK. So I'm not sure what needs to be done.

QuantumEnigmaa commented 2 months ago

Then I would say that the first thing to do would be to expose the metrics endpoints in those apps.

fiunchinho commented 2 months ago

I don't think they have one, that's what I meant.

QuentinBisson commented 2 months ago

I'm confused because the main.go file of the aws cloud controller manager does register metrics 🤔 I can check in 2 weeks :)

fiunchinho commented 2 months ago

Yes, it does, but I don't see any HTTP handler exposing those metrics.

fiunchinho commented 2 months ago

I decided to investigate aws-cloud-controller-manager in more detail.

Reading the source code, I could see how the controller was collecting metrics: https://github.com/search?q=repo%3Akubernetes%2Fcloud-provider-aws+%2Fcloudprovider_aws%2F&type=code

But in the code I couldn't see any HTTP endpoint being exposed for the metrics. I searched GitHub and forums, and asked on the Kubernetes Slack, but I did not get any answers.

I checked the parameters that we currently pass to the aws-cloud-controller-manager container, and one of them is --secure-port=10267. So I port-forwarded that port and hit the /metrics endpoint, which returned:

Internal Server Error: "/metrics": subjectaccessreviews.authorization.k8s.io is forbidden: User "system:serviceaccount:kube-system:aws-cloud-controller-manager" cannot create resource "subjectaccessreviews" in API group "authorization.k8s.io" at the cluster scope
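
For the record, the probe looked roughly like this (a sketch; the pod name is taken from the listing above):

kubectl -n kube-system port-forward pod/aws-cloud-controller-manager-fvrn4 10267:10267
# in another shell; -k because the serving certificate is self-signed
curl -k https://localhost:10267/metrics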

Apparently this is a Kubernetes resource used to determine whether a given user or service account has permission to perform a specific action on a resource. So I tried updating the ClusterRole for aws-cloud-controller-manager to add the SAR resource, and hit the endpoint again.
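
The extra rule would look something like this (the exact manifest isn't shown in this thread, so treat it as a sketch):

# allow the controller's delegated authorizer to create SubjectAccessReviews
- apiGroups: ["authorization.k8s.io"]
  resources: ["subjectaccessreviews"]
  verbs: ["create"]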

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

I need to pass a service account token. So I created a new token for the aws-cloud-controller-manager SA with kubectl create token aws-cloud-controller-manager -n kube-system --duration 10m and used it in my request.
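
The authenticated request, roughly (a sketch, assuming the same port-forward as before):

TOKEN=$(kubectl create token aws-cloud-controller-manager -n kube-system --duration 10m)
curl -k -H "Authorization: Bearer ${TOKEN}" https://localhost:10267/metrics

The response: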

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

Let's check the logs in the controller:

E0808 14:51:04.851806       1 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, tokenreviews.authentication.k8s.io is forbidden: User \"system:serviceaccount:kube-system:aws-cloud-controller-manager\" cannot create resource \"tokenreviews\" in API group \"authentication.k8s.io\" at the cluster scope]"

The tokenreviews API is used by Kubernetes components to verify the validity of bearer tokens, which are often used for authentication. So I added that to the aws-cloud-controller-manager ClusterRole and tried again.
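
Again as a sketch, the corresponding rule:

# allow the controller's delegated authenticator to validate bearer tokens
- apiGroups: ["authentication.k8s.io"]
  resources: ["tokenreviews"]
  verbs: ["create"]

The result: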

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}

I added these new permissions to the aws-cloud-controller-manager ClusterRole:

rules:
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]

Then I tried hitting the metrics endpoint again:

# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP apiserver_client_certificate_expiration_seconds [ALPHA] Distribution of the remaining lifetime on the certificate used to authenticate a request.
# TYPE apiserver_client_certificate_expiration_seconds histogram
apiserver_client_certificate_expiration_seconds_bucket{le="0"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="1800"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="3600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="7200"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="21600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="43200"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="86400"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="172800"} 0
...

I finally got the metrics.

I guess we need

QuentinBisson commented 2 months ago

That's some really thorough investigation. Kudos to you, Jose.

fiunchinho commented 2 months ago

After discussing it internally, we've decided that scraping these metrics would add a lot of new time series to our monitoring platform for little gain. We could monitor the state of the DaemonSet (see the sketch below) instead of relying on the up metric exposed by the application itself. All the other metrics exposed by the application seem too "low level", and I don't think we would use them for our alerting.

What do you think?
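
For illustration, such an alert could be built on kube-state-metrics' DaemonSet metrics, roughly like this (a sketch; the rule name, threshold, and labels are assumptions):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: aws-cloud-controller-manager   # hypothetical
  namespace: kube-system
spec:
  groups:
  - name: aws-cloud-controller-manager
    rules:
    - alert: AWSCloudControllerManagerDown
      # fires when any desired pod of the DaemonSet is unavailable
      expr: kube_daemonset_status_number_unavailable{daemonset="aws-cloud-controller-manager"} > 0
      for: 15m
      labels:
        severity: page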

QuentinBisson commented 2 months ago

It's up to you. If you don't think the metrics are relevant, then so be it ;)

fiunchinho commented 2 months ago

It seems that these components are already covered by these rules, so we should get alerted when they are not available. We can close this one.