nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Thanos on nerc-ocp-prod returns a 403 when querying for metrics #683

Closed naved001 closed 3 months ago

naved001 commented 3 months ago

From Slack:

I am getting a 403 when querying data from thanos (thanos-querier-openshift-monitoring.apps.shift.nerc.mghpcc.org) on nerc-ocp-prod.

It stopped working after Aug 5th. This is where I gather the metrics for billing purposes.

Luckily the regular prometheus endpoint is responding correctly (prometheus-k8s-openshift-monitoring.apps.shift.nerc.mghpcc.org), so I just gathered the data from the last week before it is no longer retained.

Response to the query:

ipdb> response
<Response [403]>
ipdb> response.text
'Forbidden (user=system:serviceaccount:naved-test:metrics-reader, verb=get, resource=prometheuses, subresource=api)\n'

Justin looked at the pod logs for thanos-querier:

thanos-querier-588dd7d7d8-xtnfn kube-rbac-proxy-web I0813 17:38:22.489322 1 log.go:194] http: TLS handshake error from 10.128.16.1:59056: write tcp 10.129.16.14:9091->10.128.16.1:59056: write: connection reset by peer

Thorsten noticed that the prometheus PVC has crossed 85% usage.


naved001 commented 3 months ago

'Forbidden (user=system:serviceaccount:naved-test:metrics-reader, verb=get, resource=prometheuses, subresource=api)\n'

Based on that, I decided to use my account token (which has more permissions than the limited service account token) to query thanos-querier and that worked.

I did try to give a test serviceaccount permissions on the resource "prometheuses" but that didn't work; I will try it again just to make sure that I didn't make a mistake.

Note that all the pods were restarted on Aug 5th (the maintenance window) so I suspect that some update changed the behavior for thanos.

naved001 commented 3 months ago

@larsks I created a clusterrole that looks like:

naved@computer ~ % oc get clusterrole billing-metrics-reader-cr -o yaml |oc neat
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: billing-metrics-reader-cr
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
- apiGroups:
  - monitoring.coreos.com
  resources:
  - prometheuses
  verbs:
  - get
  - list

And a clusterrolebinding that binds that clusterrole to a serviceaccount in my test namespace:

naved@computer ~ % oc get clusterrolebinding billing-metrics-reader-crb -o yaml |oc neat
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: billing-metrics-reader-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: billing-metrics-reader-cr
subjects:
- kind: ServiceAccount
  name: metrics-reader
  namespace: naved-test

With the new token from the serviceaccount, I still got the same error:

ipdb> response
<Response [403]>
ipdb> response.text
'Forbidden (user=system:serviceaccount:naved-test:metrics-reader, verb=get, resource=prometheuses, subresource=api)\n'
ipdb>
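
The error body names the exact verb, resource, and subresource that were denied, so it can help to pull those fields out when comparing against a role. Below is a minimal sketch; the helper names `parse_forbidden` and `rbac_resource` are hypothetical, and the message format is assumed from the single example above rather than from any specification. Kubernetes RBAC rules spell a subresource as `resource/subresource`, which is what the second helper produces.

```python
import re

# Pattern for the "Forbidden (...)" body seen in this thread. The format is
# assumed from that one example, not taken from a documented contract.
_FORBIDDEN_RE = re.compile(r"Forbidden \((?P<fields>[^)]*)\)")


def parse_forbidden(text):
    """Return the key=value pairs found inside a 'Forbidden (...)' body."""
    match = _FORBIDDEN_RE.search(text)
    if match is None:
        return {}
    fields = {}
    for part in match.group("fields").split(","):
        # Split on the first "=" only, so values may contain ":" freely.
        key, _, value = part.strip().partition("=")
        fields[key] = value
    return fields


def rbac_resource(fields):
    """Join resource and subresource the way RBAC rules spell them."""
    resource = fields.get("resource", "")
    subresource = fields.get("subresource", "")
    return f"{resource}/{subresource}" if subresource else resource


message = (
    "Forbidden (user=system:serviceaccount:naved-test:metrics-reader, "
    "verb=get, resource=prometheuses, subresource=api)\n"
)
fields = parse_forbidden(message)
print(fields["verb"], rbac_resource(fields))
```

For the message above, the denied permission is `get` on `prometheuses/api`, which is not the same RBAC resource as plain `prometheuses`.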

Here's the URL that I'm trying to get:

'https://thanos-querier-openshift-monitoring.apps.shift.nerc.mghpcc.org/api/v1/query_range?query=kube_pod_resource_request{unit="cores"} unless on(pod, namespace) kube_pod_status_unschedulable&start=2024-08-14T00:00:00Z&end=2024-08-14T23:59:59Z&step=15m'
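
The URL above carries spaces, braces, and quotes unescaped in the query string. `requests` will generally encode these itself, but building the query string explicitly makes the request easier to reproduce with other tools. This is a sketch only: the helper name `build_query_range_url` is hypothetical, and the parameter values are the ones from this thread.

```python
from urllib.parse import urlencode

BASE = "https://thanos-querier-openshift-monitoring.apps.shift.nerc.mghpcc.org"


def build_query_range_url(base, query, start, end, step):
    """Build a query_range URL with the PromQL expression percent-encoded."""
    params = {"query": query, "start": start, "end": end, "step": step}
    # urlencode escapes braces, quotes, and parentheses, and encodes spaces as "+".
    return f"{base}/api/v1/query_range?{urlencode(params)}"


query = (
    'kube_pod_resource_request{unit="cores"} '
    "unless on(pod, namespace) kube_pod_status_unschedulable"
)
url = build_query_range_url(
    BASE, query, "2024-08-14T00:00:00Z", "2024-08-14T23:59:59Z", "15m"
)
print(url)
```

An alternative with the same effect is to pass the dictionary as `params=` to `requests.get` and let the library do the encoding.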

larsks commented 3 months ago

I think the solution is documented here. We need to update the clusterrole so that it looks like:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: billing-metrics-reader-cr
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
- apiGroups:
  - monitoring.coreos.com
  resources:
  - prometheuses
  - prometheuses/api
  verbs:
  - get
  - list

I've tested this in place; before making the change:

>>> token='...'
>>> url='https://thanos-querier-openshift-monitoring.apps.shift.nerc.mghpcc.org/api/v1/query_range?query=kube_pod_resource_request{unit="cores"} unless on(pod, namespace) kube_pod_status_unschedulable&start=2024-08-14T00:00:00Z&end=2024-08-14T23:59:59Z&step=15m'
>>> requests.get(url, headers={"Authorization": f"Bearer {token}"})
<Response [403]>

After updating the clusterrole (and waiting a bit; there seems to be some latency between making the change and the change taking effect):

>>> requests.get(url, headers={"Authorization": f"Bearer {token}"})
<Response [200]>
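
Because the change did not take effect immediately, a script that checks the fix may want to poll for a short while rather than fail on the first 403. The following is a rough sketch; `wait_for_status` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions, since the actual latency was not measured here. The example drives it with canned status codes instead of a live request.

```python
import time


def wait_for_status(fetch_status, wanted=200, attempts=10, delay=5.0,
                    sleep=time.sleep):
    """Call fetch_status() until it returns `wanted` or attempts run out.

    Returns the last status code seen.
    """
    status = None
    for attempt in range(attempts):
        status = fetch_status()
        if status == wanted:
            return status
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return status


# Stand-in for the real request: two 403s, then a 200.
canned = iter([403, 403, 200])
result = wait_for_status(lambda: next(canned), delay=0)
print(result)
```

Against the real endpoint, `fetch_status` could be something like `lambda: requests.get(url, headers={"Authorization": f"Bearer {token}"}).status_code`.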

naved001 commented 3 months ago

@larsks Thanks a lot!