naved001 closed this issue 3 months ago
Querying thanos-querier with the limited serviceaccount token returns:
'Forbidden (user=system:serviceaccount:naved-test:metrics-reader, verb=get, resource=prometheuses, subresource=api)\n'
Based on that, I decided to use my own account token (which has more permissions than the limited serviceaccount token) to query thanos-querier, and that worked.
I did try giving a test serviceaccount permissions on the "prometheuses" resource, but that didn't work; I'll try it again to make sure I didn't make a mistake.
Note that all the pods were restarted on Aug 5th (the maintenance window), so I suspect some update changed the behavior for thanos.
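The error message spells out the exact tuple RBAC is checking (verb=get, resource=prometheuses, subresource=api), and a SubjectAccessReview can reproduce that decision without touching the endpoint. A minimal sketch with the kubernetes Python client, assuming admin credentials in the local kubeconfig:

from kubernetes import client, config

config.load_kube_config()  # assumes an admin kubeconfig, e.g. after `oc login`
authz = client.AuthorizationV1Api()

# Ask the API server to evaluate the exact tuple from the 403 message,
# including the subresource, on behalf of the serviceaccount.
review = client.V1SubjectAccessReview(
    spec=client.V1SubjectAccessReviewSpec(
        user="system:serviceaccount:naved-test:metrics-reader",
        resource_attributes=client.V1ResourceAttributes(
            group="monitoring.coreos.com",
            resource="prometheuses",
            subresource="api",
            verb="get",
        ),
    )
)
result = authz.create_subject_access_review(review)
print(result.status.allowed)  # False until a role also covers prometheuses/api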
@larsks I created a clusterrole that looks like:
naved@computer ~ % oc get clusterrole billing-metrics-reader-cr -o yaml |oc neat
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: billing-metrics-reader-cr
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
- apiGroups:
  - monitoring.coreos.com
  resources:
  - prometheuses
  verbs:
  - get
  - list
And a clusterrolebinding that binds the role to a serviceaccount in my test namespace:
naved@computer ~ % oc get clusterrolebinding billing-metrics-reader-crb -o yaml |oc neat
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: billing-metrics-reader-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: billing-metrics-reader-cr
subjects:
- kind: ServiceAccount
  name: metrics-reader
  namespace: naved-test
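For completeness: the serviceaccount token used below can be minted through the TokenRequest API. A sketch with the kubernetes Python client (`oc create token metrics-reader -n naved-test` should be the CLI equivalent):

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Request a short-lived token bound to the serviceaccount; empty audiences
# default to the API server's own audiences.
request = client.AuthenticationV1TokenRequest(
    spec=client.V1TokenRequestSpec(audiences=[], expiration_seconds=3600)
)
token = core.create_namespaced_service_account_token(
    name="metrics-reader", namespace="naved-test", body=request
).status.token
print(token)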
With the new token from the serviceaccount, I still got the same error:
ipdb> response
<Response [403]>
ipdb> response.text
'Forbidden (user=system:serviceaccount:naved-test:metrics-reader, verb=get, resource=prometheuses, subresource=api)\n'
ipdb>
Here's the URL that I'm trying to GET:
'https://thanos-querier-openshift-monitoring.apps.shift.nerc.mghpcc.org/api/v1/query_range?query=kube_pod_resource_request{unit="cores"} unless on(pod, namespace) kube_pod_status_unschedulable&start=2024-08-14T00:00:00Z&end=2024-08-14T23:59:59Z&step=15m'
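As an aside, that raw URL has unencoded spaces and braces in the PromQL expression; letting requests build the query string avoids hand-escaping. A sketch (the token placeholder is assumed):

import requests

token = "..."  # serviceaccount or personal token

base = "https://thanos-querier-openshift-monitoring.apps.shift.nerc.mghpcc.org"
params = {
    "query": 'kube_pod_resource_request{unit="cores"}'
             " unless on(pod, namespace) kube_pod_status_unschedulable",
    "start": "2024-08-14T00:00:00Z",
    "end": "2024-08-14T23:59:59Z",
    "step": "15m",
}
# requests percent-encodes the PromQL expression for us.
response = requests.get(
    f"{base}/api/v1/query_range",
    params=params,
    headers={"Authorization": f"Bearer {token}"},
)
print(response.status_code)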
I think the solution is documented here. We need to update the clusterrole so that it looks like:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: billing-metrics-reader-cr
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
- apiGroups:
  - monitoring.coreos.com
  resources:
  - prometheuses
  - prometheuses/api
  verbs:
  - get
  - list
I've tested this in place; before making the change:
>>> token='...'
>>> url='https://thanos-querier-openshift-monitoring.apps.shift.nerc.mghpcc.org/api/v1/query_range?query=kube_pod_resource_request{unit="cores"} unless on(pod, namespace) kube_pod_status_unschedulable&start=2024-08-14T00:00:00Z&end=2024-08-14T23:59:59Z&step=15m'
>>> requests.get(url, headers={"Authorization": f"Bearer {token}"})
<Response [403]>
After updating the clusterrole (and waiting a bit; there seems to be some latency between making the change and it taking effect):
>>> requests.get(url, headers={"Authorization": f"Bearer {token}"})
<Response [200]>
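Given that propagation delay, a short poll avoids guessing at how long to wait. A sketch (url and token as in the session above):

import time
import requests

token = "..."  # serviceaccount token from above
url = "..."    # the query_range URL from above

# Retry until the RBAC change takes effect instead of guessing at the delay.
for _ in range(30):
    response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
    if response.status_code != 403:
        break
    time.sleep(10)
print(response.status_code)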
@larsks Thanks a lot!
From slack:
I am getting a 403 when querying data from thanos (thanos-querier-openshift-monitoring.apps.shift.nerc.mghpcc.org) on nerc-ocp-prod.
It stopped working after Aug 5th. This is where I gather the metrics for billing purposes.
Luckily the regular prometheus endpoint is responding correctly (prometheus-k8s-openshift-monitoring.apps.shift.nerc.mghpcc.org), so I gathered the data from the last week before it ages out of retention.
The response to the query was the 403 Forbidden error shown at the top of this issue.
Justin looked at the pod logs for thanos-querier.
Thorsten noticed that the prometheus PVC has crossed 85% usage.
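For reference, the PVC fill level can be checked through the same query endpoint. A sketch (kubelet_volume_stats_* are standard kubelet metrics; the exact query is an assumption, not necessarily the one Thorsten used):

import requests

token = "..."  # any token allowed to query the endpoint
base = "https://thanos-querier-openshift-monitoring.apps.shift.nerc.mghpcc.org"
params = {
    # Fraction of each monitoring PVC that is in use.
    "query": 'kubelet_volume_stats_used_bytes{namespace="openshift-monitoring"}'
             ' / kubelet_volume_stats_capacity_bytes{namespace="openshift-monitoring"}',
}
response = requests.get(
    f"{base}/api/v1/query",
    params=params,
    headers={"Authorization": f"Bearer {token}"},
)
print(response.json())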