Open Kapsztajn opened 1 year ago
Hi @Kapsztajn! The roleAttributePath field is what allows Grafana to translate the claims in ID tokens issued by your OpenID provider into Grafana roles (Admin, Editor, Viewer). It's a bit cryptic, but the docs are here for reference. If you check the Grafana pod logs, they should show you relevant error messages.
Regarding what to put for roleAttributePath, one strategy is to configure your identity provider to attach a custom claim to ID tokens it issues to clients. For example, if you set the claim grafana_role to an array containing allowed Grafana roles (based on identity provider-specific configuration), your roleAttributePath could be something like:
roleAttributePath: "contains(grafana_role[*], 'Admin') && 'Admin' || contains(grafana_role[*], 'Editor') && 'Editor' || 'Viewer'"
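To make the intent of that expression easier to follow, here is a minimal plain-Python sketch of the same decision order (Admin first, then Editor, otherwise Viewer). It does not run JMESPath and is not Grafana's code; the function name `resolve_grafana_role` is hypothetical, and treating a missing or non-list `grafana_role` claim as "no roles" is a simplifying assumption (a real JMESPath engine may handle that case differently).

```python
def resolve_grafana_role(claims: dict) -> str:
    """Mirror the precedence of the example expression in plain Python.

    Assumption: `grafana_role` is an array of role names in the ID token.
    A missing or non-list claim is treated as an empty list here, which
    is a simplification and not necessarily what Grafana does.
    """
    roles = claims.get("grafana_role")
    if not isinstance(roles, list):
        roles = []

    # contains(grafana_role[*], 'Admin') && 'Admin'
    if "Admin" in roles:
        return "Admin"
    # || contains(grafana_role[*], 'Editor') && 'Editor'
    if "Editor" in roles:
        return "Editor"
    # || 'Viewer'
    return "Viewer"


if __name__ == "__main__":
    print(resolve_grafana_role({"grafana_role": ["Editor", "Admin"]}))
    print(resolve_grafana_role({"grafana_role": ["Editor"]}))
    print(resolve_grafana_role({"email": "user@example.com"}))
```

The order of the checks matters: a user whose claim lists both Editor and Admin resolves to Admin, because the Admin test comes first.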
Hi @kralicky
I got this error from Grafana pod:
logger=context userId=2 orgId=1 uname=kamil.kwiaton@hostersi.pl t=2023-02-06T22:18:13.216315857Z level=error msg="Internal server error" error="[plugin.downstreamError] failed to query data: received empty response from prometheus" remote_addr=10.1.0.6 traceID= logger=context userId=2 orgId=1 uname=kamil.kwiaton@hostersi.pl t=2023-02-06T22:18:13.21646746Z level=error msg="Request Completed" method=POST path=/api/ds/query status=500 remote_addr=10.1.0.6 time_ms=204 duration=204.862654ms size=116 referer="https://grafana.hidden.tech/d/1e83e204be502391f69d3a826675d3df/infrastructure-overview?orgId=1&refresh=10s" handler=/api/ds/query
Try turning on Grafana debug logs, then inspect the ID token it obtains from your OpenID server. To change the log level, edit the MonitoringCluster object created by Opni and set:
spec:
  grafana:
    config:
      log:
        level: debug
The debug logs should show info about the authentication decisions Grafana is making, as well as the id tokens (in plaintext, so redact any secrets before sharing)
Also, check for any unusual logs in the Opni Gateway logs when you log into grafana.
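If you only have the raw token string rather than the decoded JSON from the debug logs, the claims can be read locally. The following is a small sketch using only the Python standard library; the helper names (`decode_jwt_claims`, `make_unsigned_demo_token`) are hypothetical, and the demo token is built in the script itself. It only decodes the payload and does not verify the signature, so it is suitable for inspection only.

```python
import base64
import json


def _b64url(data: bytes) -> str:
    # JWT segments are base64url-encoded without '=' padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_unsigned_demo_token(claims: dict) -> str:
    """Build a fake three-segment token for demonstration only."""
    header = _b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.demo-signature"


def decode_jwt_claims(token: str) -> dict:
    """Return the claims from the payload segment WITHOUT verifying it."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # Restore the padding that base64url encoding in JWTs strips off.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


if __name__ == "__main__":
    token = make_unsigned_demo_token({"sub": "abc", "email": "user@example.com"})
    for name, value in decode_jwt_claims(token).items():
        print(f"{name}: {value}")
```

Comparing the printed claim names against the configured identifyingClaim and roleAttributePath shows quickly whether the provider sends the claims the configuration expects.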
@kralicky Nothing strange in Opni Gateway. I enabled debug log level and got more info from Grafana:
logger=tsdb.prometheus t=2023-02-07T01:36:48.977274558Z level=debug msg="Sending query" start=2023-02-07T00:36:48.321Z end=2023-02-07T01:36:48.321Z step=15s query="label_replace(sum by(namespace, __tenant_id__) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate), \"cluster_id\", \"$1\", \"__tenant_id__\", \"(.*)\") * on(cluster_id) group_left(friendly_name) group without(pod, instance) (opni_cluster_info)"
logger=tsdb.prometheus t=2023-02-07T01:36:49.057392077Z level=error msg="Instant query failed" query="label_replace(sum by(namespace, __tenant_id__) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate), \"cluster_id\", \"$1\", \"__tenant_id__\", \"(.*)\") * on(cluster_id) group_left(friendly_name) group without(pod, instance) (opni_cluster_info)" err="client_error: client error: 401"
logger=context userId=2 orgId=1 uname=kamil.kwiaton@hostersi.pl t=2023-02-07T01:36:49.057625283Z level=info msg="Request Completed" method=POST path=/api/ds/query status=400 remote_addr=10.1.0.6 time_ms=199 duration=199.445077ms size=62 referer="https://grafana.hidden.tech/d/1e83e204be502391f69d3a826675d3df/infrastructure-overview?orgId=1&refresh=10s" handler=/api/ds/query
logger=tsdb.prometheus t=2023-02-07T01:36:49.057841389Z level=error msg="Range query failed" query="sum(rate(kubelet_runtime_operations_errors_total{job=\"kubelet\",}[2m15s])) by (__tenant_id__, operation_type) * on(__tenant_id__) group_left(friendly_name) label_replace(group without(pod, instance) (opni_cluster_info), \"__tenant_id__\", \"$1\", \"cluster_id\", \"(.*)\")" err="client_error: client error: 401"
logger=context userId=2 orgId=1 uname=kamil.kwiaton@hostersi.pl t=2023-02-07T01:36:49.057919091Z level=info msg="Request Completed" method=POST path=/api/ds/query status=400 remote_addr=10.1.0.6 time_ms=301 duration=301.419074ms size=62 referer="https://grafana.hidden.tech/d/1e83e204be502391f69d3a826675d3df/infrastructure-overview?orgId=1&refresh=10s" handler=/api/ds/query
logger=tsdb.prometheus t=2023-02-07T01:36:49.065603494Z level=debug msg="Sending query" start=2023-02-07T00:36:48.322Z end=2023-02-07T01:36:48.322Z step=15s query="label_replace(sum by(namespace, __tenant_id__) (node_namespace_pod_container:container_memory_rss), \"cluster_id\", \"$1\", \"__tenant_id__\", \"(.*)\") * on(cluster_id) group_left(friendly_name) group without(pod, instance) (opni_cluster_info)"
logger=tsdb.prometheus t=2023-02-07T01:36:49.067280439Z level=error msg="Instant query failed" query="label_replace(sum by(namespace, __tenant_id__) (node_namespace_pod_container:container_memory_rss), \"cluster_id\", \"$1\", \"__tenant_id__\", \"(.*)\") * on(cluster_id) group_left(friendly_name) group without(pod, instance) (opni_cluster_info)" err="client_error: client error: 401"
logger=context userId=2 orgId=1 uname=kamil.kwiaton@hostersi.pl t=2023-02-07T01:36:49.067361541Z level=info msg="Request Completed" method=POST path=/api/ds/query status=400 remote_addr=10.1.0.6 time_ms=160 duration=160.599048ms size=62 referer="https://grafana.hidden.tech/d/1e83e204be502391f69d3a826675d3df/infrastructure-overview?orgId=1&refresh=10s" handler=/api/ds/query
These 401 errors are curious to me. Is there a way to configure Grafana so that all users get the Admin role by default instead of Viewer? When I try to edit the configmap directly, the change doesn't stick. Maybe the issue is that I can't configure roleAttributePath.
Is the Role Binding set up correctly with my email if identifyingClaim is email?
Yeah looks like it could be an rbac issue. If your rbac is set up such that a given user has access to no clusters, it might return a 401 error when querying metrics.
If identifyingClaim is email, check that the Subject in your role binding is kamil.kwiaton@hostersi.pl. You can exec into the gateway pod and run opni access-matrix to show the full permissions table for all defined users.
You might also need to adjust roleAttributePath, but technically it shouldn't stop you from viewing metrics if you aren't an admin.
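The check described above (does the value of the identifying claim appear as a role binding subject?) can be sketched as follows. This is an illustration, not Opni's implementation: the function names are hypothetical, and the exact, case-sensitive string comparison is an assumption. The sample claim values are taken from the ID token shown in the logs in this thread.

```python
from typing import Iterable, Optional


def subject_for(claims: dict, identifying_claim: str) -> Optional[str]:
    """Return the claim value that would identify the user, if present."""
    value = claims.get(identifying_claim)
    return value if isinstance(value, str) and value else None


def is_bound(claims: dict, identifying_claim: str, subjects: Iterable[str]) -> bool:
    """True if the identifying claim value matches a role binding subject.

    Assumption: subjects are compared as exact, case-sensitive strings.
    """
    subject = subject_for(claims, identifying_claim)
    return subject is not None and subject in set(subjects)


if __name__ == "__main__":
    claims = {
        "email": "kamil.kwiaton@hostersi.pl",
        "oid": "76dcd1c7-e589-489c-9bca-f0552fcf2175",
        "sub": "d75bDesL3N2B_LM-OaP_5AyrAv5k4Gl8t9K8hHj63q8",
    }
    subjects = ["kamil.kwiaton@hostersi.pl"]
    for claim in ("email", "oid", "sub"):
        print(claim, is_bound(claims, claim, subjects))
```

Switching identifyingClaim changes which token value is looked up, so the role binding subject has to be updated to match each time.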
bash-5.1# opni access-matrix
TENANT ID 76dcd1c7-e589-489c-9bca-f0552fcf2175 kamil.kwiaton@hostersi.pl
0d0d1d66-a94b-40b4-90d4-8c2af0082d21 ✅ ✅
2991968b-91b7-4b33-9566-4469a5f494a0 ✅ ✅
40ea968b-86a0-4e26-af63-b5f3a0df04a7 ✅ ✅
4cc83bd0-4a8d-4b97-a40c-03eda158c32e ✅ ✅
I checked the access matrix, and my email is assigned to all clusters. I also tried 76dcd1c7-e589-489c-9bca-f0552fcf2175 after changing values.yaml to identifyingClaim: "oid", since this string is my account identifier in Azure AD, but I get the same 401 error in Grafana:
grafana logger=tsdb.prometheus t=2023-02-10T13:20:48.164082341Z level=error msg="Range query failed" query="1 - (:node_memory_MemAvailable_bytes:sum / on(__tenant_id__) sum by(__tenant_id__) (node_memory_MemTotal_bytes)) * on(__tenant_id__) group_left(friendly_name) label_replace(group without(pod, instance) (opni_cluster_info), \"__tenant_id__\", \"$1\", \"cluster_id\", \"(.*)\")" err="client_error: client error: 401"
grafana logger=auth t=2023-02-10T13:20:48.164133442Z level=debug msg="token needs rotation" tokenId=2 authTokenSeen=true rotatedAt=2023-02-10T13:10:48Z
grafana logger=tsdb.prometheus t=2023-02-10T13:20:48.177085407Z level=debug msg="Sending query" start=2023-02-10T12:20:47.597Z end=2023-02-10T13:20:47.597Z step=2m0s query="sum(rate(kubelet_runtime_operations_errors_total{job=\"kubelet\",}[2m15s])) by (__tenant_id__, operation_type) * on(__tenant_id__) group_left(friendly_name) label_replace(group without(pod, instance) (opni_cluster_info), \"__tenant_id__\", \"$1\", \"cluster_id\", \"(.*)\")"
grafana logger=auth t=2023-02-10T13:20:48.178365643Z level=debug msg="auth token rotated" affected=1 auth_token_id=2 userId=2
grafana logger=context userId=2 orgId=1 uname=kamil.kwiaton@hostersi.pl t=2023-02-10T13:20:48.178461545Z level=info msg="Request Completed" method=POST path=/api/ds/query status=400 remote_addr=10.1.0.5 time_ms=422 duration=422.060876ms size=112 referer="https://grafana.hidden.tech/d/1e83e204be502391f69d3a826675d3df/infrastructure-overview?orgId=1&refresh=10s" handler=/api/ds/query
grafana logger=tsdb.prometheus t=2023-02-10T13:20:48.178873957Z level=error msg="Range query failed" query="sum(rate(kubelet_runtime_operations_errors_total{job=\"kubelet\",}[2m15s])) by (__tenant_id__, operation_type) * on(__tenant_id__) group_left(friendly_name) label_replace(group without(pod, instance) (opni_cluster_info), \"__tenant_id__\", \"$1\", \"cluster_id\", \"(.*)\")" err="client_error: client error: 401"
grafana logger=auth t=2023-02-10T13:20:48.178933259Z level=debug msg="token needs rotation" tokenId=2 authTokenSeen=true rotatedAt=2023-02-10T13:10:48Z
grafana logger=context userId=2 orgId=1 uname=kamil.kwiaton@hostersi.pl t=2023-02-10T13:20:48.179211566Z level=info msg="Request Completed" method=POST path=/api/ds/query status=400 remote_addr=10.1.0.5 time_ms=122 duration=122.822254ms size=62 referer="https://grafana.hidden.tech/d/1e83e204be502391f69d3a826675d3df/infrastructure-overview?orgId=1&refresh=10s" handler=/api/ds/query
grafana logger=auth t=2023-02-10T13:20:48.183561189Z level=debug msg="auth token rotated" affected=0 auth_token_id=2 userId=2
To me, it looks like identifyingClaim: "email" (or any other value here) is not working correctly with AAD. Which OpenID provider did you test this with? Maybe I could switch from Azure AD to that one.
Some more logs from logging in to Grafana:
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.512778773Z level=debug msg="Getting user info"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.512785973Z level=debug msg="Extracting user info from OAuth token"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.513107373Z level=debug msg="Received id_token" raw_json="{\"aud\":\"0d70811a-01a8-4908-8066-9e74edcc2f59\",\"iss\":\"https://login.microsoftonline.com/HIDDEN/v2.0\",\"iat\":1676041519,\"nbf\":1676041519,\"exp\":1676045419,\"aio\":\"AeQAG/8TAAAAO0oBDzXhPRggyL2Ij6raCr2T06IBqJAEtRZuLM/7gFw9aR+F8O2NicMXUwTiCubtmjdDcJDz1Y3UcCPr6kG2RbR815ZEKTu7RK1dBw2cYcA5xbFYbyGNP3SoqeLf+UMj4rCJsfFi5U0stvVvoqZQwol1Nci6cqc43ODeRGQcbO+ynda/oF1LOqHZZvxEOpiga5PZTYlAJX42TrVJES6n3Cr44Kod5wjG7JYyH8uNJMFLEixGRfCw8qyigw7KwgWgE7tRNPscV2sKow5xYeIEb4M7/l4QJSDBkXoQUgOYqTQ=\",\"email\":\"kamil.kwiaton@hostersi.pl\",\"idp\":\"https://sts.windows.net/b37f6912-3cfa-4041-b867-5ff20368f029/\",\"name\":\"Kamil Kwiaton\",\"oid\":\"76dcd1c7-e589-489c-9bca-f0552fcf2175\",\"preferred_username\":\"kamil.kwiaton@hostersi.pl\",\"rh\":\"0.AXkAiUIh5jPdcEyDPIUi4d_s0xqBcA2oAQhJgGaedO3ML1l5AKc.\",\"sub\":\"d75bDesL3N2B_LM-OaP_5AyrAv5k4Gl8t9K8hHj63q8\",\"tid\":\"e6214289-dd33-4c70-833c-8522e1dfecd3\",\"uti\":\"d8kVk9waS0-PL8MuCkQUAA\",\"ver\":\"2.0\"}" data="Name: Kamil Kwiaton, Displayname: , Login: , Username: , Email: kamil.kwiaton@hostersi.pl, Upn: , Attributes: map[]"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.513133173Z level=debug msg="Getting user info from API"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.588388838Z level=debug msg="HTTP GET" url=https://graph.microsoft.com/oidc/userinfo status="200 OK" response_body="{\"sub\":\"d75bDesL3N2B_LM-OaP_5AyrAv5k4Gl8t9K8hHj63q8\",\"name\":\"Kamil Kwiaton\",\"family_name\":\"Kwiaton\",\"given_name\":\"Kamil\",\"picture\":\"https://graph.microsoft.com/v1.0/me/photo/$value\",\"email\":\"kamil.kwiaton@hostersi.pl\"}"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.588453138Z level=debug msg="Received user info response from API" raw_json="{\"sub\":\"d75bDesL3N2B_LM-OaP_5AyrAv5k4Gl8t9K8hHj63q8\",\"name\":\"Kamil Kwiaton\",\"family_name\":\"Kwiaton\",\"given_name\":\"Kamil\",\"picture\":\"https://graph.microsoft.com/v1.0/me/photo/$value\",\"email\":\"kamil.kwiaton@hostersi.pl\"}" data="Name: Kamil Kwiaton, Displayname: , Login: , Username: , Email: kamil.kwiaton@hostersi.pl, Upn: , Attributes: map[]"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.588464738Z level=debug msg="Processing external user info" source=token data="Name: Kamil Kwiaton, Displayname: , Login: , Username: , Email: kamil.kwiaton@hostersi.pl, Upn: , Attributes: map[]"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.588475638Z level=debug msg="Setting user info name from name field"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.588482438Z level=debug msg="Set user info email from extracted email" email=kamil.kwiaton@hostersi.pl
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.588564638Z level=warn msg="No valid role found. Skipping role sync. In Grafana 10, this will result in the user being assigned the default role and overriding manual assignment. If role sync is not desired, set oauth_skip_org_role_update_sync to false"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.588575338Z level=debug msg="Processing external user info" source=API data="Name: Kamil Kwiaton, Displayname: , Login: , Username: , Email: kamil.kwiaton@hostersi.pl, Upn: , Attributes: map[]"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.588612438Z level=warn msg="No valid role found. Skipping role sync. In Grafana 10, this will result in the user being assigned the default role and overriding manual assignment. If role sync is not desired, set oauth_skip_org_role_update_sync to false"
grafana logger=oauth.generic_oauth t=2023-02-10T15:10:19.588620938Z level=debug msg="Defaulting to using email for user info login" email=kamil.kwiaton@hostersi.pl
As you can see, the attributes from Azure AD (sub, email, etc.) are passed correctly.
Logs from the Opni gateway when I log in to Grafana:
2023-02-10T15:25:18.570Z DEBUG x17 api fwd/forwarder.go:100 => {"method": "POST", "path": "/api/prom/api/v1/query", "to": "127.0.0.1:40807 (plugin_metrics)", "for": "10.244.2.63", "host": "opni-internal.opni.svc:8080", "scheme": "https"}
Grafana appears to be working so far. What does your auth configuration look like? Look for a section like this in the opni-gateway configmap:
auth:
  provider: openid
  openid:
    discovery:
      issuer: https://xxx/
    identifyingClaim: email
    clientID: xxx
    clientSecret: xxx
    scopes: ["openid", "profile", "email"]
    roleAttributePath: "contains(opni_grafana_role[*], 'Admin') && 'Admin' || contains(opni_grafana_role[*], 'Editor') && 'Editor' || 'Viewer'"
I think in opni-gateway configmap I have only this part:
---
apiVersion: v1beta1
kind: AuthProvider
metadata:
  name: openid
spec:
  options:
    discovery:
      issuer: https://login.microsoftonline.com/tenant_id/v2.0
      path: /.well-known/openid-configuration
    identifyingClaim: sub
  type: openid
I have set identifyingClaim to sub again for testing, I can switch it back to email.
Auth is in one more place in that configmap but only one line:
Data
====
config.yaml:
----
apiVersion: v1beta1
kind: GatewayConfig
spec:
  alerting:
    Namespace: opni
    configMap: alertmanager-config
    controllerNodeService: opni-alerting-controller
    controllerStatefulSet: opni-alerting-controller-internal
    workerNodeService: opni-alerting
    workerStatefulSet: opni-alerting-internal
  authProvider: openid
  certs:
To confirm, if you go to https://login.microsoftonline.com/<your_id>/v2.0/.well-known/openid-configuration, does everything look OK there?
Also yeah your configmap looks correct, I copied the wrong one earlier. The one I pasted should be in the Gateway custom resource, and only some of the fields are copied into the configmap (only the ones needed to verify id tokens)
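One thing worth checking in the discovery document is the `issuer` field. OpenID Connect Discovery requires the advertised issuer to be identical to the issuer URL used to build the `/.well-known/openid-configuration` address, and the `iss` claim in ID tokens has to match it as well. Below is a rough sketch of that comparison in Python; the function names are hypothetical, the example URLs are placeholders, and this is not how Opni itself performs the check.

```python
import json
from typing import Optional
from urllib.request import urlopen

WELL_KNOWN_SUFFIX = "/.well-known/openid-configuration"


def discovery_url(issuer: str) -> str:
    """Build the discovery URL for a configured issuer."""
    return issuer.rstrip("/") + WELL_KNOWN_SUFFIX


def fetch_discovery_document(issuer: str) -> dict:
    """Download the discovery document (makes a network request)."""
    with urlopen(discovery_url(issuer), timeout=10) as response:
        return json.load(response)


def issuer_problem(
    configured_issuer: str,
    discovery_doc: dict,
    token_iss: Optional[str] = None,
) -> Optional[str]:
    """Return a description of the first mismatch found, or None."""
    advertised = discovery_doc.get("issuer")
    if advertised != configured_issuer:
        return (
            f"discovery document advertises issuer {advertised!r}, "
            f"expected {configured_issuer!r}"
        )
    if token_iss is not None and token_iss != advertised:
        return f"ID token iss {token_iss!r} does not match {advertised!r}"
    return None


if __name__ == "__main__":
    # Placeholder document; use fetch_discovery_document() for a real one.
    doc = {"issuer": "https://idp.example.com/tenant-a"}
    print(issuer_problem("https://idp.example.com/tenant-a", doc))
    print(issuer_problem("https://idp.example.com/tenant-b", doc))
```

The comparison is deliberately an exact string match, so a trailing slash or a placeholder left in the advertised issuer shows up as a mismatch.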
I think yes?
This is the default link MS provides: https://login.microsoftonline.com/common/v2.0/.well-known/openid-configuration
Theoretically, I can configure that myself in wellKnownConfiguration if you think that would help.
Looks correct to me. Can you check the logs for the cortex-querier pods for any auth-related errors?
You can also get some additional status info by running opni metrics admin status, opni metrics admin list-clusters, and opni metrics admin storage-info <cluster id> from a shell inside the opni-gateway pod.
I don't see a cortex-querier pod at all, only cortex-all-0.
If you install metrics in standalone mode you'll only get one pod, that's normal. Do you see any interesting logs in Cortex when grafana sends queries?
I decided to reinstall the whole Opni cause I had an old version of 0.6.3 and didn't bother with the upgrade. Problem persists.
opni metrics admin status
bash-5.1# opni metrics admin status
Cortex Services
compactor distributor-service ingester-service memberlist-kv querier query-frontend query-frontend-tripperware ring ruler runtime-config server store-gateway store-queryable
Compactor Running Running Running Running Running Running Running Running Running Running Running Running Running
Distributor Running Running Running Running Running Running Running Running Running Running Running Running Running
Ingester Running Running Running Running Running Running Running Running Running Running Running Running Running
Purger Running Running Running Running Running Running Running Running Running Running Running Running Running
Querier Running Running Running Running Running Running Running Running Running Running Running Running Running
Ruler Running Running Running Running Running Running Running Running Running Running Running Running Running
Store Gateway Running Running Running Running Running Running Running Running Running Running Running Running Running
Ingester Ring
ID STATE ADDRESS TIMESTAMP
cortex-all-0 ACTIVE 10.244.2.117:9095 2023-02-13 12:12:39 +0000 UTC
Ruler Ring
ID STATE ADDRESS TIMESTAMP
cortex-all-0 ACTIVE 10.244.2.117:9095 2023-02-13 12:12:39 +0000 UTC
opni metrics admin list-clusters
bash-5.1# opni metrics admin list-clusters
ID LABELS CAPABILITIES STATUS NUM SERIES SAMPLE RATE RULE RATE
2991968b-91b7-4b33-9566-4469a5f494a0 opni.io/name=hidden,opni.io/agent-version=v2 metrics Healthy 62736 2160.5/s 30.7/s
40ea968b-86a0-4e26-af63-b5f3a0df04a7 opni.io/name=dev-test-aks,cluster=hidden,opni.io/agent-version=v2 metrics Healthy 123271 12277.2/s 54.4/s
opni metrics admin storage-info
bash-5.1# opni metrics admin storage-info 2991968b-91b7-4b33-9566-4469a5f494a0
bash-5.1# opni metrics admin storage-info 40ea968b-86a0-4e26-af63-b5f3a0df04a7
NAMESPACE CLUSTER BLOCKS
Logs from opni-gateway when I log in:
2023-02-13T12:50:04Z DEBUG apiext management/extensions.go:236 handling http request {"method": "GetClusterStatus", "path": "/status"}
2023-02-13T12:50:29Z DEBUG x16 api fwd/forwarder.go:100 => {"method": "POST", "path": "/api/prom/api/v1/query", "to": "127.0.0.1:32895 (plugin_metrics)", "for": "10.244.2.118", "host": "opni-internal.opni.svc:8080", "scheme": "https"}
2023-02-13T12:50:36Z DEBUG x17 api fwd/forwarder.go:100 => {"method": "POST", "path": "/api/prom/api/v1/query_range", "to": "127.0.0.1:32895 (plugin_metrics)", "for": "10.244.2.118", "host": "opni-internal.opni.svc:8080", "scheme": "h
2023-02-13T12:50:40Z DEBUG gateway.sync gateway/sync.go:86 sending sync request to agent {"agentId": "2991968b-91b7-4b33-9566-4469a5f494a0", "capabilities": []}
2023-02-13T12:50:45Z INFO plugin.logging.opensearch-manager gateway/admin_v2.go:971 waiting for k8s object
2023-02-13T12:50:45Z INFO plugin.modeltraining gateway/system.go:67 waiting for k8s object
2023-02-13T12:50:46Z DEBUG x17 api fwd/forwarder.go:100 => {"method": "POST", "path": "/api/prom/api/v1/query", "to": "127.0.0.1:32895 (plugin_metrics)", "for": "10.244.2.118", "host": "opni-internal.opni.svc:8080", "scheme": "https"}
There are no logs in cortex-all when I log in to Grafana, only some POST entries for opni-alerting-controller.
Can I test with another identity provider that you have confirmed works? Maybe there is some issue with Azure AD and Opni together?
I've tested auth0 recently and confirmed that works. Unfortunately I don't have access to any Azure AD setups so I can't test that, but if there is a bug in opni preventing it from working I want to make sure we fix it. Theoretically there shouldn't be anything preventing it from working, as long as it conforms to the openid standards.
If you want, you can join the rancher-users Slack (link) and I could help you troubleshoot over a call. Otherwise, there are a few other things you can try:
opni metrics admin query --clusters=all "any promql query" (try "up")
opni metrics ops configure --mode=HighlyAvailable --storage.backend=azure --storage.azure.account-key=xxx --storage.azure.account-name=xxx --storage.azure.container-name=xxx
After working with @Kapsztajn to debug this issue, we discovered that Azure AD might not be OIDC compliant. Will follow up in this thread: https://github.com/MicrosoftDocs/azure-docs/issues/38427
Hi, I'm currently trying to migrate from the noauth to the openid configuration for Grafana in Opni, but I'm having some difficulty accessing cluster information in Grafana. Here is the values file I use with Helm:
I have already added a cluster to Opni with monitoring, which works with noauth:![image](https://user-images.githubusercontent.com/23492161/217093322-88d501e9-d54c-4f96-882e-97b6f3a39d5d.png)
I have configured Roles and a Role Binding:![image](https://user-images.githubusercontent.com/23492161/217093718-d05dbb68-61fd-4030-bf7a-5c6304e7f4ff.png)
Still, when I log in to Grafana, I get errors and cannot see anything:
Am I doing something wrong, or did I miss something? Also, I'm not really sure what roleAttributePath does. What values should I provide there to get the highest permissions?
Thanks for this tool and your time.