Closed vijay-wandb closed 1 month ago
@vijay-wandb - Can you please clarify the issue in this request as per the discussion in this week's operator meeting?
@abhinavg6 I have updated the request
As per @danielpanzella - we can remove these permissions from the console as these are not used. Console uses prometheus to get the required data.
ClusterRole wandb-otel-daemonset requires access to nodes, nodes/spec, and nodes/stats:
{"level":"error","ts":"2024-09-30T14:59:41Z","msg":"Failed to apply config changes.","controller":"weightsandbiases","controllerGroup":"apps.wandb.com","controllerKind":"WeightsAndBiases","WeightsAndBiases":{"name":"wandb","namespace":"test2"},"namespace":"test2","name":"wandb","reconcileID":"f81d3c25-5c7c-4789-8ca9-18880416623f","error":"clusterroles.rbac.authorization.k8s.io \"wandb-otel-daemonset\" is forbidden: user \"system:serviceaccount:test2:wandb-manager\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:test2\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"nodes\"], Verbs:[\"get\" \"list\" \"watch\"]}\n{APIGroups:[\"\"], Resources:[\"nodes/spec\"], Verbs:[\"get\" \"list\" \"watch\"]}\n{APIGroups:[\"\"], Resources:[\"nodes/stats\"], Verbs:[\"get\" \"watch\" \"list\"]}","stacktrace":"github.com/wandb/operator/controllers.(*WeightsAndBiasesReconciler).Reconcile\n\t/workspace/controllers/weightsandbiases_controller.go:249\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:235"}
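To make errors like the one above easier to read, here is a minimal sketch that pulls the "not currently held" rules out of an operator log line. The `LOG_LINE` value is a shortened copy of the log above (stacktrace and some fields dropped), and the helper names `missing_rules` and `_items` are made up for illustration; this is not part of the operator.

```python
import json
import re

# Shortened copy of the operator log line above (stacktrace omitted).
LOG_LINE = r'''{"level":"error","msg":"Failed to apply config changes.","error":"clusterroles.rbac.authorization.k8s.io \"wandb-otel-daemonset\" is forbidden: user \"system:serviceaccount:test2:wandb-manager\" (groups=[\"system:serviceaccounts\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"nodes\"], Verbs:[\"get\" \"list\" \"watch\"]}\n{APIGroups:[\"\"], Resources:[\"nodes/spec\"], Verbs:[\"get\" \"list\" \"watch\"]}\n{APIGroups:[\"\"], Resources:[\"nodes/stats\"], Verbs:[\"get\" \"watch\" \"list\"]}"}'''

# Matches one rule as printed in the error text, e.g.
# {APIGroups:[""], Resources:["nodes"], Verbs:["get" "list"]}
RULE_RE = re.compile(r'\{APIGroups:\[(.*?)\], Resources:\[(.*?)\], Verbs:\[(.*?)\]\}')


def _items(raw):
    """Split a space-separated run of quoted strings into a Python list."""
    return re.findall(r'"([^"]*)"', raw)


def missing_rules(log_line):
    """Return the rules the service account tried to grant but does not hold."""
    error = json.loads(log_line).get("error", "")
    marker = "not currently held:"
    if marker not in error:
        return []
    tail = error.split(marker, 1)[1]
    return [
        {"apiGroups": _items(g), "resources": _items(r), "verbs": _items(v)}
        for g, r, v in RULE_RE.findall(tail)
    ]


for rule in missing_rules(LOG_LINE):
    print(rule["resources"], rule["verbs"])
```

Running this against a full log line from the controller pod should work the same way, since only the `error` field is inspected.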
Vijay to retest after disabling the Otel on the cluster
@vijay-wandb - Were you able to retest?
@abhinavg6 @velotioaastha
I tested with otel disabled, which, as expected, resolved the previous error message about ClusterRole wandb-otel-daemonset needing access to nodes, nodes/spec, and nodes/stats.
However, wandb-console still requires access to the nodes resource. Error message below.
{"level":"error","ts":"2024-10-02T20:17:47Z","msg":"Failed to apply config changes.","controller":"weightsandbiases","controllerGroup":"apps.wandb.com","controllerKind":"WeightsAndBiases","WeightsAndBiases":{"name":"wandb","namespace":"test4"},"namespace":"test4","name":"wandb","reconcileID":"45154c66-1f24-4ae7-96f0-11db9712dd5f","error":"clusterroles.rbac.authorization.k8s.io \"wandb-console\" is forbidden: user \"system:serviceaccount:test4:wandb-manager\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:test4\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"nodes\"], Verbs:[\"get\" \"list\"]}","stacktrace":"github.com/wandb/operator/controllers.(*WeightsAndBiasesReconciler).Reconcile\n\t/workspace/controllers/weightsandbiases_controller.go:249\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:235"}
@vijay-wandb - Are you testing with the latest, like operator-wandb chart 0.18.1+?
I tested with the most recent Operator v1.14.0. Those errors are from the operator log, so the issue has to do with operator, not operator-wandb.
vijayp@Vijay-Panneerselvam-MR6QDWQJ3V test4 % k get pod wandb-controller-manager-69d558d5fb-89x4x -o yaml | grep -i "sha256"
imageID: docker.io/wandb/controller@sha256:5505566eceb90fb208d97021e8a1157d7546ad31ed3937b19944891de300110f
operator-wandb
Hey @vijay-wandb, so you mean we need access to the nodes resource? That means we have to revert this PR's changes.
No, that is not what I'm saying. wandb-console should not require access to the nodes resource. According to Daniel, console uses Prometheus to get the necessary data.
To meet a customer need, I removed the following resources from the wandb-manager-role role, which seems to be propagated to the wandb-console role. We shouldn't see the error, as console shouldn't need access to the nodes resource.
Here are the removed resources:
- nodes
- nodes/metrics
- nodes/spec
- nodes/stats
- nodes/proxy
\"wandb-console\" is forbidden: user \"system:serviceaccount:test4:wandb-manager\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:test4\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"nodes\"], Verbs:[\"get\" \"list\"]}",
cc @danielpanzella if you've thoughts on this?
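For context on why removing resources from the manager's role surfaces this error: Kubernetes RBAC escalation prevention only lets a subject create or update a role if it already holds every permission in that role (or has the `escalate` verb on roles). The sketch below is a simplified model of that check. It ignores wildcards, `resourceNames`, and `nonResourceURLs`, and the two rule sets are hypothetical, heavily trimmed stand-ins rather than the real wandb-manager-role and wandb-console rules.

```python
def expand(rules):
    """Flatten RBAC rules into a set of (apiGroup, resource, verb) triples."""
    return {
        (group, resource, verb)
        for rule in rules
        for group in rule["apiGroups"]
        for resource in rule["resources"]
        for verb in rule["verbs"]
    }


def ungranted(holder_rules, requested_rules):
    """Return the requested triples that the holder does not itself hold."""
    return sorted(expand(requested_rules) - expand(holder_rules))


# Hypothetical rule sets for illustration only.
manager_rules = [
    {"apiGroups": [""], "resources": ["pods", "services"], "verbs": ["get", "list", "watch"]},
]
console_rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]},
    {"apiGroups": [""], "resources": ["nodes"], "verbs": ["get", "list"]},
]

# Anything printed here is what the API server would refuse to let the manager grant.
print(ungranted(manager_rules, console_rules))
```

In this model the fix can come from either side: give the manager the node permissions back, or stop the console role from requesting them.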
Hi @vijay-wandb ,
So far we have tried the following.
Steps:
` - nodes
wandb-app logs:
{"level":"INFO","time":"2024-10-07T07:26:18.708617799Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:263","pid":1062},"data":{"dd.service":"gorilla","dd.version":"84e77719c0b18a6ee8ba2750b4f9abad7fc12286","authUser":"aman","userID":6,"authUser":"aman","userID":6,"operationName":"SecureStorageConnectorEnabled","withAdminPrivileges":true,"defaultEntityID":6,"latencyNs":477712,"statusCode":200,"operationName":"SecureStorageConnectorEnabled","authUser":"aman","variables":{},"latencyStr":"477.712µs"},"message":"Graphql operation SecureStorageConnectorEnabled for user aman with variables map[] finished in 477.712µs","dd.trace_id":"10223050220714882791"}
wandb-console logs:
`> console@0.1.0 job /app
cross-env NODE_ENV=production node -r ts-node/register/transpile-only -r tsconfig-paths/register dist/jobs.js "upsert-password"
2024-10-07 07:17:33 [42] [info]: Running job: upsert-password
2024-10-07 07:17:33 [42] [info]: Job upsert-password completed
▲ Next.js 14.1.0
Local: http://localhost:8082/
✓ Ready in 1039ms`
Could you let us know exactly where you are facing this issue?
@amanpruthi
The error happens when the controller deploys the operator-wandb helm charts.
To reproduce that issue, you need to start from scratch.
1) Extract the manifests:
helm template operator wandb/operator --namespace test --output-dir ./test
2) cd to the folder containing the manifests
3) Edit the wandb-manager-role ClusterRole by removing the permissions below:
- nodes
- nodes/metrics
- nodes/spec
- nodes/stats
- nodes/proxy
4) From that folder, apply the manifests:
k apply -f .
5) You should then see the error in the controller logs
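The manual edit in step 3 could also be scripted. Here is a rough sketch that strips the node resources from a ClusterRole once it has been loaded into a Python dict (for example with a YAML parser, which is not shown). The `example_role` manifest is a hypothetical, trimmed stand-in; the real wandb-manager-role has many more rules.

```python
import copy

# Resources removed from wandb-manager-role in the steps above.
NODE_RESOURCES = {"nodes", "nodes/metrics", "nodes/spec", "nodes/stats", "nodes/proxy"}


def strip_node_resources(cluster_role):
    """Return a copy of a ClusterRole dict with node resources removed from every rule."""
    role = copy.deepcopy(cluster_role)
    kept = []
    for rule in role.get("rules", []):
        if "resources" not in rule:
            kept.append(rule)  # e.g. nonResourceURLs rules are left untouched
            continue
        remaining = [r for r in rule["resources"] if r not in NODE_RESOURCES]
        if remaining:  # drop rules that only covered node resources
            kept.append({**rule, "resources": remaining})
    role["rules"] = kept
    return role


# Hypothetical, trimmed manifest for illustration only.
example_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "wandb-manager-role"},
    "rules": [
        {"apiGroups": [""], "resources": ["pods", "nodes", "nodes/stats"], "verbs": ["get", "list", "watch"]},
        {"apiGroups": [""], "resources": ["nodes/proxy"], "verbs": ["get"]},
    ],
}

stripped = strip_node_resources(example_role)
print(stripped["rules"])
```

The input dict is not modified, so the original manifest can be kept around for comparison.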
I tested with the most recent Operator v1.14.0...those errors are from the operator log, so the issue has to do with operator not operator-wandb
@vijay-wandb This is an incorrect statement. The issue is that you manually removed permissions from the operator's role, so it can no longer grant those permissions to the wandb-console role. @amanpruthi removed the node permissions request from operator-wandb here; can you please confirm you are testing with an operator-wandb chart later than 0.18.1?
As discussed with @vijay-wandb, the issue is now solved and we are closing this.
Thank you very much Aman and Aastha for hopping on a call and resolving the problem.
The release channel is mapped to operator-wandb-0.17.9, and my testing used this default chart. I'm good now after manually pointing to the latest chart.
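Since the root cause was the chart version, a quick check like the sketch below could catch this earlier. It assumes the fix is present from 0.18.1 onward, based on the "0.18.1+" mentioned earlier in this thread; that threshold and the helper names are assumptions, not something confirmed by release notes.

```python
import re

# Assumption: taken from the "0.18.1+" mentioned in this thread.
MIN_FIXED = (0, 18, 1)


def chart_version(chart):
    """Parse the trailing x.y.z from a chart name such as 'operator-wandb-0.17.9'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)$", chart)
    if match is None:
        raise ValueError(f"no version found in {chart!r}")
    return tuple(int(part) for part in match.groups())


def has_node_permission_fix(chart):
    """True if the chart is at or above the assumed fixed version."""
    return chart_version(chart) >= MIN_FIXED


for name in ("operator-wandb-0.17.9", "operator-wandb-0.18.1"):
    print(name, has_node_permission_fix(name))
```

Comparing integer tuples avoids the string-comparison trap where "0.9.0" would sort after "0.18.1".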
The wandb-manager role requires access to the following resources:
But Ford's central infra team restricts access to these resources for individual teams due to security policies.
Here is the error message from the wandb controller pod log:
{"level":"error","ts":"2024-08-28T17:26:03Z","msg":"Failed to apply config changes.","controller":"weightsandbiases","controllerGroup":"apps.wandb.com","controllerKind":"WeightsAndBiases","WeightsAndBiases":{"name":"wandb","namespace":"gdia-wandb"},"namespace":"gdia-wandb","name":"wandb","reconcileID":"5dce1ea1-c190-41af-8d20-086c00dcb4aa","error":"roles.rbac.authorization.k8s.io \"wandb-console\" is forbidden: user \"system:serviceaccount:gdia-wandb:wandb-manager\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:gdia-wandb\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"nodes\"], Verbs:[\"get\" \"list\"]}","stacktrace":"github.com/wandb/operator/controllers.(*WeightsAndBiasesReconciler).Reconcile\n\t/workspace/controllers/weightsandbiases_controller.go:193\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:235"}
Is it possible to introduce an option to disable node-level metrics collection? Enabling this option would prevent the wandb-console from attempting to collect or display node-level metrics, allowing the Operator to work in environments like Ford's where access to node info is restricted.