wandb / operator

Operator Feature Request | Disable permissions for node-level metrics and logs in wandb-console #19

Closed vijay-wandb closed 1 month ago

vijay-wandb commented 3 months ago

The wandb-manager role requires access to the following resources:

nodes
nodes/metrics
nodes/spec
nodes/stats
nodes/proxy

However, Ford's central infrastructure team restricts access to these resources for individual teams because of security policies.

Here is the error message from the wandb controller pod log:

{"level":"error","ts":"2024-08-28T17:26:03Z","msg":"Failed to apply config changes.","controller":"weightsandbiases","controllerGroup":"apps.wandb.com","controllerKind":"WeightsAndBiases","WeightsAndBiases":{"name":"wandb","namespace":"gdia-wandb"},"namespace":"gdia-wandb","name":"wandb","reconcileID":"5dce1ea1-c190-41af-8d20-086c00dcb4aa","error":"roles.rbac.authorization.k8s.io \"wandb-console\" is forbidden: user \"system:serviceaccount:gdia-wandb:wandb-manager\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:gdia-wandb\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"nodes\"], Verbs:[\"get\" \"list\"]}","stacktrace":"github.com/wandb/operator/controllers.(*WeightsAndBiasesReconciler).Reconcile\n\t/workspace/controllers/weightsandbiases_controller.go:193\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:235"}

Is it possible to introduce an option to disable node-level metrics collection? Enabling this option would prevent the wandb-console from attempting to collect or display node-level metrics, allowing the Operator to work in environments like Ford's where access to node info is restricted.

abhinavg6 commented 2 months ago

@vijay-wandb - Can you please clarify the issue in this request as per the discussion in this week's operator meeting?

vijay-wandb commented 2 months ago

@abhinavg6 I have updated the request

abhinavg6 commented 2 months ago

Per @danielpanzella, we can remove these permissions from the console because they are not used. The console uses Prometheus to get the required data.

amanpruthi commented 2 months ago

https://github.com/wandb/helm-charts/pull/219

vijay-wandb commented 2 months ago

ClusterRole wandb-otel-daemonset requires access to nodes, nodes/spec, and nodes/stats:

{"level":"error","ts":"2024-09-30T14:59:41Z","msg":"Failed to apply config changes.","controller":"weightsandbiases","controllerGroup":"apps.wandb.com","controllerKind":"WeightsAndBiases","WeightsAndBiases":{"name":"wandb","namespace":"test2"},"namespace":"test2","name":"wandb","reconcileID":"f81d3c25-5c7c-4789-8ca9-18880416623f","error":"clusterroles.rbac.authorization.k8s.io \"wandb-otel-daemonset\" is forbidden: user \"system:serviceaccount:test2:wandb-manager\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:test2\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"nodes\"], Verbs:[\"get\" \"list\" \"watch\"]}\n{APIGroups:[\"\"], Resources:[\"nodes/spec\"], Verbs:[\"get\" \"list\" \"watch\"]}\n{APIGroups:[\"\"], Resources:[\"nodes/stats\"], Verbs:[\"get\" \"watch\" \"list\"]}","stacktrace":"github.com/wandb/operator/controllers.(*WeightsAndBiasesReconciler).Reconcile\n\t/workspace/controllers/weightsandbiases_controller.go:249\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:235"}
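
For anyone reading these controller logs, the missing permissions can be pulled out of the "error" field mechanically. The following is a minimal sketch, assuming the log line is valid JSON and that the API server lists each rule in the `{APIGroups:[...], Resources:[...], Verbs:[...]}` form shown above. The function names and the abbreviated `sample` line are illustrative only and are not part of the operator.

```python
import json
import re

# One "{APIGroups:[...], Resources:[...], Verbs:[...]}" entry in the error text.
RULE_RE = re.compile(r'\{APIGroups:\[(.*?)\], Resources:\[(.*?)\], Verbs:\[(.*?)\]\}')


def _quoted(items):
    """Return the double-quoted tokens of a Go-style list such as '"get" "list"'."""
    return re.findall(r'"([^"]*)"', items)


def missing_rules(log_line):
    """Extract the rules the operator tried to grant but does not hold itself."""
    error = json.loads(log_line).get("error", "")
    return [
        {"apiGroups": _quoted(g), "resources": _quoted(r), "verbs": _quoted(v)}
        for g, r, v in RULE_RE.findall(error)
    ]


# Abbreviated stand-in for a controller log line (assumption: real lines carry
# more fields, which this sketch ignores).
sample = json.dumps({
    "level": "error",
    "error": (
        'clusterroles.rbac.authorization.k8s.io "wandb-otel-daemonset" is forbidden: '
        "attempting to grant RBAC permissions not currently held:\n"
        '{APIGroups:[""], Resources:["nodes"], Verbs:["get" "list" "watch"]}\n'
        '{APIGroups:[""], Resources:["nodes/stats"], Verbs:["get" "watch" "list"]}'
    ),
})

for rule in missing_rules(sample):
    print(rule["resources"], rule["verbs"])
```

Each returned entry names a resource and the verbs that the wandb-manager service account would itself need before it could create the role.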

abhinavg6 commented 2 months ago

Vijay to retest after disabling the Otel on the cluster

abhinavg6 commented 1 month ago

@vijay-wandb - Were you able to retest?

vijay-wandb commented 1 month ago

@abhinavg6 @velotioaastha

I tested with OTel disabled, which, as expected, resolved the previous error about ClusterRole wandb-otel-daemonset needing access to nodes, nodes/spec, and nodes/stats.

However, wandb-console still requires access to the nodes resource. The error message is below.

{"level":"error","ts":"2024-10-02T20:17:47Z","msg":"Failed to apply config changes.","controller":"weightsandbiases","controllerGroup":"apps.wandb.com","controllerKind":"WeightsAndBiases","WeightsAndBiases":{"name":"wandb","namespace":"test4"},"namespace":"test4","name":"wandb","reconcileID":"45154c66-1f24-4ae7-96f0-11db9712dd5f","error":"clusterroles.rbac.authorization.k8s.io \"wandb-console\" is forbidden: user \"system:serviceaccount:test4:wandb-manager\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:test4\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"nodes\"], Verbs:[\"get\" \"list\"]}","stacktrace":"github.com/wandb/operator/controllers.(*WeightsAndBiasesReconciler).Reconcile\n\t/workspace/controllers/weightsandbiases_controller.go:249\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:235"}

abhinavg6 commented 1 month ago

@vijay-wandb - Are you testing with the latest, like operator-wandb chart 0.18.1+?

vijay-wandb commented 1 month ago

I tested with the most recent Operator v1.14.0...those errors are from the operator log, so the issue has to do with operator not operator-wandb

vijayp@Vijay-Panneerselvam-MR6QDWQJ3V test4 % k get pod wandb-controller-manager-69d558d5fb-89x4x -o yaml | grep -i "sha256"
    imageID: docker.io/wandb/controller@sha256:5505566eceb90fb208d97021e8a1157d7546ad31ed3937b19944891de300110f

velotioaastha commented 1 month ago

operator-wandb

Hey @vijay-wandb, so you mean we need access to the nodes resource? That means we have to revert the changes in this PR.

vijay-wandb commented 1 month ago

No, that is not what I'm saying. wandb-console should not require access to the nodes resource. According to Daniel, the console uses Prometheus to get the necessary data.

To meet a customer need, I removed the following resources from the wandb-manager-role role, which seems to be propagated to the wandb-console role. We shouldn't see the error, as the console shouldn't need access to the nodes resource.

Here are the removed resources:

    - nodes
    - nodes/metrics
    - nodes/spec
    - nodes/stats
    - nodes/proxy

\"wandb-console\" is forbidden: user \"system:serviceaccount:test4:wandb-manager\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:test4\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"nodes\"], Verbs:[\"get\" \"list\"]}",

abhinavg6 commented 1 month ago

cc @danielpanzella if you've thoughts on this?

amanpruthi commented 1 month ago

Hi @vijay-wandb ,

So far we have tried the following.

Steps:

  1. Deployed a test cluster and the Helm charts.
  2. Edited the wandb-manager-role ClusterRole by removing the permission below, then restarted wandb-controller:

    - nodes

  3. Restarted wandb-console.
  4. No errors were reported in the console logs or in the wandb app.

wandb-app logs:

{"level":"INFO","time":"2024-10-07T07:26:18.708617799Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:263","pid":1062},"data":{"dd.service":"gorilla","dd.version":"84e77719c0b18a6ee8ba2750b4f9abad7fc12286","authUser":"aman","userID":6,"authUser":"aman","userID":6,"operationName":"SecureStorageConnectorEnabled","withAdminPrivileges":true,"defaultEntityID":6,"latencyNs":477712,"statusCode":200,"operationName":"SecureStorageConnectorEnabled","authUser":"aman","variables":{},"latencyStr":"477.712µs"},"message":"Graphql operation SecureStorageConnectorEnabled for user aman with variables map[] finished in 477.712µs","dd.trace_id":"10223050220714882791"}

wandb-console logs:

    > console@0.1.0 job /app
    cross-env NODE_ENV=production node -r ts-node/register/transpile-only -r tsconfig-paths/register dist/jobs.js "upsert-password"

    2024-10-07 07:17:33 [42] [info]: Running job: upsert-password
    2024-10-07 07:17:33 [42] [info]: Job upsert-password completed
    ▲ Next.js 14.1.0

Please let us know where exactly you are seeing this issue.

vijay-wandb commented 1 month ago

@amanpruthi The error happens when the controller deploys the operator-wandb Helm charts. To reproduce the issue, you need to start from scratch.

1) Extract the manifests: helm template operator wandb/operator --namespace test --output-dir ./test

2) cd to the folder containing the manifests

3) Edit the wandb-manager-role ClusterRole by removing the permissions below:

    - nodes
    - nodes/metrics
    - nodes/spec
    - nodes/stats
    - nodes/proxy

4) From that folder, apply the manifests: k apply -f .

5) You should then see the error in the controller logs
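
The manual edit of the ClusterRole can also be scripted when repeating the reproduction. The following is a minimal sketch, assuming the manifest has already been loaded into a Python dict (for example with a YAML parser, which is not shown here). The `strip_node_resources` helper and the example rules are hypothetical; the real rules in wandb-manager-role are not listed in this thread. Note that the sketch matches on resource names only and ignores apiGroups.

```python
# Resources removed from wandb-manager-role in the reproduction steps.
NODE_RESOURCES = {"nodes", "nodes/metrics", "nodes/spec", "nodes/stats", "nodes/proxy"}


def strip_node_resources(cluster_role):
    """Return a copy of the role with node-level resources removed from its rules."""
    rules = []
    for rule in cluster_role.get("rules", []):
        if "resources" not in rule:
            # Rules without a resources list (e.g. nonResourceURLs) are kept as-is.
            rules.append(rule)
            continue
        kept = [r for r in rule["resources"] if r not in NODE_RESOURCES]
        if kept:
            # Rules left with no resources at all are dropped entirely.
            rules.append({**rule, "resources": kept})
    return {**cluster_role, "rules": rules}


# Hypothetical manifest content, for illustration only.
example_role = {
    "kind": "ClusterRole",
    "metadata": {"name": "wandb-manager-role"},
    "rules": [
        {"apiGroups": [""], "resources": ["pods", "nodes", "nodes/stats"], "verbs": ["get", "list"]},
        {"apiGroups": [""], "resources": ["nodes/proxy"], "verbs": ["get"]},
    ],
}

print(strip_node_resources(example_role)["rules"])
```

The input dict is left unmodified, so the original and stripped roles can be compared side by side.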

danielpanzella commented 1 month ago

> I tested with the most recent Operator v1.14.0...those errors are from the operator log, so the issue has to do with operator not operator-wandb

@vijay-wandb This is an incorrect statement. The issue is that you manually removed permissions from the operator's role, so it can no longer grant those permissions to the wandb-console role. @amanpruthi removed the node permissions request from operator-wandb here. Can you please confirm you are testing with an operator-wandb chart later than 0.18.1?
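
The behavior described here comes from Kubernetes RBAC escalation prevention: a subject can only create or update a role whose permissions it already holds itself. The following is a simplified sketch of that check, not the API server's actual implementation. It ignores resourceNames, nonResourceURLs, the `escalate` verb, and partial wildcards, and the rule contents are hypothetical, since the real role definitions are not shown in this thread.

```python
def covers(held, api_group, resource, verb):
    """True if one held rule allows `verb` on `resource` in `api_group`."""
    def match(values, wanted):
        return "*" in values or wanted in values

    return (match(held["apiGroups"], api_group)
            and match(held["resources"], resource)
            and match(held["verbs"], verb))


def not_currently_held(held_rules, requested_rules):
    """List (apiGroup, resource, verb) triples requested but not covered by held rules."""
    missing = []
    for rule in requested_rules:
        for group in rule["apiGroups"]:
            for resource in rule["resources"]:
                for verb in rule["verbs"]:
                    if not any(covers(h, group, resource, verb) for h in held_rules):
                        missing.append((group, resource, verb))
    return missing


# Hypothetical rules: the manager role after node access was removed by hand...
manager_rules = [
    {"apiGroups": [""], "resources": ["pods", "services"], "verbs": ["get", "list", "watch"]},
]
# ...and a console role from an older chart that still requests node access.
console_rules = [
    {"apiGroups": [""], "resources": ["nodes"], "verbs": ["get", "list"]},
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get"]},
]

print(not_currently_held(manager_rules, console_rules))
```

With these assumed rules, the only triples reported are `get` and `list` on `nodes`, which mirrors the error in the controller log. Once the chart no longer requests node access, the list is empty and the role can be created.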

amanpruthi commented 1 month ago

As discussed with @vijay-wandb the issue is now solved and we are closing this

vijay-wandb commented 1 month ago

Thank you very much Aman and Aastha for hopping on a call and resolving the problem. The release channel is mapped to operator-wandb-0.17.9, and my testing used this default chart. I'm good now after manually pointing to the latest chart.