kubeflow / manifests

A repository for Kustomize manifests
Apache License 2.0
RBAC: access denied on central dashboard #2832

Closed pritamdodeja closed 2 weeks ago

pritamdodeja commented 1 month ago

Version

1.9

Describe your issue

After installation, I ran:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

When I then open localhost:8080, I do not get any response.

When I try to go to the central dashboard, I get "RBAC: access denied".

Steps to reproduce the issue

Create a default storage class with rook-ceph. Then follow the instructions from https://github.com/kubeflow/manifests?tab=readme-ov-file#upgrading-and-extending after checking out the 1.9 release from the manifests repo.

Possibly related, I am seeing:

kubectl get pods --all-namespaces | grep -vi Running
NAMESPACE      NAME                                            READY   STATUS      RESTARTS      AGE
istio-system   kubeflow-m2m-oidc-configurator-28711075-5ktvt   0/1     Error       1 (13s ago)   18s
rook-ceph      rook-ceph-osd-prepare-distml-6f6kh              0/1     Completed   0             92m

juliusvonkohout commented 1 month ago

You need to check the pod logs. Our tutorial is for Kind and it might be different on other Kubernetes cluster types.
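
One way to act on this, as a minimal sketch: filter the pod listing for pods that are neither Running nor Completed and print a `kubectl logs` command for each. The function name `failing_pod_log_cmds` is made up for illustration, and the sketch assumes the default column layout (NAMESPACE, NAME, READY, STATUS, ...) of `kubectl get pods --all-namespaces`.

```shell
# Hypothetical helper (not part of kubectl or the manifests repo).
# Reads `kubectl get pods --all-namespaces` output on stdin and prints one
# `kubectl logs` command per pod whose STATUS is neither Running nor Completed.
failing_pod_log_cmds() {
  awk '$1 != "NAMESPACE" && NF >= 4 && $4 != "Running" && $4 != "Completed" {
    printf "kubectl logs -n %s %s\n", $1, $2
  }'
}

# Usage against a live cluster (assumes kubectl is configured):
#   kubectl get pods --all-namespaces | failing_pod_log_cmds
```

For the listing posted in this issue, that would single out the kubeflow-m2m-oidc-configurator pod in istio-system and skip the Completed rook-ceph pod.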

pritamdodeja commented 1 month ago

I have set this up in kind as well as k8s with version 1.9.0. Kind is working as expected. Will see if I can figure out what the delta is. Would appreciate any direction you can provide. Thank you!

juliusvonkohout commented 1 month ago

The pod kubeflow-m2m-oidc-configurator-28711075-5ktvt in the istio-system namespace must be checked and fixed. There is commercial consulting and there are commercial distributions available if you are interested.

juliusvonkohout commented 4 weeks ago

Please check https://github.com/kubeflow/manifests/issues/2840 as well

thesuperzapper commented 3 weeks ago

@pritamdodeja by any chance are you using EKS?

pritamdodeja commented 3 weeks ago

> @pritamdodeja by any chance are you using EKS?

I'm using k8s on Fedora 40 locally. The machine has two GPUs, and I'm hoping to get the Flink operator to do some distributed processing with TFX pipelines. A pipe dream maybe, but that's the goal :)

thesuperzapper commented 3 weeks ago

@pritamdodeja @juliusvonkohout my bet is that because kubectl apply does not clean up removed resources, people are leaving old AuthorizationPolicy resources which are breaking the new oauth2-proxy based auth.

We need to give people a command to remove the ones from <1.8.0 so they don't all run into this issue.

This is part of why I made deployKF, because there is really no safe upgrade path without using something like ArgoCD to manage the cleanup of resources.


To help people clean up extra AuthorizationPolicies, here is a list of all the ones from a stock 1.9.0 install on my test cluster:

> kubectl get authorizationpolicy --all-namespaces
NAMESPACE                   NAME                                ACTION   AGE
istio-system                cluster-local-gateway               ALLOW    24h
istio-system                global-deny-all                              24h
istio-system                istio-ingressgateway                ALLOW    24h
istio-system                istio-ingressgateway-oauth2-proxy   CUSTOM   24h
knative-serving             activator-service                   ALLOW    24h
knative-serving             autoscaler                          ALLOW    24h
knative-serving             controller                          ALLOW    24h
knative-serving             istio-webhook                       ALLOW    24h
knative-serving             webhook                             ALLOW    24h
kubeflow-user-example-com   ml-pipeline-visualizationserver              24h
kubeflow-user-example-com   ns-owner-access-istio                        24h
kubeflow                    central-dashboard                   ALLOW    24h
kubeflow                    jupyter-web-app                     ALLOW    24h
kubeflow                    katib-ui                            ALLOW    24h
kubeflow                    kserve-models-web-app               ALLOW    24h
kubeflow                    metadata-grpc-service               ALLOW    24h
kubeflow                    minio-service                       ALLOW    24h
kubeflow                    ml-pipeline                                  24h
kubeflow                    ml-pipeline-ui                               24h
kubeflow                    ml-pipeline-visualizationserver              24h
kubeflow                    mysql                                        24h
kubeflow                    profiles-kfam                       ALLOW    24h
kubeflow                    service-cache-server                         24h
kubeflow                    tensorboards-web-app                ALLOW    24h
kubeflow                    volumes-web-app                     ALLOW    24h
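
As a rough sketch of how this list could be used: save the listing above to a file, then print every AuthorizationPolicy on the cluster whose namespace/name pair is not in it. The function name `extra_authorization_policies` and the file names are made up for illustration. Anything it prints is only a candidate for manual review, not something to delete blindly, since additional profile namespaces carry their own policies.

```shell
# Hypothetical helper (not part of kubectl or the manifests repo).
# $1: file holding the stock listing (NAMESPACE NAME ... columns)
# $2: file holding the current `kubectl get authorizationpolicy --all-namespaces` output
# Prints "namespace name" for each policy in $2 that is absent from $1.
extra_authorization_policies() {
  awk '
    $1 == "NAMESPACE" || NF < 2 { next }   # skip header and blank lines
    FILENAME == ARGV[1] { stock[$1 " " $2] = 1; next }
    !(($1 " " $2) in stock) { print $1, $2 }
  ' "$1" "$2"
}

# Usage against a live cluster (assumes kubectl is configured):
#   kubectl get authorizationpolicy --all-namespaces > current.txt
#   extra_authorization_policies stock-1.9.0.txt current.txt
```

Only the first two columns are compared, so differences in ACTION or AGE do not affect the result.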

juliusvonkohout commented 3 weeks ago

Well you can use labels and pruning as mentioned in the readme to get it done. But these are only rough guidelines so far, not detailed enough for new users.

Given some volunteers to work on it we could provide detailed upgrade instructions.

We can include a few upgrade commands in the readme.

pritamdodeja commented 3 weeks ago

My situation actually is a new install, and I set up a default storage class (rook-ceph) as listed in the documentation. I'd love to help out in whichever way possible. I do have pretty good Linux knowledge and have used KubeflowDagRunner to port local TFX pipelines to Vertex etc., and have also just finished the CNCF class on Kubeflow pipelines. Thank you both!

thesuperzapper commented 3 weeks ago

> Well you can use labels and pruning as mentioned in the readme to get it done. But these are only rough guidelines so far, not detailed enough for new users.
>
> Given some volunteers to work on it we could provide detailed upgrade instructions.
>
> We can include a few upgrade commands in the readme.

@juliusvonkohout I still believe the manifests should be aimed at distribution vendors and highly advanced users who want to effectively roll their own distribution.

As soon as you start talking about opinionated ways to do updates, you probably are better off making your own distribution based on the manifests and advertising it to users to let the market decide which approach is best.

I'm not saying we can't list some basic suggestions, but it's hard to imagine a proper update solution that wouldn't become so opinionated as to make the manifests less useful to downstream vendors.

For the vast majority of users, an opinionated Kubeflow distribution from a vendor they know will keep maintaining it is going to save them a lot of pain, and may be the difference between using Kubeflow or not.

thesuperzapper commented 3 weeks ago

> My situation actually is a new install, and I set up a default storage class (rook-ceph) as listed in the documentation. I'd love to help out in whichever way possible. I do have pretty good Linux knowledge and have used KubeflowDagRunner to port local TFX pipelines to Vertex etc., and have also just finished the CNCF class on Kubeflow pipelines. Thank you both!

@pritamdodeja then you might just have some other issue with your cluster, especially because you're also seeing the CronJob fail.

We are working on a fix that removes the need for the CronJob in https://github.com/kubeflow/manifests/issues/2850, if you want to try it out.

That said, the failing job should not prevent you from accessing the central dashboard, so perhaps your cluster is just running out of resources or open file descriptors?

juliusvonkohout commented 2 weeks ago

Let's merge this into https://github.com/kubeflow/manifests/issues/2850