shikanon / kubeflow-manifests

One-click Kubeflow installation files for users in mainland China
GNU General Public License v3.0
337 stars 117 forks

profiles-deployment CrashLoopBackOff #71

Closed DC-y closed 2 years ago

DC-y commented 2 years ago

Hi, I followed the steps in this project and have run into a problem. Could you please take a look? All pods are currently Running except for profiles-deployment-f7bfd656-b2jgq.

# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.11", GitCommit:"d94a81c724ea8e1ccc9002d89b7fe81d58f89ede", GitTreeState:"clean", BuildDate:"2020-03-12T21:00:06Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.11", GitCommit:"d94a81c724ea8e1ccc9002d89b7fe81d58f89ede", GitTreeState:"clean", BuildDate:"2020-03-12T21:00:06Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
NAMESPACE          NAME                                                         READY   STATUS             RESTARTS   AGE
auth               dex-558849cf5c-dn8bf                                         1/1     Running            0          3d18h
cert-manager       cert-manager-58dc4c9cdc-g8qnv                                1/1     Running            0          3d18h
cert-manager       cert-manager-cainjector-6cdd54b8b9-7jd65                     1/1     Running            0          3d18h
cert-manager       cert-manager-webhook-75f8c4594d-qs9wx                        1/1     Running            2          3d18h
default            nfs-client-provisioner-6fcddc6c4b-dldf4                      1/1     Running            0          3h9m
istio-system       authservice-0                                                1/1     Running            0          3h1m
istio-system       cluster-local-gateway-5d6b5cd5c4-4s7h5                       1/1     Running            0          3d18h
istio-system       istio-ingressgateway-55ccf6f989-nxdcl                        1/1     Running            0          3d18h
istio-system       istiod-69ddfb489d-cx7cj                                      1/1     Running            0          3d18h
knative-eventing   broker-controller-6c8d4f7f4c-hcwwb                           1/1     Running            0          3d18h
knative-eventing   eventing-controller-5769c9d8d4-zcxts                         1/1     Running            0          3d18h
knative-eventing   eventing-webhook-6c8b77869d-2fz9k                            1/1     Running            0          3d18h
knative-eventing   imc-controller-76cd88c474-qgd6x                              1/1     Running            0          3d18h
knative-eventing   imc-dispatcher-66d768c7d4-skgqv                              1/1     Running            0          3d18h
knative-serving    activator-6489798c75-q6c4v                                   1/1     Running            0          3d18h
knative-serving    autoscaler-6796678ccf-brfqr                                  1/1     Running            1          3d18h
knative-serving    controller-dc65b7959-xg58k                                   1/1     Running            0          3d18h
knative-serving    istio-webhook-78f4bdbc64-twxxc                               1/1     Running            0          3d18h
knative-serving    networking-istio-5dd75ff966-8p9dr                            1/1     Running            0          3d18h
knative-serving    webhook-bdb869c55-tp4tf                                      1/1     Running            0          3d18h
kube-system        calico-kube-controllers-687d8b5b69-4t8k8                     1/1     Running            0          3d20h
kube-system        calico-node-2tjt5                                            1/1     Running            0          3d20h
kube-system        calico-node-bdns9                                            1/1     Running            1          3d20h
kube-system        coredns-67db989964-6k24g                                     1/1     Running            0          3d20h
kube-system        dns-autoscaler-546df4cf94-9qj5r                              1/1     Running            0          3d20h
kube-system        k8s-host-device-plugin-daemonset-sb57p                       1/1     Running            0          3d19h
kube-system        kube-apiserver-pai-master                                    1/1     Running            0          3d20h
kube-system        kube-controller-manager-pai-master                           1/1     Running            0          3d20h
kube-system        kube-proxy-26489                                             1/1     Running            1          3d20h
kube-system        kube-proxy-xdz7n                                             1/1     Running            0          3d20h
kube-system        kube-scheduler-pai-master                                    1/1     Running            0          3d20h
kube-system        nginx-proxy-node03                                           1/1     Running            1          3d19h
kube-system        nvidia-device-plugin-daemonset-d629r                         1/1     Running            0          3d19h
kubeflow           admission-webhook-deployment-84978df699-s9z8f                1/1     Running            0          3d18h
kubeflow           cache-deployer-deployment-656fbfdff6-fx2wk                   2/2     Running            8          3d18h
kubeflow           cache-server-855f55b49c-fqgg5                                2/2     Running            0          3d18h
kubeflow           centraldashboard-5446d8d7b6-fxg9k                            1/1     Running            0          3d18h
kubeflow           jupyter-web-app-deployment-5bb47f59f-jwsc4                   1/1     Running            0          3d18h
kubeflow           katib-controller-6698ccdd4f-wb9xx                            1/1     Running            0          3d18h
kubeflow           katib-db-manager-96f6d6fd7-qwwmd                             1/1     Running            1          3h2m
kubeflow           katib-mysql-5d6dc57fc9-r4blx                                 1/1     Running            0          3h3m
kubeflow           katib-ui-65b974b9b5-z2v44                                    1/1     Running            0          3d18h
kubeflow           kubeflow-pipelines-profile-controller-54488b445c-f59pz       1/1     Running            0          3d18h
kubeflow           metacontroller-0                                             1/1     Running            0          3d18h
kubeflow           metadata-envoy-deployment-5b78576c47-844xv                   1/1     Running            0          3d18h
kubeflow           metadata-grpc-deployment-54489d4c97-nzmp8                    2/2     Running            8          3d18h
kubeflow           metadata-writer-76654bcc64-5lvrf                             2/2     Running            11         3d18h
kubeflow           minio-564bd4dc95-n4knl                                       2/2     Running            0          3d18h
kubeflow           ml-pipeline-577cd87d6f-htclv                                 2/2     Running            5          3d18h
kubeflow           ml-pipeline-persistenceagent-8594df68dc-c99pl                2/2     Running            1          3d18h
kubeflow           ml-pipeline-scheduledworkflow-7c66fbdc7f-fgrxd               2/2     Running            0          3d18h
kubeflow           ml-pipeline-ui-7586d978c6-x25d8                              2/2     Running            0          3d18h
kubeflow           ml-pipeline-viewer-crd-54f9b5dd7d-pqhcm                      2/2     Running            7          3d18h
kubeflow           ml-pipeline-visualizationserver-5fb899f7dc-g6dkl             2/2     Running            0          3d18h
kubeflow           mpi-operator-7c9fd9c6c7-qmndc                                1/1     Running            1          3d18h
kubeflow           mxnet-operator-6469447c4c-qvq7w                              1/1     Running            1          3d18h
kubeflow           mysql-6459d667c8-rvmc2                                       2/2     Running            0          3d18h
kubeflow           notebook-controller-deployment-8487fdffd7-b6x8z              1/1     Running            0          3d18h
kubeflow           profiles-deployment-f7bfd656-6klbv                           1/2     CrashLoopBackOff   6          9m18s
kubeflow           pytorch-operator-5786d464fb-lgjtd                            2/2     Running            2          3d18h
kubeflow           tensorboard-controller-controller-manager-6b8bb64848-k75jd   3/3     Running            14         3d18h
kubeflow           tensorboards-web-app-deployment-6dff9b4779-xn6k2             1/1     Running            0          3d18h
kubeflow           tf-job-operator-847cc955bf-d2j26                             1/1     Running            3          3d18h
kubeflow           volumes-web-app-deployment-55469959d8-bw75h                  1/1     Running            0          3d18h
kubeflow           workflow-controller-6977465d7b-cw92d                         2/2     Running            8          3d18h
kubeflow           xgboost-operator-deployment-67dd577579-8pbfb                 2/2     Running            9          3d18h

describe pod shows:

Events:
  Type     Reason     Age                   From                 Message
  ----     ------     ----                  ----                 -------
  Normal   Scheduled  18m                   default-scheduler    Successfully assigned kubeflow/profiles-deployment-f7bfd656-b2jgq to pai-master
  Normal   Pulling    17m                   kubelet, pai-master  Pulling image "registry.cn-shenzhen.aliyuncs.com/tensorbytes/notebooks-access-management:v1.3.0-rc.0-a869b"
  Normal   Pulled     17m                   kubelet, pai-master  Successfully pulled image "registry.cn-shenzhen.aliyuncs.com/tensorbytes/notebooks-access-management:v1.3.0-rc.0-a869b"
  Normal   Created    17m                   kubelet, pai-master  Created container kfam
  Normal   Started    17m                   kubelet, pai-master  Started container kfam
  Normal   Pulling    16m (x4 over 17m)     kubelet, pai-master  Pulling image "registry.cn-shenzhen.aliyuncs.com/tensorbytes/notebooks-profile-controller:v1.3.0-rc.0-ce3b3"
  Normal   Pulled     16m (x4 over 17m)     kubelet, pai-master  Successfully pulled image "registry.cn-shenzhen.aliyuncs.com/tensorbytes/notebooks-profile-controller:v1.3.0-rc.0-ce3b3"
  Normal   Created    16m (x4 over 17m)     kubelet, pai-master  Created container manager
  Normal   Started    16m (x4 over 17m)     kubelet, pai-master  Started container manager
  Warning  BackOff    2m40s (x68 over 17m)  kubelet, pai-master  Back-off restarting failed container

The pod logs show:

root@pai-master:/home/xxx/kubeflow-manifests/storage# kubectl logs  profiles-deployment-f7bfd656-b2jgq -c kfam -n kubeflow
time="2021-09-22T03:35:10Z" level=info msg="Server started"
2021/09/22 03:35:46 GET /metrics PrometheusMetrics 13.324579ms
2021/09/22 03:36:16 GET /metrics PrometheusMetrics 2.284134ms
2021/09/22 03:36:46 GET /metrics PrometheusMetrics 1.932439ms
2021/09/22 03:37:16 GET /metrics PrometheusMetrics 4.683953ms
2021/09/22 03:37:46 GET /metrics PrometheusMetrics 9.086798ms
2021/09/22 03:38:16 GET /metrics PrometheusMetrics 3.951332ms
2021/09/22 03:38:46 GET /metrics PrometheusMetrics 5.274586ms

root@pai-master:/home/xxx/kubeflow-manifests/storage# kubectl logs  profiles-deployment-f7bfd656-b2jgq -c manager -n kubeflow
I0922 04:02:45.122337       1 request.go:645] Throttling request took 1.037453555s, request: GET:https://10.192.0.1:443/apis/apiextensions.k8s.io/v1beta1?timeout=32s
2021-09-22T04:02:46.275Z        INFO    controller-runtime.metrics      metrics server is starting to listen    {"addr": ":8080"}
2021-09-22T04:02:46.275Z        INFO    setup   starting manager
2021-09-22T04:02:46.284Z        INFO    controller-runtime.manager      starting metrics server {"path": "/metrics"}
2021-09-22T04:02:46.284Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "kubeflow.org", "reconcilerKind": "Profile", "controller": "profile", "source": "kind source: /, Kind="}
2021-09-22T04:02:49.977Z        ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "Profile.kubeflow.org", "error": "no matches for kind \"Profile\" in version \"kubeflow.org/v1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:143
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:184
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/manager/internal.go:661
2021-09-22T04:02:49.977Z        ERROR   setup   problem running manager {"error": "no matches for kind \"Profile\" in version \"kubeflow.org/v1\""}
runtime.main
        /usr/local/go/src/runtime/proc.go:204
root@pai-master:/home/xxx/kubeflow-manifests/storage#

After reinstalling, I get the errors shown below:

unable to recognize "./manifest1.3/019-katib-installs-katib-with-kubeflow-cert-manager.yaml": no matches for kind "MutatingWebhookConfiguration" in version "admissionregistration.k8s.io/v1"

unable to recognize "./manifest1.3/019-katib-installs-katib-with-kubeflow-cert-manager.yaml": no matches for kind "ValidatingWebhookConfiguration" in version "admissionregistration.k8s.io/v1"

error: unable to recognize "./manifest1.3/024-profiles-overlays-kubeflow.yaml": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1"

error: unable to recognize "./manifest1.3/033-user-namespace-user-namespace-base.yaml": no matches for kind "Profile" in version "kubeflow.org/v1beta1"

error: unable to recognize "./patch/auth.yaml": no matches for kind "Profile" in version "kubeflow.org/v1beta1"

Error from server (NotFound): error when deleting "./patch/kfserving.yaml": configmaps "inferenceservice-config" not found
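
The "no matches for kind" messages above name the API versions that the cluster does not serve. As a minimal sketch (the function name `missing_api_versions` and the sample string are illustrative, not part of this project), the output can be grouped by missing apiVersion to make the pattern easier to see. Both `apiextensions.k8s.io/v1` and `admissionregistration.k8s.io/v1` first became available in Kubernetes 1.16, and the `Profile` failure is most likely a consequence of the Profile CRD never being created.

```python
import re
from typing import Dict, Set

# Matches kubectl's: no matches for kind "<Kind>" in version "<group/version>"
NO_MATCH = re.compile(r'no matches for kind "([^"]+)" in version "([^"]+)"')


def missing_api_versions(output: str) -> Dict[str, Set[str]]:
    """Group the kinds kubectl could not recognize by the apiVersion they require."""
    missing: Dict[str, Set[str]] = {}
    for kind, api_version in NO_MATCH.findall(output):
        missing.setdefault(api_version, set()).add(kind)
    return missing


# Error lines copied from the report above.
sample = '''
unable to recognize "./manifest1.3/019-katib-installs-katib-with-kubeflow-cert-manager.yaml": no matches for kind "MutatingWebhookConfiguration" in version "admissionregistration.k8s.io/v1"
unable to recognize "./manifest1.3/019-katib-installs-katib-with-kubeflow-cert-manager.yaml": no matches for kind "ValidatingWebhookConfiguration" in version "admissionregistration.k8s.io/v1"
error: unable to recognize "./manifest1.3/024-profiles-overlays-kubeflow.yaml": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1"
error: unable to recognize "./manifest1.3/033-user-namespace-user-namespace-base.yaml": no matches for kind "Profile" in version "kubeflow.org/v1beta1"
error: unable to recognize "./patch/auth.yaml": no matches for kind "Profile" in version "kubeflow.org/v1beta1"
'''

for api_version, kinds in sorted(missing_api_versions(sample).items()):
    print(api_version, sorted(kinds))
```

Any apiVersion this prints that is also absent from `kubectl api-versions` on the cluster cannot be applied there, whatever the manifest contains.
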

shikanon commented 2 years ago

@DC-y Try uninstalling and reinstalling. The error looks like a version incompatibility; you probably had Kubeflow installed before. Uninstall first:

kubectl delete -f manifest1.3/

Then install again:

python install.py

DC-y commented 2 years ago

I just reinstalled and still see the same problem. Which Kubernetes version did you install on?

shikanon commented 2 years ago

@DC-y My Kubernetes version is kindest/node:v1.16.9; anything 1.16+ should work.
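
The maintainer's reply puts the working range at Kubernetes 1.16+, while the cluster reported above runs v1.15.11. A pre-install check along the following lines would catch that mismatch early. This is only a sketch: `parse_git_version`, `is_supported`, and `MIN_VERSION` are hypothetical names, not part of `install.py`, and the minimum is taken from the "1.16+" remark rather than from any documented requirement.

```python
import re
from typing import Tuple

# Assumption: minimum taken from the maintainer's "1.16+" remark in this thread.
MIN_VERSION = (1, 16)


def parse_git_version(git_version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a GitVersion string such as 'v1.15.11'."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported(git_version: str, minimum: Tuple[int, int] = MIN_VERSION) -> bool:
    """Return True if the server version is at least the minimum (major, minor)."""
    return parse_git_version(git_version) >= minimum


print(is_supported("v1.15.11"))  # False: the cluster from this issue
print(is_supported("v1.16.9"))   # True: the maintainer's kind node image
```

The version string to pass in is the server's GitVersion field, as printed by `kubectl version` in the original report.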

DC-y commented 2 years ago

OK, thanks. I'll try again.