shikanon / kubeflow-manifests

One-click Kubeflow installation files for users in mainland China
GNU General Public License v3.0

Not all pods run normally #59

Closed yuezhilanyi closed 3 years ago

yuezhilanyi commented 3 years ago

After creating a cluster with kind and running `python install.py` to install Kubeflow, not all of the pods come up.
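
For reference, a minimal reproduction sketch (the exact commands are assumptions based on the thread; the cluster name `kubeflow` is taken from the `kind load` command in the final comment):

```bash
# Assumed reproduction steps; exact flags may differ from the original setup.
kind create cluster --name kubeflow
git clone https://github.com/shikanon/kubeflow-manifests.git
cd kubeflow-manifests
python install.py
```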

`sudo kubectl get pods -A`

(screenshot: `kubectl get pods -A` output)

`sudo kubectl get pods -n kubeflow`

(screenshot: `kubectl get pods -n kubeflow` output)

yuezhilanyi commented 3 years ago

Details of the istio pod are as follows:

```
(base) yh@fuxi239:~$ sudo kubectl describe pod istio-ingressgateway-6c7fc58d56-zzgzr -n istio-system
Name:         istio-ingressgateway-6c7fc58d56-zzgzr
Namespace:    istio-system
Priority:     0
Node:         kubeflow-control-plane/172.18.0.2
Start Time:   Fri, 13 Aug 2021 17:12:37 +0800
Labels:       app=istio-ingressgateway
              chart=gateways
              heritage=Tiller
              install.operator.istio.io/owning-resource=unknown
              istio=ingressgateway
              istio.io/rev=default
              operator.istio.io/component=IngressGateways
              pod-template-hash=6c7fc58d56
              release=istio
              service.istio.io/canonical-name=istio-ingressgateway
              service.istio.io/canonical-revision=latest
              sidecar.istio.io/inject=false
Annotations:  prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/inject: false
Status:       Pending
IP:
IPs:          <none>
Controlled By:  ReplicaSet/istio-ingressgateway-6c7fc58d56
Containers:
  istio-proxy:
    Container ID:
    Image:       registry.cn-shenzhen.aliyuncs.com/tensorbytes/istio-proxyv2:1.9.0-e8a74
    Image ID:
    Ports:       15021/TCP, 8080/TCP, 8443/TCP, 31400/TCP, 15443/TCP, 15090/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Args:
      proxy
      router
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --serviceCluster
      istio-ingressgateway
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=1s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                   third-party-jwt
      PILOT_CERT_PROVIDER:          istiod
      CA_ADDR:                      istiod.istio-system.svc:15012
      NODE_NAME:                    (v1:spec.nodeName)
      POD_NAME:                     istio-ingressgateway-6c7fc58d56-zzgzr (v1:metadata.name)
      POD_NAMESPACE:                istio-system (v1:metadata.namespace)
      INSTANCE_IP:                  (v1:status.podIP)
      HOST_IP:                      (v1:status.hostIP)
      SERVICE_ACCOUNT:              (v1:spec.serviceAccountName)
      CANONICAL_SERVICE:            (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:           (v1:metadata.labels['service.istio.io/canonical-revision'])
      ISTIO_META_WORKLOAD_NAME:     istio-ingressgateway
      ISTIO_META_OWNER:             kubernetes://apis/apps/v1/namespaces/istio-system/deployments/istio-ingressgateway
      ISTIO_META_UNPRIVILEGED_POD:  true
      ISTIO_META_ROUTER_MODE:       standard
      ISTIO_META_CLUSTER_ID:        Kubernetes
    Mounts:
      /etc/istio/config from config-volume (rw)
      /etc/istio/ingressgateway-ca-certs from ingressgateway-ca-certs (ro)
      /etc/istio/ingressgateway-certs from ingressgateway-certs (ro)
      /etc/istio/pod from podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from istio-ingressgateway-service-account-token-5lnfq (ro)
      /var/run/secrets/tokens from istio-token (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio
    Optional:  true
  ingressgateway-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio-ingressgateway-certs
    Optional:    true
  ingressgateway-ca-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio-ingressgateway-ca-certs
    Optional:    true
  istio-ingressgateway-service-account-token-5lnfq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio-ingressgateway-service-account-token-5lnfq
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  4m52s (x81 over 171m)  kubelet  MountVolume.SetUp failed for volume "istio-token" : failed to fetch token: the server could not find the requested resource
  Warning  FailedMount  47s (x77 over 152m)    kubelet  (combined from similar events): MountVolume.SetUp failed for volume "istio-token" : failed to fetch token: the server could not find the requested resource
```
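
The `FailedMount` on `istio-token` is why this pod is stuck in `ContainerCreating`: the gateway runs with `JWT_POLICY: third-party-jwt`, which mounts a projected service-account token, and "the server could not find the requested resource" usually means the API server is not serving the TokenRequest API. A diagnostic sketch (not from the thread; the flags named in the message are the upstream kube-apiserver ones):

```bash
# Sketch: check for the "serviceaccounts/token" subresource that projected
# service-account tokens (JWT_POLICY=third-party-jwt) depend on.
if kubectl get --raw /api/v1 | grep -q '"serviceaccounts/token"'; then
  echo "TokenRequest API available"
else
  echo "TokenRequest API missing: configure --service-account-issuer and"
  echo "--service-account-signing-key-file on kube-apiserver, or install istio with JWT_POLICY=first-party-jwt"
fi
```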

yuezhilanyi commented 3 years ago

Here is one of the kubeflow pods; strangely, it cannot find proxyv2:

```
(base) yh@fuxi239:~$ sudo kubectl describe pod cache-deployer-deployment-7558d65bf4-ztlkh -n kubeflow
Name:         cache-deployer-deployment-7558d65bf4-ztlkh
Namespace:    kubeflow
Priority:     0
Node:         kubeflow-control-plane/172.18.0.2
Start Time:   Thu, 12 Aug 2021 10:07:33 +0800
Labels:       app=cache-deployer
              app.kubernetes.io/component=ml-pipeline
              app.kubernetes.io/name=kubeflow-pipelines
              application-crd-id=kubeflow-pipelines
              istio.io/rev=default
              pod-template-hash=7558d65bf4
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=kubeflow-pipelines
              service.istio.io/canonical-revision=latest
Annotations:  kubectl.kubernetes.io/default-logs-container: main
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/status: {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istiod-ca-cert"],"ima...
Status:       Pending
IP:           10.244.0.20
IPs:
  IP:           10.244.0.20
Controlled By:  ReplicaSet/cache-deployer-deployment-7558d65bf4
Init Containers:
  istio-init:
    Container ID:
    Image:       docker.io/istio/proxyv2:1.9.0
    Image ID:
    Port:        <none>
    Host Port:   <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Waiting
      Reason:       ErrImagePull
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-pipelines-cache-deployer-sa-token-tcgsh (ro)
Containers:
  main:
    Container ID:
    Image:       registry.cn-shenzhen.aliyuncs.com/tensorbytes/ml-pipeline-cache-deployer:1.5.0-rc.2-deb1e
    Image ID:
    Port:        <none>
    Host Port:   <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      NAMESPACE_TO_WATCH:  kubeflow (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-pipelines-cache-deployer-sa-token-tcgsh (ro)
  istio-proxy:
    Container ID:
    Image:       docker.io/istio/proxyv2:1.9.0
    Image ID:
    Port:        15090/TCP
    Host Port:   0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      cache-deployer.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    first-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      cache-deployer-deployment-7558d65bf4-ztlkh (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                   (v1:status.podIP)
      SERVICE_ACCOUNT:               (v1:spec.serviceAccountName)
      HOST_IP:                       (v1:status.hostIP)
      CANONICAL_SERVICE:             (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:            (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {}
      ISTIO_META_POD_PORTS:          [
                                     ]
      ISTIO_META_APP_CONTAINERS:     main
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      cache-deployer-deployment
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/cache-deployer-deployment
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-pipelines-cache-deployer-sa-token-tcgsh (ro)
Conditions:
  Type             Status
  Initialized      False
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kubeflow-pipelines-cache-deployer-sa-token-tcgsh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubeflow-pipelines-cache-deployer-sa-token-tcgsh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Warning  Failed   52m                  kubelet  Failed to pull image "docker.io/istio/proxyv2:1.9.0": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/istio/proxyv2:1.9.0": failed to resolve reference "docker.io/istio/proxyv2:1.9.0": failed to do request: Head https://registry-1.docker.io/v2/istio/proxyv2/manifests/1.9.0: dial tcp 54.161.109.204:443: i/o timeout
  Warning  Failed   34m                  kubelet  Failed to pull image "docker.io/istio/proxyv2:1.9.0": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/istio/proxyv2:1.9.0": failed to resolve reference "docker.io/istio/proxyv2:1.9.0": failed to do request: Head https://registry-1.docker.io/v2/istio/proxyv2/manifests/1.9.0: dial tcp 3.224.96.239:443: i/o timeout
  Warning  Failed   16m (x10 over 178m)  kubelet  Error: ErrImagePull
  Warning  Failed   16m                  kubelet  Failed to pull image "docker.io/istio/proxyv2:1.9.0": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/istio/proxyv2:1.9.0": failed to resolve reference "docker.io/istio/proxyv2:1.9.0": failed to do request: Head https://registry-1.docker.io/v2/istio/proxyv2/manifests/1.9.0: dial tcp 3.223.82.39:443: i/o timeout
  Normal   Pulling  15m (x11 over 3h6m)  kubelet  Pulling image "docker.io/istio/proxyv2:1.9.0"
```
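
The `dial tcp ...:443: i/o timeout` lines show the kind node itself cannot reach registry-1.docker.io, which is why the injected `istio-init` container never starts. A quick check from inside the node (a sketch, not from the thread; `crictl` ships in kind node images, and the node name comes from the describe output above):

```bash
# Sketch: attempt the pull from inside the kind node, where the real image
# pulls happen; the host's docker cache and proxy settings do not apply here.
docker exec kubeflow-control-plane crictl pull docker.io/istio/proxyv2:1.9.0
```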

yuezhilanyi commented 3 years ago

I've confirmed that proxyv2 exists in both the docker and containerd environments, yet many pods still keep trying to pull it from the remote registry.

(screenshot: local image list)

yuezhilanyi commented 3 years ago

I suspect the images were imported into the wrong place, because the list contains none of the images prefixed with registry.cn-shenzhen.aliyuncs.com/tensorbytes/ that the config files reference. How should this be fixed?
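
One way to verify where the images actually landed (a sketch, assuming `install.py` pulled them with the host's docker): kubelet in a kind cluster only sees the node container's containerd store, so list the images inside the node:

```bash
# Sketch: list images visible to kubelet (the kind node's containerd),
# which is separate from the host's `docker images` list.
docker exec kubeflow-control-plane crictl images | grep tensorbytes
```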

shikanon commented 3 years ago

@yuezhilanyi These images are publicly available; can you try pulling them directly with `docker pull` and see?

shikanon commented 3 years ago

@yuezhilanyi proxyv2 comes from the Docker Hub registry; could you be hitting Docker Hub's pull rate limit, so the pull fails?

yuezhilanyi commented 3 years ago

@shikanon Yes, it pulls fine, and the image is already present locally.

(screenshot: local docker image list)

jxhsjxhs commented 3 years ago

The pull policy is to pull every time instead of using the local copy, so it reaches out to the registry on every attempt and then hits the rate limit.

yuezhilanyi commented 3 years ago

> The pull policy is to pull every time instead of using the local copy, so it reaches out to the registry on every attempt and then hits the rate limit.

@jxhsjxhs How do I make it use the local copy? In 006-istio I changed the `Always` in `imagePullPolicy: "{{ valueOrDefault .Values.global.imagePullPolicy Always }}"` to `IfNotPresent`, but it still doesn't work.
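
A likely reason the edit has no effect (an assumption, not confirmed in the thread): `IfNotPresent` only skips the pull when the image already exists in the node's containerd store, and an image pulled with docker on the host never reaches that store by itself. A sketch to confirm whether the node has the image at all:

```bash
# Sketch: if this prints nothing, kubelet must pull regardless of
# imagePullPolicy, because the image is absent from the node's store.
docker exec kubeflow-control-plane crictl images | grep 'istio/proxyv2'
```

Loading the image into the node (as in the final comment below) is what lets `IfNotPresent`, or even the default policy, find a local copy.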

jxhsjxhs commented 3 years ago

@yuezhilanyi Hmm, I haven't found a way either. When I installed yesterday it also kept showing pull failures; I left it alone and it was fine today. I'll keep looking.

yuezhilanyi commented 3 years ago

Solved it by loading the image into the cluster:

```bash
sudo kind load docker-image docker.io/istio/proxyv2:1.9.0 --name kubeflow
```
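
The same pattern generalizes to any other image the node fails to pull; a sketch (the pilot image in the list is illustrative, not taken from this thread):

```bash
# Sketch: pre-pull on the host (which has registry access), then load each
# image into the kind cluster so kubelet finds it locally.
for img in docker.io/istio/proxyv2:1.9.0 docker.io/istio/pilot:1.9.0; do
  sudo docker pull "$img"
  sudo kind load docker-image "$img" --name kubeflow
done
```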