kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

katib random-example experiment pod stuck in NotReady state #1577

Closed kwokon0ng closed 2 years ago

kwokon0ng commented 3 years ago

Hi, please help with my issue.

Environment:
katib version: v0.11.1
Kubernetes version: v1.20.7 (minikube)
Installed from the latest manifest as of July 12, 2021: https://github.com/kubeflow/manifests/archive/refs/heads/master.zip

Tried twice with different experiment names, "random-example" and "random-example-2":

NAMESPACE   NAME                                       READY   STATUS     RESTARTS   AGE
kubeflow    random-example-2-78plm8nq-d94xl            2/3     NotReady   0          24h
kubeflow    random-example-2-random-56f555d986-xj6fl   1/1     Running    0          24h
kubeflow    random-example-2-txz7gzcm-sbhfl            2/3     NotReady   0          24h
kubeflow    random-example-2-zxfpdj99-wddq8            2/3     NotReady   0          24h
kubeflow    random-example-jnchd92g-6hzxm              2/3     NotReady   0          24h
kubeflow    random-example-ljkfjskn-7zvpz              2/3     NotReady   0          24h
kubeflow    random-example-mnj7dqw9-sptgh              2/3     NotReady   0          24h
kubeflow    random-example-random-64599d9574-p7mh7     1/1     Running    0          24h

Both experiments have been stuck for a day already:

NAME               TYPE      STATUS   AGE
random-example     Running   True     24h
random-example-2   Running   True     24h

Below is the describe pod output for random-example-2-78plm8nq-d94xl. I don't see any error or reason for the "NotReady" state.

*** describe pod starts ***
Name:         random-example-2-78plm8nq-d94xl
Namespace:    kubeflow
Priority:     0
Node:         minikube-kf1.3/192.168.58.2
Start Time:   Mon, 12 Jul 2021 22:17:56 -0400
Labels:       controller-uid=59a82ade-3438-47f0-9a10-2fbc18dd9b68
              istio.io/rev=default
              job-name=random-example-2-78plm8nq
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=random-example-2-78plm8nq
              service.istio.io/canonical-revision=latest
Annotations:  kubectl.kubernetes.io/default-logs-container: training-container-2
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status:       Running
IP:           172.17.0.67
IPs:
  IP:           172.17.0.67
Controlled By:  Job/random-example-2-78plm8nq
Init Containers:
  istio-init:
    Container ID:  docker://62ed18945f67c85dc1761acdfcc49bc3dddf99a826e1f25bb649d1cc3baaf1b2
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 12 Jul 2021 22:17:59 -0400
      Finished:     Mon, 12 Jul 2021 22:17:59 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ptcr5 (ro)

Containers:
  training-container-2:
    Container ID:  docker://5b8748c9df19c680641c2f24406f3c694d0251f5a1fa6d0c1dffdef9b1aec656
    Image:         docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727
    Image ID:      docker-pullable://kubeflowkatib/mxnet-mnist@sha256:9bbfc47d1fc369e79d0b4e83f26b3060941eb0d0792c758a4ce27b4bd90a6c48
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      python3 /opt/mxnet-mnist/mnist.py --batch-size=64 --lr=0.010837550665546624 --num-layers=3 --optimizer=ftrl 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 12 Jul 2021 22:18:00 -0400
      Finished:     Mon, 12 Jul 2021 22:35:35 -0400
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ptcr5 (ro)
  istio-proxy:
    Container ID:  docker://de101074cf5d0f1ed6073f67b0bc8e967da558e3bfcef80941540f8397c0f917
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      random-example-2-78plm8nq.kubeflow
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Mon, 12 Jul 2021 22:18:02 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      random-example-2-78plm8nq-d94xl (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {"tracing":{}}

      ISTIO_META_POD_PORTS:          [
                                     ]
      ISTIO_META_APP_CONTAINERS:     training-container-2
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      random-example-2-78plm8nq
      ISTIO_META_OWNER:              kubernetes://apis/batch/v1/namespaces/kubeflow/jobs/random-example-2-78plm8nq
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ptcr5 (ro)
      /var/run/secrets/tokens from istio-token (rw)

  metrics-logger-and-collector:
    Container ID:  docker://2e9ec25e658ff7c1f93be9df9917389dd96aa8b0a12e6e8941621918e6225d70
    Image:         docker.io/kubeflowkatib/file-metrics-collector:v0.11.1
    Image ID:      docker-pullable://kubeflowkatib/file-metrics-collector@sha256:8e846b945b72c74269b5278fa282644537a54fb99f3f2e4b4c7f332117c253b8
    Port:          <none>
    Host Port:     <none>
    Args:
      -t
      random-example-2-78plm8nq
      -m
      Validation-accuracy;Train-accuracy
      -o-type
      maximize
      -s-db
      katib-db-manager.kubeflow:6789
      -path
      /var/log/katib/metrics.log
    State:          Running
      Started:      Mon, 12 Jul 2021 22:18:02 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ptcr5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  default-token-ptcr5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-ptcr5
    Optional:    false
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

*** describe pod ends ***
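
One way to read the describe output above, as a minimal Python sketch. The container names, states, and Ready flags are copied from the output; the helper functions are hypothetical and only illustrate how the READY column (2/3) comes about. The single container that is not Ready is the training container, which has already exited with code 0, while the injected istio-proxy sidecar and the metrics collector are still running.

```python
# Container states copied from the describe output above.
containers = [
    {"name": "training-container-2", "state": "Terminated", "reason": "Completed", "ready": False},
    {"name": "istio-proxy", "state": "Running", "reason": None, "ready": True},
    {"name": "metrics-logger-and-collector", "state": "Running", "reason": None, "ready": True},
]


def ready_column(containers):
    """Rebuild the READY column of `kubectl get pods` as 'ready/total'."""
    ready = sum(1 for c in containers if c["ready"])
    return f"{ready}/{len(containers)}"


def not_ready(containers):
    """Names of containers whose Ready flag is False."""
    return [c["name"] for c in containers if not c["ready"]]


def still_running(containers):
    """Names of containers that have not exited yet."""
    return [c["name"] for c in containers if c["state"] == "Running"]


print("READY:", ready_column(containers))
print("Not ready:", not_ready(containers))
print("Still running:", still_running(containers))
```

Read this way, the training code itself finished successfully; the pod as a whole has not finished because the other two containers are still running.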

11mhg commented 3 years ago

I am getting this same issue. Did you manage to fix it? If so, how?

kwokon0ng commented 3 years ago

No luck, still the same issue.

andreyvelich commented 3 years ago

Hi @kwokon0ng and @11mhg, I think you forgot to disable Istio sidecar injection (the sidecar.istio.io/inject: "false" annotation) for your training pods. Please check step 3 here: https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-algorithm.
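
A minimal sketch of where that annotation goes, written in Python against an Experiment loaded as a plain dict. It assumes a v1beta1 Experiment whose trialSpec is a batch/v1 Job (the describe output in this issue shows the pod is controlled by a Job), so the pod template sits at spec.trialTemplate.trialSpec.spec.template. The function name and the abbreviated example dict are hypothetical; the linked documentation is the authoritative reference.

```python
import copy

ANNOTATION = "sidecar.istio.io/inject"


def disable_istio_injection(experiment: dict) -> dict:
    """Return a copy of the Experiment with sidecar injection disabled.

    Assumes trialSpec is a batch/v1 Job, so the pod template lives at
    spec.trialTemplate.trialSpec.spec.template.
    """
    patched = copy.deepcopy(experiment)
    pod_template = patched["spec"]["trialTemplate"]["trialSpec"]["spec"]["template"]
    annotations = pod_template.setdefault("metadata", {}).setdefault("annotations", {})
    # Annotation values are strings, so this must be "false", not the boolean False.
    annotations[ANNOTATION] = "false"
    return patched


# Abbreviated, hypothetical Experiment: only the fields this sketch touches.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random-example", "namespace": "kubeflow"},
    "spec": {
        "trialTemplate": {
            "trialSpec": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [{"name": "training-container"}],
                            "restartPolicy": "Never",
                        }
                    }
                },
            }
        }
    },
}

patched = disable_istio_injection(experiment)
print(patched["spec"]["trialTemplate"]["trialSpec"]["spec"]["template"]["metadata"])
```

In the YAML manifest the same change is an annotations entry under the Job's pod template metadata. Trial pods created before the change keep their sidecar, so the experiment has to be recreated for the annotation to take effect.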

kwokon0ng commented 3 years ago

Thank you, it works now.
