katib random-example experiment pod stuck in NotReady state #1577

kwokon0ng commented 3 years ago

Hi, please help with my issue.

Environment:
Katib version: v0.11.1
Kubernetes version: v1.20.7 (minikube)
Installed from the latest manifests as of July 12, 2021: https://github.com/kubeflow/manifests/archive/refs/heads/master.zip

I tried twice with different experiment names, "random-example" and "random-example-2":

NAMESPACE   NAME                                       READY   STATUS     RESTARTS   AGE
kubeflow    random-example-2-78plm8nq-d94xl            2/3     NotReady   0          24h
kubeflow    random-example-2-random-56f555d986-xj6fl   1/1     Running    0          24h
kubeflow    random-example-2-txz7gzcm-sbhfl            2/3     NotReady   0          24h
kubeflow    random-example-2-zxfpdj99-wddq8            2/3     NotReady   0          24h
kubeflow    random-example-jnchd92g-6hzxm              2/3     NotReady   0          24h
kubeflow    random-example-ljkfjskn-7zvpz              2/3     NotReady   0          24h
kubeflow    random-example-mnj7dqw9-sptgh              2/3     NotReady   0          24h
kubeflow    random-example-random-64599d9574-p7mh7     1/1     Running    0          24h

The experiments have been stuck for a day already:

NAME               TYPE      STATUS   AGE
random-example     Running   True     24h
random-example-2   Running   True     24h

Below is the describe pod output for random-example-2-78plm8nq-d94xl. I don't see any error or reason for the "NotReady" state.

*** describe pod starts ***
Name:         random-example-2-78plm8nq-d94xl
Namespace:    kubeflow
Priority:     0
Node:         minikube-kf1.3/
Start Time:   Mon, 12 Jul 2021 22:17:56 -0400
Labels:       controller-uid=59a82ade-3438-47f0-9a10-2fbc18dd9b68
              istio.io/rev=default
              job-name=random-example-2-78plm8nq
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=random-example-2-78plm8nq
              service.istio.io/canonical-revision=latest
Annotations:  kubectl.kubernetes.io/default-logs-container: training-container-2
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status:       Running
IP:
IPs:
  IP:
Controlled By:  Job/random-example-2-78plm8nq
Init Containers:
  istio-init:
    Container ID:  docker://62ed18945f67c85dc1761acdfcc49bc3dddf99a826e1f25bb649d1cc3baaf1b2
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    Port:
    Host Port:
    Args:          istio-iptables -p 15001 -z 15006 -u 1337 -m REDIRECT -i * -x
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 12 Jul 2021 22:17:59 -0400
      Finished:     Mon, 12 Jul 2021 22:17:59 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ptcr5 (ro)

Containers:
  training-container-2:
    Container ID:  docker://5b8748c9df19c680641c2f24406f3c694d0251f5a1fa6d0c1dffdef9b1aec656
    Image:         docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727
    Image ID:      docker-pullable://kubeflowkatib/mxnet-mnist@sha256:9bbfc47d1fc369e79d0b4e83f26b3060941eb0d0792c758a4ce27b4bd90a6c48
    Port:
    Host Port:
    Command:
      sh
      -c
    Args:
      python3 /opt/mxnet-mnist/mnist.py --batch-size=64 --lr=0.010837550665546624 --num-layers=3 --optimizer=ftrl 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 12 Jul 2021 22:18:00 -0400
      Finished:     Mon, 12 Jul 2021 22:35:35 -0400
    Ready:          False
    Restart Count:  0
    Environment:
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ptcr5 (ro)
  istio-proxy:
    Container ID:  docker://de101074cf5d0f1ed6073f67b0bc8e967da558e3bfcef80941540f8397c0f917
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy sidecar --domain $(POD_NAMESPACE).svc.cluster.local --serviceCluster random-example-2-78plm8nq.kubeflow --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
    State:          Running
      Started:      Mon, 12 Jul 2021 22:18:02 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      random-example-2-78plm8nq-d94xl (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {"tracing":{}}
      ISTIO_META_POD_PORTS:          [
      ISTIO_META_APP_CONTAINERS:     training-container-2
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_WORKLOAD_NAME:      random-example-2-78plm8nq
      ISTIO_META_OWNER:              kubernetes://apis/batch/v1/namespaces/kubeflow/jobs/random-example-2-78plm8nq
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ptcr5 (ro)
      /var/run/secrets/tokens from istio-token (rw)

  metrics-logger-and-collector:
    Container ID:  docker://2e9ec25e658ff7c1f93be9df9917389dd96aa8b0a12e6e8941621918e6225d70
    Image:         docker.io/kubeflowkatib/file-metrics-collector:v0.11.1
    Image ID:      docker-pullable://kubeflowkatib/file-metrics-collector@sha256:8e846b945b72c74269b5278fa282644537a54fb99f3f2e4b4c7f332117c253b8
    Port:
    Host Port:
    Args:
      -t random-example-2-78plm8nq -m Validation-accuracy;Train-accuracy -o-type maximize -s-db katib-db-manager.kubeflow:6789 -path /var/log/katib/metrics.log
    State:          Running
      Started:      Mon, 12 Jul 2021 22:18:02 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ptcr5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  default-token-ptcr5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-ptcr5
    Optional:    false
  metrics-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
QoS Class:       Burstable
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:

*** describe pod ends ***

11mhg commented 3 years ago

I am getting this same issue. Did you manage to fix it? If so, how?

kwokon0ng commented 3 years ago

> I am getting this same issue. Did you manage to fix it? If so, how?

No luck, still same issue.

andreyvelich commented 3 years ago

Hi @kwokon0ng and @11mhg, I think you forgot to disable istio-sidecar (sidecar.istio.io/inject: "false") for your Training containers. Please check step 3 here: https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-algorithm.
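
For context, the describe output above is consistent with this: training-container-2 has terminated with exit code 0, but the injected istio-proxy container is still running, so the pod stays at 2/3 and the Job never completes. A minimal sketch of adding the annotation to the trial pod template follows, assuming the trial spec is a batch/v1 Job nested under `spec.trialTemplate.trialSpec` as in the v1beta1 examples. The helper name `disable_istio_injection` and the stripped-down manifest are hypothetical and only illustrate where the annotation goes.

```python
import copy


def disable_istio_injection(experiment: dict) -> dict:
    """Return a copy of an Experiment manifest whose trial pods opt out
    of Istio sidecar injection (hypothetical helper, plain dicts only)."""
    patched = copy.deepcopy(experiment)
    # Assumed layout: the trial spec is a Job under spec.trialTemplate.trialSpec.
    trial_spec = patched["spec"]["trialTemplate"]["trialSpec"]
    pod_template = trial_spec.setdefault("spec", {}).setdefault("template", {})
    annotations = pod_template.setdefault("metadata", {}).setdefault("annotations", {})
    # The value must be the string "false", not a boolean.
    annotations["sidecar.istio.io/inject"] = "false"
    return patched


# Stripped-down manifest for illustration; a real Experiment has more fields.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random-example", "namespace": "kubeflow"},
    "spec": {
        "trialTemplate": {
            "trialSpec": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "spec": {"template": {"spec": {"restartPolicy": "Never"}}},
            }
        }
    },
}

patched = disable_istio_injection(experiment)
```

In a YAML manifest the same setting sits at `spec.trialTemplate.trialSpec.spec.template.metadata.annotations`. Trials that are already stuck were created from the old template, so the experiment would likely need to be recreated for the change to take effect.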

kwokon0ng commented 3 years ago

> Hi @kwokon0ng and @11mhg, I think you forgot to disable istio-sidecar (sidecar.istio.io/inject: "false") for your Training containers. Please check step 3 here: https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-algorithm.

Thank you, it works now.

