Closed kwokon0ng closed 2 years ago
I am getting this same issue. Did you manage to fix it? If so, how?
No luck, still same issue.
Hi @kwokon0ng and @11mhg, I think you forgot to disable the Istio sidecar (sidecar.istio.io/inject: "false") for your Training containers.
Please check step 3 here: https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-algorithm.
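For anyone generating their Experiment spec programmatically, here is a minimal sketch of where that annotation goes, assuming the trial template wraps a batch/v1 Job. The helper name disable_istio_sidecar and the sample trial_spec dict are hypothetical and only for illustration; the annotation key and its "false" string value are the ones quoted above.

```python
import copy

# Annotation named in the comment above; the value must be the string "false".
ISTIO_INJECT_ANNOTATION = "sidecar.istio.io/inject"


def disable_istio_sidecar(trial_spec):
    """Return a copy of a Job-style trialSpec dict with sidecar injection disabled.

    Hypothetical helper: it only edits a plain dict and does not talk to the cluster.
    """
    patched = copy.deepcopy(trial_spec)
    # The annotation belongs on the pod template, not on the Job itself.
    pod_template = patched.setdefault("spec", {}).setdefault("template", {})
    metadata = pod_template.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations[ISTIO_INJECT_ANNOTATION] = "false"
    return patched


# Illustrative trialSpec; the container name and layout are assumptions.
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "training-container",
                        "image": "docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727",
                    }
                ],
                "restartPolicy": "Never",
            }
        }
    },
}

patched_spec = disable_istio_sidecar(trial_spec)
```

The patched dict can then be dumped back to YAML or passed to whatever client you use to create the Experiment. The original dict is left untouched.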
Thank you, it works now.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, please help with my issue.
Environment:
Katib version: v0.11.1
Kubernetes version: v1.20.7 (minikube)
Installed from the latest manifests as of July 12, 2021: https://github.com/kubeflow/manifests/archive/refs/heads/master.zip
I tried twice with different experiment names, "random-example" and "random-example-2".
NAMESPACE   NAME                                       READY   STATUS     RESTARTS   AGE
kubeflow    random-example-2-78plm8nq-d94xl            2/3     NotReady   0          24h
kubeflow    random-example-2-random-56f555d986-xj6fl   1/1     Running    0          24h
kubeflow    random-example-2-txz7gzcm-sbhfl            2/3     NotReady   0          24h
kubeflow    random-example-2-zxfpdj99-wddq8            2/3     NotReady   0          24h
kubeflow    random-example-jnchd92g-6hzxm              2/3     NotReady   0          24h
kubeflow    random-example-ljkfjskn-7zvpz              2/3     NotReady   0          24h
kubeflow    random-example-mnj7dqw9-sptgh              2/3     NotReady   0          24h
kubeflow    random-example-random-64599d9574-p7mh7     1/1     Running    0          24h
Both experiments have been stuck for a day already:
NAME               TYPE      STATUS   AGE
random-example     Running   True     24h
random-example-2   Running   True     24h
Below is the describe pod output for random-example-2-78plm8nq-d94xl. I don't see an error or a reason for the "NotReady" state.
* describe pod starts ***
Name: random-example-2-78plm8nq-d94xl
Namespace: kubeflow
Priority: 0
Node: minikube-kf1.3/192.168.58.2
Start Time: Mon, 12 Jul 2021 22:17:56 -0400
Labels: controller-uid=59a82ade-3438-47f0-9a10-2fbc18dd9b68
istio.io/rev=default
job-name=random-example-2-78plm8nq
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=random-example-2-78plm8nq
service.istio.io/canonical-revision=latest
Annotations: kubectl.kubernetes.io/default-logs-container: training-container-2
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
sidecar.istio.io/status: {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status: Running
IP: 172.17.0.67
IPs:
IP: 172.17.0.67
Controlled By: Job/random-example-2-78plm8nq
Init Containers:
istio-init:
Container ID: docker://62ed18945f67c85dc1761acdfcc49bc3dddf99a826e1f25bb649d1cc3baaf1b2
Image: docker.io/istio/proxyv2:1.9.6
Image ID: docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
Port:
Host Port:
Args:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
Containers:
training-container-2:
Container ID: docker://5b8748c9df19c680641c2f24406f3c694d0251f5a1fa6d0c1dffdef9b1aec656
Image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727
Image ID: docker-pullable://kubeflowkatib/mxnet-mnist@sha256:9bbfc47d1fc369e79d0b4e83f26b3060941eb0d0792c758a4ce27b4bd90a6c48
Port:
Host Port:
Command:
sh
-c
Args:
python3 /opt/mxnet-mnist/mnist.py --batch-size=64 --lr=0.010837550665546624 --num-layers=3 --optimizer=ftrl 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 12 Jul 2021 22:18:00 -0400
Finished: Mon, 12 Jul 2021 22:35:35 -0400
Ready: False
Restart Count: 0
Environment:
Mounts:
/var/log/katib from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-ptcr5 (ro)
istio-proxy:
Container ID: docker://de101074cf5d0f1ed6073f67b0bc8e967da558e3bfcef80941540f8397c0f917
Image: docker.io/istio/proxyv2:1.9.6
Image ID: docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--serviceCluster
random-example-2-78plm8nq.kubeflow
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
--concurrency
2
State: Running
Started: Mon, 12 Jul 2021 22:18:02 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
Environment:
JWT_POLICY: third-party-jwt
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istiod.istio-system.svc:15012
POD_NAME: random-example-2-78plm8nq-d94xl (v1:metadata.name)
POD_NAMESPACE: kubeflow (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
CANONICAL_SERVICE: (v1:metadata.labels['service.istio.io/canonical-name'])
CANONICAL_REVISION: (v1:metadata.labels['service.istio.io/canonical-revision'])
PROXY_CONFIG: {"tracing":{}}
metrics-logger-and-collector:
Container ID: docker://2e9ec25e658ff7c1f93be9df9917389dd96aa8b0a12e6e8941621918e6225d70
Image: docker.io/kubeflowkatib/file-metrics-collector:v0.11.1
Image ID: docker-pullable://kubeflowkatib/file-metrics-collector@sha256:8e846b945b72c74269b5278fa282644537a54fb99f3f2e4b4c7f332117c253b8
Port:
Host Port:
Args:
-t
random-example-2-78plm8nq
-m
Validation-accuracy;Train-accuracy
-o-type
maximize
-s-db
katib-db-manager.kubeflow:6789
-path
/var/log/katib/metrics.log
State: Running
Started: Mon, 12 Jul 2021 22:18:02 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
Requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
Environment:
Mounts:
/var/log/katib from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-ptcr5 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit:
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit:
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
limits.cpu -> cpu-limit
requests.cpu -> cpu-request
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
default-token-ptcr5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-ptcr5
Optional: false
metrics-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit:
QoS Class: Burstable
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
* describe pod ends ***
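The pattern in the output above (training-container-2 is Terminated with reason Completed and exit code 0, while istio-proxy is still Running) can be checked for mechanically. The following is a rough sketch, assuming you have the pod as a dict shaped like the output of kubectl get pod -o json. The function name finished_but_sidecar_running is hypothetical, and the sample dict is hand-built from the describe output rather than captured from a cluster.

```python
def finished_but_sidecar_running(pod, sidecar_name="istio-proxy"):
    """Return True if some container exited with code 0 while the sidecar still runs.

    Hypothetical diagnostic helper: it only inspects the pod dict it is given.
    """
    statuses = pod.get("status", {}).get("containerStatuses", [])
    sidecar_running = False
    workload_finished = False
    for status in statuses:
        state = status.get("state", {})
        if status.get("name") == sidecar_name:
            # The sidecar counts as running if its state has a "running" entry.
            sidecar_running = "running" in state
        elif "terminated" in state and state["terminated"].get("exitCode") == 0:
            workload_finished = True
    return sidecar_running and workload_finished


# Container states copied by hand from the describe output above.
pod = {
    "metadata": {"name": "random-example-2-78plm8nq-d94xl"},
    "status": {
        "containerStatuses": [
            {
                "name": "training-container-2",
                "state": {"terminated": {"exitCode": 0, "reason": "Completed"}},
            },
            {"name": "istio-proxy", "state": {"running": {}}},
            {"name": "metrics-logger-and-collector", "state": {"running": {}}},
        ]
    },
}

stuck = finished_but_sidecar_running(pod)
```

A True result only says the pod matches this pattern. It does not prove the sidecar is the cause, but it is a reasonable hint to check the sidecar.istio.io/inject annotation on the trial template.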