Closed samuel-sujith closed 4 years ago
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
bug | 1.00 |
I think there is an issue with the connection between MySQL and the components that depend on it. Below is an example from the katib-db-manager logs:
root@esamsuj01:~/manifests# kubectl logs katib-db-manager-54b64f99b-kjgcs -n kubeflow
E0310 07:52:11.531818 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:16.531753 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:21.531756 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:26.531882 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:31.531735 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:36.531760 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:41.531746 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:46.531752 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:51.531751 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:56.531898 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:53:01.531870 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
I0310 07:53:05.540572 1 init.go:11] Initializing v1alpha3 DB schema
I0310 07:53:05.733286 1 main.go:92] Start Katib manager: 0.0.0.0:6789
I just saw that line 48 of metadata/server/main.go reads:
mySQLServiceHost = flag.String("mysql_service_host", "localhost", "MySQL Service Hostname.")
I don't think this default is overridden when the container starts. As far as I can see, only http_port is overridden.
spec: containers:
I added the override for mysql_service_host in the container, and now it seems to be working fine:
root@esamsuj01:~/manifests/metadata/base# kubectl logs metadata-deployment-c54bd55b4-f24rs -n kubeflow
E0312 10:13:04.801148 1 register.go:68] Ignored unknown category "container" with type "workspace" in "http://github.com/kubeflow/metadata/schema/alpha/containers/workspace.json"
E0927 03:59:14.528452 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 1/10. Sleep 1.052s
E0927 03:59:15.583236 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 2/10. Sleep 2.44s
E0927 03:59:18.025071 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 3/10. Sleep 4.329s
E0927 03:59:22.355867 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 4/10. Sleep 7.75s
Where should we override mysql_service_host in the container?
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
area/katib | 0.92 |
/kind bug
What steps did you take and what happened:
- Installed Kubeflow 1.0
- metadata-deployment pod shows Running but availability is 0/1
metadata-deployment-956d59c56-5m7ck 0/1 Running 2 32m
kubectl logs information for the pod is given below
E0311 05:55:48.285674 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 1/10. Sleep 1.052s
E0311 05:55:49.339061 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 2/10. Sleep 2.44s
E0311 05:55:51.779886 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 3/10. Sleep 4.329s
E0311 05:55:56.109861 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 4/10.
This repeats itself every 2 mins or so.
What did you expect to happen: I expected the metadata-deployment pod to connect to the MySQL server and work normally.
Anything else you would like to add: Describe information for the pod is given below
Events:
Type Reason Age From Message
Normal Pulled 35m kubelet, esamsuj04 Container image "gcr.io/kubeflow-images-public/metadata:v0.1.11" already present on machine
Normal Created 35m kubelet, esamsuj04 Created container container
Normal Started 35m kubelet, esamsuj04 Started container container
Normal Scheduled 35m default-scheduler Successfully assigned kubeflow/metadata-deployment-956d59c56-5m7ck to esamsuj04
Warning Unhealthy 48s (x419 over 35m) kubelet, esamsuj04 Readiness probe failed: Get http://10.244.3.66:8080/api/v1alpha1/artifact_types: dial tcp 10.244.3.66:8080: connect: connection refused
Environment: Ubuntu 16.04 LTS on a VM
Metadata version: One that comes with KF 1.0.0
Kubeflow version: 1.0.0
Minikube version:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial