kubeflow / metadata

Repository for assets related to Metadata.
Apache License 2.0

metadata-deployment pod error out but not crashing #219

Closed samuel-sujith closed 4 years ago

samuel-sujith commented 4 years ago

/kind bug

What steps did you take and what happened:

1. Installed Kubeflow 1.0.
2. The metadata-deployment pod shows Running, but availability is 0/1:

metadata-deployment-956d59c56-5m7ck 0/1 Running 2 32m

The kubectl logs output for the pod is given below:

E0311 05:55:48.285674 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 1/10. Sleep 1.052s
E0311 05:55:49.339061 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 2/10. Sleep 2.44s
E0311 05:55:51.779886 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 3/10. Sleep 4.329s
E0311 05:55:56.109861 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 4/10.

This repeats itself every 2 mins or so.

What did you expect to happen: I expected the metadata-deployment pod to connect to the MySQL server and run normally.

Anything else you would like to add: The describe output for the pod is given below.

Events:
Type Reason Age From Message
Normal Pulled 35m kubelet, esamsuj04 Container image "gcr.io/kubeflow-images-public/metadata:v0.1.11" already present on machine
Normal Created 35m kubelet, esamsuj04 Created container container
Normal Started 35m kubelet, esamsuj04 Started container container
Normal Scheduled 35m default-scheduler Successfully assigned kubeflow/metadata-deployment-956d59c56-5m7ck to esamsuj04
Warning Unhealthy 48s (x419 over 35m) kubelet, esamsuj04 Readiness probe failed: Get http://10.244.3.66:8080/api/v1alpha1/artifact_types: dial tcp 10.244.3.66:8080: connect: connection refused

Environment: Ubuntu 16.04 LTS on a VM

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
bug 1.00


samuel-sujith commented 4 years ago

I think there is an issue with the connections between MySQL and the components that depend on it. Below is an example from the logs of the Katib DB manager pod:

root@esamsuj01:~/manifests# kubectl logs katib-db-manager-54b64f99b-kjgcs -n kubeflow
E0310 07:52:11.531818 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:16.531753 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:21.531756 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:26.531882 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:31.531735 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:36.531760 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:41.531746 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:46.531752 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:51.531751 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:52:56.531898 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
E0310 07:53:01.531870 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.105.210.35:3306: connect: connection refused
I0310 07:53:05.540572 1 init.go:11] Initializing v1alpha3 DB schema
I0310 07:53:05.733286 1 main.go:92] Start Katib manager: 0.0.0.0:6789

samuel-sujith commented 4 years ago

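I just noticed that line 48 of metadata/server/main.go says:

mySQLServiceHost = flag.String("mysql_service_host", "localhost", "MySQL Service Hostname.")

I don't think this default is overridden when the container starts; I can see that only http_port is overridden. That would explain the error: with the host set to localhost, the MySQL client connects through the local Unix socket (/tmp/mysql.sock) instead of over TCP to the database service.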

spec: containers:
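To illustrate the point about the default, here is a minimal sketch of how Go's standard flag package behaves when a flag is or is not passed. Only the mysql_service_host declaration is taken from main.go as quoted in this thread. The function name `parseHost`, the type and default of `http_port`, and the service name `metadata-db` are assumptions made up for the example, not the project's actual values.

```go
package main

import (
	"flag"
	"fmt"
)

// parseHost (hypothetical) declares the flag the same way as the line
// quoted from main.go and returns the host value after parsing args.
func parseHost(args []string) (string, error) {
	fs := flag.NewFlagSet("server", flag.ContinueOnError)
	host := fs.String("mysql_service_host", "localhost", "MySQL Service Hostname.")
	// http_port is declared so the sample arguments parse; its type and
	// default value here are assumptions for this sketch.
	fs.Int("http_port", 8080, "HTTP port.")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *host, nil
}

func main() {
	// Only http_port is passed: the host silently stays at its default.
	def, _ := parseHost([]string{"--http_port=8080"})
	fmt.Println(def) // prints "localhost"

	// With an explicit override ("metadata-db" is a placeholder name).
	over, _ := parseHost([]string{"--http_port=8080", "--mysql_service_host=metadata-db"})
	fmt.Println(over) // prints "metadata-db"
}
```

The sketch suggests that the override has to be one of the arguments the container starts the server with, next to http_port.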

samuel-sujith commented 4 years ago

I added the override for mysql_service_host in the container, and it now seems to be working fine.

root@esamsuj01:~/manifests/metadata/base# kubectl logs metadata-deployment-c54bd55b4-f24rs -n kubeflow
E0312 10:13:04.801148 1 register.go:68] Ignored unknown category "container" with type "workspace" in "http://github.com/kubeflow/metadata/schema/alpha/containers/workspace.json"

doctorai-in commented 4 years ago

samuel-sujith wrote: "Added the override for mysql_service_host in the container and now it seems to be working fine"

root@esamsuj01:~/manifests/metadata/base# kubectl logs metadata-deployment-c54bd55b4-f24rs -n kubeflow
E0312 10:13:04.801148 1 register.go:68] Ignored unknown category "container" with type "workspace" in "http://github.com/kubeflow/metadata/schema/alpha/containers/workspace.json"

E0927 03:59:14.528452 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 1/10. Sleep 1.052s
E0927 03:59:15.583236 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 2/10. Sleep 2.44s
E0927 03:59:18.025071 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 3/10. Sleep 4.329s
E0927 03:59:22.355867 1 main.go:98] Failed to create ML Metadata Store: mysql_real_connect failed: errno: 2002, error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2). Retry 4/10. Sleep 7.75s

Where should we override mysql_service_host in the container?

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 0.92
