kubeflow / example-seldon

Example for end-to-end machine learning on Kubernetes using Kubeflow and Seldon Core
Apache License 2.0

predict fails and seldondeployment missing .status #35

Open DavidLangworthy opened 5 years ago

DavidLangworthy commented 5 years ago

@cliveseldon Calling predict on a deployment that reported success fails with a connection error. Attempting to debug this reveals that .status is missing from the seldondeployment. Any suggestions for how to debug this?

!kubectl get seldondeployments mnist-classifier -o jsonpath='{.status}'

returns nothing

!kubectl get seldondeployments mnist-classifier -o json

returns

{
  "apiVersion": "machinelearning.seldon.io/v1alpha2",
  "kind": "SeldonDeployment",
  "metadata": {
    "annotations": {
      "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",\"kind\":\"SeldonDeployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"seldon\"},\"name\":\"mnist-classifier\",\"namespace\":\"kubeflow\"},\"spec\":{\"annotations\":{\"deployment_version\":\"v1\",\"project_name\":\"MNIST Example\",\"seldon.io/engine-separate-pod\":\"false\",\"seldon.io/rest-connection-timeout\":\"100\"},\"name\":\"mnist-classifier\",\"predictors\":[{\"annotations\":{\"predictor_version\":\"v1\"},\"componentSpecs\":[{\"spec\":{\"containers\":[{\"image\":\"seldonio/deepmnistclassifier_runtime:0.2\",\"imagePullPolicy\":\"Always\",\"name\":\"tf-model\",\"volumeMounts\":[{\"mountPath\":\"/data\",\"name\":\"persistent-storage\"}]}],\"terminationGracePeriodSeconds\":1,\"volumes\":[{\"name\":\"persistent-storage\",\"volumeSource\":{\"persistentVolumeClaim\":{\"claimName\":\"nfs-1\"}}}]}}],\"graph\":{\"children\":[],\"endpoint\":{\"type\":\"REST\"},\"name\":\"tf-model\",\"type\":\"MODEL\"},\"name\":\"mnist-classifier\",\"replicas\":1}]}}\n"
    },
    "creationTimestamp": "2019-04-18T21:26:32Z",
    "generation": 1,
    "labels": {
      "app": "seldon"
    },
    "name": "mnist-classifier",
    "namespace": "kubeflow",
    "resourceVersion": "128631",
    "selfLink": "/apis/machinelearning.seldon.io/v1alpha2/namespaces/kubeflow/seldondeployments/mnist-classifier",
    "uid": "a3450e71-6220-11e9-a023-da0ed60f5a55"
  },
  "spec": {
    "annotations": {
      "deployment_version": "v1",
      "project_name": "MNIST Example",
      "seldon.io/engine-separate-pod": "false",
      "seldon.io/rest-connection-timeout": "100"
    },
    "name": "mnist-classifier",
    "predictors": [
      {
        "annotations": {
          "predictor_version": "v1"
        },
        "componentSpecs": [
          {
            "spec": {
              "containers": [
                {
                  "image": "seldonio/deepmnistclassifier_runtime:0.2",
                  "imagePullPolicy": "Always",
                  "name": "tf-model",
                  "volumeMounts": [
                    {
                      "mountPath": "/data",
                      "name": "persistent-storage"
                    }
                  ]
                }
              ],
              "terminationGracePeriodSeconds": 1,
              "volumes": [
                {
                  "name": "persistent-storage",
                  "volumeSource": {
                    "persistentVolumeClaim": {
                      "claimName": "nfs-1"
                    }
                  }
                }
              ]
            }
          }
        ],
        "graph": {
          "children": [],
          "endpoint": {
            "type": "REST"
          },
          "name": "tf-model",
          "type": "MODEL"
        },
        "name": "mnist-classifier",
        "replicas": 1
      }
    ]
  }
}
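
For anyone hitting the same symptom, here is a minimal Python sketch of the `.status` check that can be run from the same notebook. The helper names `describe_status` and `fetch_and_describe` are made up for illustration, and the `kubeflow` namespace default is taken from the metadata above; only the `kubectl get seldondeployments ... -o json` call comes from this thread.

```python
import json
import subprocess


def describe_status(sdep_json):
    """Report whether a SeldonDeployment JSON document carries a .status block."""
    resource = json.loads(sdep_json)
    name = resource.get("metadata", {}).get("name", "<unnamed>")
    status = resource.get("status")
    if not status:
        # Same symptom as in this issue: the resource exists, but nothing
        # appears to have written a status back to it.
        return f"{name}: no .status"
    return f"{name}: .status present with keys {sorted(status)}"


def fetch_and_describe(name, namespace="kubeflow"):
    """Hypothetical wrapper: fetch the resource with kubectl, then describe it."""
    result = subprocess.run(
        ["kubectl", "get", "seldondeployments", name, "-n", namespace, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return describe_status(result.stdout)
```

Calling `fetch_and_describe("mnist-classifier")` against the resource shown above would be expected to report that `.status` is missing, since the JSON has only `apiVersion`, `kind`, `metadata` and `spec`.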

ukclivecox commented 5 years ago

Can you check the logs of the cluster-manager and check that the pods are running? There should always be a status, so we need to track this down further.

DavidLangworthy commented 5 years ago

What specifically do I need to look for? Kubeflow starts up so much it's hard to find my way around.

DavidLangworthy commented 5 years ago

!kubectl get pods -n kubeflow

NAME                                                        READY   STATUS             RESTARTS   AGE
ambassador-c9647fb66-fl4zr                                  1/1     Running            0          1d
ambassador-c9647fb66-g6n9r                                  1/1     Running            0          1d
ambassador-c9647fb66-z7p27                                  1/1     Running            0          1d
argo-ui-755fcfc656-s2rgl                                    1/1     Running            0          1d
centraldashboard-7c948d9df6-jh8zj                           1/1     Running            0          1d
jupyter-0                                                   1/1     Running            0          1d
jupyter-web-app-6ffc57d749-mqtgr                            0/1     CrashLoopBackOff   318        1d
katib-ui-6dc644d54-jg6mj                                    1/1     Running            0          1d
kubeflow-r-train-srxtq-1399384440                           0/1     Completed          0          23h
kubeflow-sk-train-6llnn-122502152                           0/1     Completed          0          23h
kubeflow-tf-train-nc5kg-1269457206                          0/1     Completed          0          23h
metacontroller-0                                            1/1     Running            0          1d
minio-b7595688d-4xhbq                                       1/1     Running            0          1d
ml-pipeline-59459675dd-npjh6                                1/1     Running            0          1d
ml-pipeline-persistenceagent-7f6d4555d7-hdkmn               1/1     Running            1          1d
ml-pipeline-scheduledworkflow-5f4d44fb4f-65xt9              1/1     Running            0          1d
ml-pipeline-ui-f5d595697-z8cl5                              1/1     Running            0          1d
ml-pipeline-viewer-controller-deployment-5b4954fb4c-4ldm8   1/1     Running            0          1d
mnist-train-5-worker-0                                      0/1     Completed          0          23h
mykubeflowapp2-controller-b5677fccf-5fpsm                   1/1     Running            0          1d
mysql-5b7578d9f5-8mjld                                      1/1     Running            0          1d
notebooks-controller-9c5f6b7f5-t2xlh                        1/1     Running            0          1d
profiles-7bfcbd5f76-2ht9w                                   1/1     Running            0          1d
pytorch-operator-847d884f4d-cvwpm                           1/1     Running            0          1d
r-train-mfs75                                               0/1     Completed          0          23h
sk-train-svnwb                                              0/1     Completed          0          23h
spartakus-volunteer-7787b4cf54-z79tj                        1/1     Running            0          1d
studyjob-controller-5995857687-46xrn                        1/1     Running            0          1d
tf-job-dashboard-c899cd664-94wtf                            1/1     Running            0          1d
tf-job-operator-785546f859-rfzrm                            1/1     Running            0          1d
vizier-core-6d56d75f76-969ks                                1/1     Running            3          1d
vizier-core-rest-79bdbfbfb8-qnvz9                           1/1     Running            0          1d
vizier-db-79d57d5667-f7nst                                  1/1     Running            0          1d
vizier-suggestion-bayesianoptimization-759f6c56c8-54p6x     1/1     Running            0          1d
vizier-suggestion-grid-59f7f5646d-fqcfg                     1/1     Running            0          1d
vizier-suggestion-hyperband-84b8ddc658-xm9fb                1/1     Running            0          1d
vizier-suggestion-random-64b4467f6b-gptpl                   1/1     Running            0          1d
workflow-controller-8564bd964f-df7x2                        1/1     Running            0          1d

ukclivecox commented 5 years ago

I don't see the seldon cluster-manager. Did you install seldon as per the docs?
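
With this many pods it is easy to miss one by eye. A rough sketch of a filter over the text output of `kubectl get pods -n kubeflow` follows. The function name `find_pods` is hypothetical, and it assumes the single-namespace column layout (NAME, READY, STATUS, ...), so it would need adjusting for `--all-namespaces` output, where the namespace comes first.

```python
def find_pods(listing, needle):
    """Return (name, ready, status) for every pod whose name contains needle.

    Assumes the plain-text layout of `kubectl get pods -n <namespace>`,
    where the first three whitespace-separated columns are NAME, READY, STATUS.
    """
    matches = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank or truncated lines and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        if needle in name:
            matches.append((name, ready, status))
    return matches
```

An empty result from `find_pods(listing, "cluster-manager")` suggests the Seldon install did not bring the cluster manager up in that namespace.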

DavidLangworthy commented 5 years ago

Yes, but I gather it was not successful. I will try again.

Thank you

DavidLangworthy commented 5 years ago

The deployment worked this time and the cluster manager is up:

dlan@loadclient:~$ kubectl get pods --all-namespaces | grep seldon
kube-system   seldon-spartakus-volunteer-57647c7679-vb6pt          1/1   Running   0   1d
kubeflow      seldon-core-ambassador-6bb6fb974d-qwg79              1/1   Running   0   1m
kubeflow      seldon-core-redis-685dd67c95-grv2h                   1/1   Running   0   1m
kubeflow      seldon-core-seldon-cluster-manager-dd8497ccf-xtm46   1/1   Running   0   1m

However I am still getting an error calling the prediction service.

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

The port forward window gives me the following:

dlan@loadclient:~$ kubectl port-forward $(kubectl get pods -n kubeflow -l service=ambassador -o jsonpath='{.items[0].metadata.name}') -n kubeflow 8002:80
Forwarding from 127.0.0.1:8002 -> 80
Forwarding from [::1]:8002 -> 80
Handling connection for 8002
E0419 21:38:55.183309 12957 portforward.go:400] an error occurred forwarding 8002 -> 80: error forwarding port 80 to pod baa7cdd3e0fc3d4ce1d30ff49cd8602421ebce99f6895fdb5aa70e1e362051f9, uid : exit status 1: 2019/04/19 21:38:55 socat[9620] E connect(6, AF=2 127.0.0.1:80, 16): Connection refused
Handling connection for 8002
E0419 21:38:58.731598 12957 portforward.go:400] an error occurred forwarding 8002 -> 80: error forwarding port 80 to pod baa7cdd3e0fc3d4ce1d30ff49cd8602421ebce99f6895fdb5aa70e1e362051f9, uid : exit status 1: 2019/04/19 21:38:58 socat[9798] E connect(6, AF=2 127.0.0.1:80, 16): Connection refused
Handling connection for 8002
E0419 21:39:27.769533 12957 portforward.go:400] an error occurred forwarding 8002 -> 80: error forwarding port 80 to pod baa7cdd3e0fc3d4ce1d30ff49cd8602421ebce99f6895fdb5aa70e1e362051f9, uid : exit status 1: 2019/04/19 21:39:27 socat[10904] E connect(6, AF=2 127.0.0.1:80, 16): Connection refused

ukclivecox commented 5 years ago

OK. Can you check whether the Ambassador still exposes port 80 or has moved to 8080 now?

DavidLangworthy commented 5 years ago

I have two ambassadors:

ambassador               ClusterIP   10.0.233.236   80/TCP
seldon-core-ambassador   NodePort    10.0.158.182   80:30489/TCP, 443:31294/TCP

Thanks for your help.

ukclivecox commented 5 years ago

I would try connecting to both Ambassadors directly to see which one works, and also check the Ambassador diagnostics.
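
One way to compare the two could be a small probe run against each port-forwarded local port. This is only a sketch using the Python standard library: `probe_http` is a hypothetical helper, and the `/` path is a placeholder rather than a real Seldon or Ambassador route. It separates three cases: an HTTP answer, a connection that is accepted and then closed without a response (which is what the `RemoteDisconnected` error earlier in this thread indicates), and an outright refusal.

```python
import http.client


def probe_http(host, port, path="/", timeout=5.0):
    """Send one GET request and classify what happened at the connection level."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        # Any status code at all means something is listening and speaking HTTP.
        return f"HTTP {response.status}"
    except http.client.RemoteDisconnected:
        # Accepted, then closed with no reply. kubectl port-forward behaves
        # this way when the target port inside the pod refuses the connection.
        return "closed without response"
    except ConnectionRefusedError:
        # Nothing is listening on host:port at all.
        return "connection refused"
    except (OSError, http.client.HTTPException) as exc:
        return f"error: {exc}"
    finally:
        conn.close()
```

For example, with the first Ambassador forwarded to local port 8002 as above and the second forwarded to some other local port of your choosing, calling `probe_http("127.0.0.1", 8002)` and the equivalent for the second port shows which of the two actually answers.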

DavidLangworthy commented 5 years ago

I’ll try that.

Thank you


DavidLangworthy commented 5 years ago

I can hit the predictor directly and it works fine. The routes look fine in ambassador. However I do not see requests in the ambassador logs.

Any suggestions?

I'll keep looking around.

ukclivecox commented 5 years ago

Sorry, missed this. I don't think you will see requests in the Ambassador logs by default, as Ambassador doesn't log every request. Are the requests working?

DavidLangworthy commented 5 years ago

The requests were not working. I've recycled this cluster. I'll bring up a fresh one and see if there is a repro.

Thank you.