Getting error while running workflow on kubernetes

narendra36 commented 4 years ago

Hi,

I'm using minikube to run kubernetes in local system and trying to run workflow defined in demos/sklearn-pipe/sklearn-project.ipynb but getting the below error message.

Jupyter Cell:

artifact_path = path.abspath('./pipe/{{workflow.uid}}')

run_id = skproj.run(
    'main',
    arguments={}, 
    artifact_path=artifact_path, 
    dirty=True)

Error message: MaxRetryError: HTTPConnectionPool(host='ml-pipeline.default.svc.cluster.local', port=8888): Max retries exceeded with url: /apis/v1beta1/experiments (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fea36705a90>: Failed to establish a new connection: [Errno -2] Name or service not known'))

I have followed the instructions mentioned in below readme file https://github.com/mlrun/mlrun/blob/master/hack/local/README.md

Can anyone help me in resolving the error?

narendra36 commented 4 years ago

Hi, Do I also have to install and setup the Kubflow pipeline separately or it's installed by default with MLRun? Can anybody provide little bit more detail about setting up MLRun on local system?

As I'm using minikube to setup a kubernates cluster with single node.

What are the required installation as I have followed the instruction mentioned in https://github.com/mlrun/mlrun/blob/master/hack/local/README.md but still getting some errors?

narendra36 commented 4 years ago

Hi, @yaronha and @tebeka do I need to setup and install kubeflow on Kubernetes cluster separately because It is not mentioned anywhere and I'm getting above mentioned error while deploying the workflow pipeline on Kubernetes cluster?

Can you also please help me with queries asked in previous comment?

yaronha commented 4 years ago

@narendra36 you can start w/o Kubeflow, Kubeflow is needed for the Pipelines and CRDs like MPIJob and should be installed on the cluster

narendra36 commented 4 years ago

Hi @yaronha, thanks for the reply. I have installed the Kubeflow pipeline on my GCP Kubernetes cluster and able to solve above error but still I'm getting 2 major issues while deploying a sample sklearn-pipeline

Issue 1:

Screenshot from 2020-05-02 21-45-03

The above cell in the image is running successfully and also checked that artifact dataset file created successfully but when I'm running describe function cell it is giving me the following error. Screenshot from 2020-05-02 21-57-51

Can you please tell me what am I missing here? I have checked that file exists but the error message is that the dataset file doesn't exist.

Issue 2:

When I deploy the Kubeflow pipeline there is no error message but the pod for running to executing the pipeline is failed. error logs is here. the URL mentioned in the output of the cell is giving 404.

Cell output:

Screenshot from 2020-05-02 22-07-56

Kubernetes cluster error log:
Screenshot from 2020-05-02 22-08-54

Here are the details of the error logs of the failed pod for pipeline deployment. https://paste.ubuntu.com/p/xDRnfdFYzt/

=========================================================

Please help me to get resolve these errors because I'm unable to explore MLRun further without resolving these errors. Thanks in advance :smile:

@yaronha and @tebeka, Can anyone provide me the direction to solve these issues, and what could be the cause?

ghost commented 4 years ago

for issue 1 try (if you followed the instructions):

from mlrun.platforms import mount_pvc
fn.apply(mount_pvc("nfsvol", "nfsvol", "/home/joyan/data"))

narendra36 commented 4 years ago

Hi @yjb-ds, Yes. issue 1 is fixed with the changes mentioned by you. I think there should be more details about a function like mount_pvc etc. Thank you for your help :)

I also have fixed this thing in pipeline workflow code but issue 2 still persists. I have followed the instructions but don't know what else I'm missing.

This is the error msg from failed pod: [sh -c docker cp -a a586b7ed68c4e75ea4d2ee204bb1ebd1b5847ac187dd92b3d3105c5c7af9237d:/tmp/image - | tar -ax -O] stderr:\nError: No such container:path: a586b7ed68c4e75ea4d2ee204bb1ebd1b5847ac187dd92b3d3105c5c7af9237d:/tmp/image\ntar: This does not look like a tar archive\ntar: Exiting with failure status due to previous errors\n"

Detailed logs of pipeline workflow pod are here.

@yjb-ds, Is the above-mentioned changes will work for pipeline workflow also?

Screenshot from 2020-05-03 01-10-42

yaronha commented 4 years ago

@narendra36 looks like the builder pod (the first step) failed to store the image , can you share the log of the first step (in kubeflow pipeline UI), i guess it may be related to the build/registry setup (you need to configure the docker registry)

ghost commented 4 years ago

If it is the docker setup, then you might try the following:

get an access token for your docker hub account

copy it into the mlrun-local.yaml file:

    - name: DEFAULT_DOCKER_REGISTRY
      value: "https://index.docker.io/v1/"
    - name: DEFAULT_DOCKER_SECRET
      value: "<your-access-token>"

create a docker secret in your cluster:

kubectl create -n kubeflow secret docker-registry my-docker     --docker-server=https://index.docker.io/v1/  --docker-username=<docker user> --docker-password=<docker acces token> --docker-email=<email>

restart the mlrun-api

narendra36 commented 4 years ago

Hi @yaronha, as you rightly said it is the problem with the docker registry. I have already setup a docker hub account and added an access token while creating my-docker secret but the error remains the same.

I also have followed the instructions mentioned by @yjb-ds but the issue is still there.

@yjb-ds, Should the value of DEFAULT_DOCKER_SECRET be the name of created docker secret 'my-docker' or access token from docker hub?

        - name: DEFAULT_DOCKER_SECRET
          value: "my-docker"

@yaronha, Here are the logs of the first step of the pipeline:

Screenshot from 2020-05-03 13-50-33

Screenshot from 2020-05-03 14-38-35

Here is the my mlrun-local.yaml

yaronha commented 4 years ago

@narendra36 you should set both DEFAULT_DOCKER_REGISTRY (url, e.g. https://index.docker.io/v1/) and DEFAULT_DOCKER_SECRET (k8s secret name, in the same namespace), another way is to set the registry per function, add this method to your function object fn.build_config(image='target/image:tag', secret='my_docker')

narendra36 commented 4 years ago

@yaronha, thanks for your quick reply.

I have tried again by setting the registry per function as you suggested and the first step works fine but when it comes to 2nd step 'get-data' it is giving some kind of error with the reason of forbidden. I think it's trying to create a resource pod on the default-tenant namespace. But my complete deployment is under kubeflow namespace and I also have set namespace parameter to kubeflow while running the pipeline as below.

artifact_path = path.abspath('./pipe/{{workflow.uid}}')
run_id = skproj.run(
    'main',
    arguments={}, 
    artifact_path=artifact_path,
    dirty=True,
    namespace='kubeflow')

Error logs from Kubeflow pipeline UI:

[mlrun] 2020-05-03 14:48:19,149 failed to create pod: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '778b1a74-ddc3-415e-89da-582c2e13b3fa', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Sun, 03 May 2020 14:48:19 GMT', 'Content-Length': '300'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \"system:serviceaccount:kubeflow:pipeline-runner\" cannot create resource \"pods\" in API group \"\" in the namespace \"default-tenant\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}
[mlrun] 2020-05-03 14:48:19,177 run executed, status=error
runtime error: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '778b1a74-ddc3-415e-89da-582c2e13b3fa', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Sun, 03 May 2020 14:48:19 GMT', 'Content-Length': '300'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \"system:serviceaccount:kubeflow:pipeline-runner\" cannot create resource \"pods\" in API group \"\" in the namespace \"default-tenant\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}
runtime error: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '778b1a74-ddc3-415e-89da-582c2e13b3fa', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Sun, 03 May 2020 14:48:19 GMT', 'Content-Length': '300'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \"system:serviceaccount:kubeflow:pipeline-runner\" cannot create resource \"pods\" in API group \"\" in the namespace \"default-tenant\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}

Screenshot from 2020-05-03 20-29-37

@yjb-ds @yaronha What could be the fix for it? :smile:

yaronha commented 4 years ago

@narendra36 we are getting there :) , im adding more doc based on those issues

you should set the default mlrun namespace (set mlconf.namespace = 'kubeflow') and pods will be created on that one, may also need to set it in the service (env var MLRUN_NAMESPACE)

narendra36 commented 4 years ago

Hi @yaronha, Yes please add more docs regarding the issues and setup. I will also share the installation details once it has been set up completely.

Hopefully, we will be there soon :)

I have configured namespace as mentioned by you but no luck the same error still persists. What could be the other reason?

I have looked for the namespace value 'default-tenant' and found that one of the mentioned is inside the runtimes/mpijob.py file.

yaronha commented 4 years ago

@narendra36 i found an issue with kfp namespaces, issued a fix, we have a new ver w many new features in 1-2 days will have it in

u can try using the dev branch:

!pip uninstall mlrun -y
!pip install git+https://github.com/mlrun/mlrun@development

narendra36 commented 4 years ago

Hi @yaronha and @yjb-ds, that's good to know. But for now, I have installed the mlrun from the development branch as suggested by @yaronha.

When I start the workflow pipeline to execute, kubeflow pipeline UI and Kubernetes pod status showing below progress: Step 1 of the workflow pipeline (deploy-gen-iris): successfully completed.

Here are the logs for the first pod which completed without any error.

Screenshot from 2020-05-05 18-45-50

Step 2 of the workflow pipeline (get-data):

pod error: Failed to pull image "dodwaria/mlrun-test:latest": rpc error: code = Unknown desc = Error response from daemon: manifest for dodwaria/mlrun-test:latest not found

Screenshot from 2020-05-05 19-03-36

Screenshot from 2020-05-05 19-01-01

Further steps are failing with below errors:

Screenshot from 2020-05-05 19-01-08

I have checked the requested container image with the latest tag on the docker hub and the container image does not exist there. But the last access of token is also getting updated which means access to docker hub working fine but why the image built in the first step is not uploaded on docker hub?

I also have checked the pod logs for the first step which is not showing any docker command to push the image to dockerhub but it includes some minio-service calls (which was being set up while configuring kubeflow pipeline service). Pod logs are Here

how can I fix it? Here is my updated Jupyter file please review it once. I might be missing something. sklearn-pipeline-demo

Thanks for helping! :)

yaronha commented 4 years ago

@narendra36 u can see in the log that your server is 0.4.6, i suggest updating the docker to mlrun/mlrun-api:0.4.7 to match the client

yaronha commented 4 years ago

@narendra36 i see the error, u should re-pull the image mlrun/ml-models:0.4.7, it was broken at a certain point (missing the package)

yaronha commented 4 years ago

@narendra36 mlrun 0.4.7 is now released, i suggest re-install the packages & containers

narendra36 commented 4 years ago

Hi @yaronha, thanks for the help. Somehow, I managed to get there till the model serving step. :smile: But getting error in model serving step. Can you suggest what could be the reason for it?

[mlrun] 2020-05-15 19:01:56,318 deploy started
deploy error: HTTPConnectionPool(host='localhost', port=8070): Max retries exceeded with url: /api/projects (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f75598daf60>: Failed to establish a new connection: [Errno 111] Connection refused',))

Screenshot from 2020-05-16 01-01-24

yaronha commented 4 years ago

@narendra36 the serving part is done with Nuclio serverless engine, did you install Nuclio on your cluster? you would need to specify the nuclio service (dashboard) url in the serving function deploy_step()

narendra36 commented 4 years ago

Hi @yaronha, @yjb-ds I have deployed Nuclio on my cluster and tried again.

Observations:

The model has been deployed successfully and able to test it from the Nuclio UI dashboard.
deploy-mode-server-v2 step returning the endpoint URL, of the deployed model, to model tester step is: http://localhost:31844

Screenshot from 2020-05-24 14-52-34

But when I see actual deployment endpoint which is: http://nuclio-sk-project-mode-server-v2.kubeflow:8080

Screenshot from 2020-05-24 14-50-11

because of the mismatch of the endpoints the final model-tester step producing the error: Max retries exceeded with url http://localhost:31844

Issues:

How can I match up these two URLs? Do I have to hard code the actual URL of the model serving http://nuclio-sk-project-mode-server-v2.kubeflow:8080?
When I retry the pipeline, Nuclio try to deploy 8 replica pod ( for model serving ) by default which exceeds my resources uses. How can I control that from the workflow pipeline code?

yaronha commented 4 years ago

@narendra36 good to see we are in the final step, the problem is that Nuclio external IP was not configured, u need to configure that either duting installation (when using Helm: --set dashboard.externalIPAddresses ["ip-1", "ip-2", "...etc"])

or by adding the env var to the dashboard deployment NUCLIO_DASHBOARD_EXTERNAL_IP_ADDRESSES with the host ip/name

BTW did u set the dashbooard address in the deploy_step() or it worked without it (it should work automatically if the dashboard service is called nuclio-dashboard in the same namespace with the pipeline pod)

narendra36 commented 4 years ago

Hi @yaronha, Thanks for the quick reply. :smile: I have deployed Nuclio under the same namespace and with the nuclio-dashboard name so It worked without adding deploy_step().

So do I still have to set the above-mentioned environment variable and How should I add it while installing Nuclio or mlrun? I didn't understand.

yaronha commented 4 years ago

@narendra36 its part of nuclio installation (tells nuclio what is the external address to the dashboard & ingress), u can change/set the env var in the nuclio-dashboard deployment, with new installs u do it in the nuclio helm values

Hedingber commented 3 years ago

Issue seems to be resolved, closing

Vin-itall commented 3 years ago

@narendra36 you should set both DEFAULT_DOCKER_REGISTRY (url, e.g. https://index.docker.io/v1/) and DEFAULT_DOCKER_SECRET (k8s secret name, in the same namespace), another way is to set the registry per function, add this method to your function object fn.build_config(image='target/image:tag', secret='my_docker')

I have these set but I am still facing this issue.

Screenshot 2021-02-05 at 10 02 31 AM

kubectl logs demo-training-pipeline-bn97r-3933567025 -n kubeflow wait

time="2021-02-05T16:43:48Z" level=info msg="Starting Workflow Executor" version=v2.7.5+ede163e.dirty
time="2021-02-05T16:43:49Z" level=info msg="Creating PNS executor (namespace: kubeflow, pod: demo-training-pipeline-bn97r-3933567025, pid: 6, hasOutputs: true)"
time="2021-02-05T16:43:49Z" level=info msg="Executor (version: v2.7.5+ede163e.dirty, build_date: 2020-04-21T01:12:08Z) initialized (pod: kubeflow/demo-training-pipeline-bn97r-3933567025) with template:\n{\"name\":\"deploy-gen-iris\",\"arguments\":{},\"inputs\":{},\"outputs\":{\"parameters\":[{\"name\":\"deploy-gen-iris-image\",\"valueFrom\":{\"path\":\"/tmp/image\"}}],\"artifacts\":[{\"name\":\"deploy-gen-iris-image\",\"path\":\"/tmp/image\"},{\"name\":\"deploy-gen-iris-state\",\"path\":\"/tmp/state\"}]},\"metadata\":{\"annotations\":{\"sidecar.istio.io/inject\":\"false\"},\"labels\":{\"pipelines.kubeflow.org/cache_enabled\":\"true\"}},\"container\":{\"name\":\"\",\"image\":\"mlrun/mlrun:0.5.5-rc3\",\"command\":[\"python\",\"-m\",\"mlrun\",\"build\",\"--kfp\",\"-r\",\"{'kind': 'job', 'metadata': {'name': 'gen-iris', 'tag': '', 'project': 'sk-project'}, 'spec': {'command': '', 'args': [], 'volumes': [{'name': 'pvc-55208e8c-6cf1-483f-a107-bea804c96384', 'persistentVolumeClaim': {'claimName': 'mlrun-kit-jupyter-pvc'}}], 'volume_mounts': [{'mountPath': '/home/jovyan/data', 'name': 'pvc-55208e8c-6cf1-483f-a107-bea804c96384'}], 'env': [], 'default_handler': '', 'entry_points': {'iris_generator': {'name': 'iris_generator', 'doc': '', 'parameters': [{'name': 'context', 'default': ''}, {'name': 'format', 'default': 'csv'}], 'outputs': [{'default': ''}], 'lineno': 11}}, 'description': '', 'build': {'functionSourceCode': 'IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlcgoKaW1wb3J0IG9zCmZyb20gc2tsZWFybi5kYXRhc2V0cyBpbXBvcnQgbG9hZF9pcmlzCmZyb20gc2tsZWFybi5tb2RlbF9zZWxlY3Rpb24gaW1wb3J0IHRyYWluX3Rlc3Rfc3BsaXQKaW1wb3J0IG51bXB5IGFzIG5wCmZyb20gc2tsZWFybi5tZXRyaWNzIGltcG9ydCBhY2N1cmFjeV9zY29yZQpmcm9tIG1scnVuLmFydGlmYWN0cyBpbXBvcnQgVGFibGVBcnRpZmFjdCwgUGxvdEFydGlmYWN0CmltcG9ydCBwYW5kYXMgYXMgcGQKCmRlZiBpcmlzX2dlbmVyYXRvcihjb250ZXh0LCBmb3JtYXQ9J2NzdicpOgogICAgaXJpcyA9IGxvYWRfaXJpcygpCiAgICBpcmlzX2RhdGFzZXQgPSBwZC5EYXRhRnJhbWUoZGF0YT1pcmlzLmRhdGEsIGNvbHVtbnM9aXJpcy5mZWF0dXJlX25hbWVzKQogICAgaXJpc19sYWJlbHMgPSBwZC5EYXRhRnJhbWUoZGF0YT1pcmlzLnRhcmdldCwgY29sdW1ucz1bJ2xhYmVsJ10pCiAgICBpcmlzX2RhdGFzZXQgPSBwZC5jb25jYXQoW2lyaXNfZGF0YXNldCwgaXJpc19sYWJlbHNdLCBheGlzPTEpCiAgICAKICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ3NhdmluZyBpcmlzIGRhdGFmcmFtZSB0byB7fScuZm9ybWF0KGNvbnRleHQuYXJ0aWZhY3RfcGF0aCkpCiAgICBjb250ZXh0LmxvZ19kYXRhc2V0KCdpcmlzX2RhdGFzZXQnLCBkZj1pcmlzX2RhdGFzZXQsIGZvcm1hdD1mb3JtYXQsIGluZGV4PUZhbHNlKQoK', 'base_image': 'mlrun/mlrun', 'commands': ['pip install sklearn', 'pip install pyarrow']}}}\",\"--with_mlrun\",\"--skip\"],\"env\":[{\"name\":\"DEFAULT_DOCKER_REGISTRY\",\"value\":\"index.docker.io/falkonryml\"},{\"name\":\"MLRUN_NAMESPACE\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.namespace\"}}},{\"name\":\"MLRUN_ARTIFACT_PATH\",\"value\":\"/home/jovyan/demos/scikit-learn-pipeline/pipe/2536a5ff-5740-44d8-96f8-26316be0611c\"}],\"resources\":{}},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/demo-training-pipeline-bn97r/demo-training-pipeline-bn97r-3933567025\"}}}"
time="2021-02-05T16:43:49Z" level=info msg="Waiting on main container"
time="2021-02-05T16:43:49Z" level=warning msg="Polling root processes (1m0s)"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484013 {990000000 63747380757 0x2d10800} {2067 96 19 16749 0 0 0 0 4096 4096 8 {1611783965 816134965} {1611783957 990000000} {1611783957 990000000} [0 0 0]}}"
time="2021-02-05T16:43:49Z" level=info msg="Secured filehandle on /proc/23/root"
time="2021-02-05T16:43:49Z" level=info msg="containerID c1afed5dcc0e0e4224c5587dc5698e96c270a652199b4fadbdf788455e0d14e4 mapped to pid 23"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 136093600} {1612543429 136093600} {1612543429 467100230} [0 0 0]}}"
time="2021-02-05T16:43:49Z" level=info msg="Secured filehandle on /proc/23/root"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 136093600} {1612543429 136093600} {1612543429 467100230} [0 0 0]}}"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 592102734} [0 0 0]}}"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 592102734} [0 0 0]}}"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 592102734} [0 0 0]}}"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 592102734} [0 0 0]}}"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 805107001} [0 0 0]}}"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 805107001} [0 0 0]}}"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 805107001} [0 0 0]}}"
time="2021-02-05T16:43:49Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 805107001} [0 0 0]}}"
time="2021-02-05T16:43:50Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 805107001} [0 0 0]}}"
time="2021-02-05T16:43:50Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 805107001} [0 0 0]}}"
time="2021-02-05T16:43:50Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 805107001} [0 0 0]}}"
time="2021-02-05T16:43:50Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 805107001} [0 0 0]}}"
time="2021-02-05T16:43:50Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 805107001} [0 0 0]}}"
time="2021-02-05T16:43:50Z" level=info msg="main container started with container ID: c1afed5dcc0e0e4224c5587dc5698e96c270a652199b4fadbdf788455e0d14e4"
time="2021-02-05T16:43:50Z" level=info msg="Starting annotations monitor"
time="2021-02-05T16:43:50Z" level=info msg="pid 23: &{root 4096 2147484141 {136093600 63748140229 0x2d10800} {2097271 11666555 1 16877 0 0 0 0 4096 4096 8 {1612543429 593102754} {1612543429 136093600} {1612543429 805107001} [0 0 0]}}"
time="2021-02-05T16:43:50Z" level=info msg="Main pid identified as 23"
time="2021-02-05T16:43:50Z" level=info msg="Successfully secured file handle on main container root filesystem"
time="2021-02-05T16:43:50Z" level=info msg="Waiting for main pid 23 to complete"
time="2021-02-05T16:43:50Z" level=info msg="Starting deadline monitor"
time="2021-02-05T16:43:50Z" level=info msg="Stopped root processes polling due to successful securing of main root fs"
time="2021-02-05T16:44:00Z" level=info msg="/argo/podmetadata/annotations updated"
time="2021-02-05T16:44:01Z" level=info msg="Main pid 23 completed"
time="2021-02-05T16:44:01Z" level=info msg="Main container completed"
time="2021-02-05T16:44:01Z" level=info msg="Saving logs"
time="2021-02-05T16:44:01Z" level=info msg="Annotations monitor stopped"
time="2021-02-05T16:44:01Z" level=info msg="Deadline monitor stopped"
time="2021-02-05T16:44:01Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/demo-training-pipeline-bn97r/demo-training-pipeline-bn97r-3933567025/main.log"
time="2021-02-05T16:44:01Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2021-02-05T16:44:01Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/demo-training-pipeline-bn97r/demo-training-pipeline-bn97r-3933567025/main.log)"
time="2021-02-05T16:44:01Z" level=info msg="Saving output parameters"
time="2021-02-05T16:44:01Z" level=info msg="Saving path output parameter: deploy-gen-iris-image"
time="2021-02-05T16:44:01Z" level=info msg="Copying /tmp/image from base image layer"
time="2021-02-05T16:44:01Z" level=error msg="executor error: open /tmp/image: no such file or directory"
time="2021-02-05T16:44:01Z" level=info msg="Killing sidecars"
time="2021-02-05T16:44:01Z" level=info msg="Alloc=5083 TotalAlloc=13039 Sys=71104 NumGC=4 Goroutines=14"
time="2021-02-05T16:44:01Z" level=fatal msg="open /tmp/image: no such file or directory"

kubectl logs demo-training-pipeline-bn97r-3933567025 -n kubeflow main

Runtime:
{'kind': 'job',
 'metadata': {'name': 'gen-iris', 'project': 'sk-project', 'tag': ''},
 'spec': {'args': [],
          'build': {'base_image': 'mlrun/mlrun',
                    'commands': ['pip install sklearn', 'pip install pyarrow'],
                    'functionSourceCode': 'IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlcgoKaW1wb3J0IG9zCmZyb20gc2tsZWFybi5kYXRhc2V0cyBpbXBvcnQgbG9hZF9pcmlzCmZyb20gc2tsZWFybi5tb2RlbF9zZWxlY3Rpb24gaW1wb3J0IHRyYWluX3Rlc3Rfc3BsaXQKaW1wb3J0IG51bXB5IGFzIG5wCmZyb20gc2tsZWFybi5tZXRyaWNzIGltcG9ydCBhY2N1cmFjeV9zY29yZQpmcm9tIG1scnVuLmFydGlmYWN0cyBpbXBvcnQgVGFibGVBcnRpZmFjdCwgUGxvdEFydGlmYWN0CmltcG9ydCBwYW5kYXMgYXMgcGQKCmRlZiBpcmlzX2dlbmVyYXRvcihjb250ZXh0LCBmb3JtYXQ9J2NzdicpOgogICAgaXJpcyA9IGxvYWRfaXJpcygpCiAgICBpcmlzX2RhdGFzZXQgPSBwZC5EYXRhRnJhbWUoZGF0YT1pcmlzLmRhdGEsIGNvbHVtbnM9aXJpcy5mZWF0dXJlX25hbWVzKQogICAgaXJpc19sYWJlbHMgPSBwZC5EYXRhRnJhbWUoZGF0YT1pcmlzLnRhcmdldCwgY29sdW1ucz1bJ2xhYmVsJ10pCiAgICBpcmlzX2RhdGFzZXQgPSBwZC5jb25jYXQoW2lyaXNfZGF0YXNldCwgaXJpc19sYWJlbHNdLCBheGlzPTEpCiAgICAKICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ3NhdmluZyBpcmlzIGRhdGFmcmFtZSB0byB7fScuZm9ybWF0KGNvbnRleHQuYXJ0aWZhY3RfcGF0aCkpCiAgICBjb250ZXh0LmxvZ19kYXRhc2V0KCdpcmlzX2RhdGFzZXQnLCBkZj1pcmlzX2RhdGFzZXQsIGZvcm1hdD1mb3JtYXQsIGluZGV4PUZhbHNlKQoK'},
          'command': '',
          'default_handler': '',
          'description': '',
          'entry_points': {'iris_generator': {'doc': '',
                                              'lineno': 11,
                                              'name': 'iris_generator',
                                              'outputs': [{'default': ''}],
                                              'parameters': [{'default': '',
                                                              'name': 'context'},
                                                             {'default': 'csv',
                                                              'name': 'format'}]}},
          'env': [],
          'volume_mounts': [{'mountPath': '/home/jovyan/data',
                             'name': 'pvc-55208e8c-6cf1-483f-a107-bea804c96384'}],
          'volumes': [{'name': 'pvc-55208e8c-6cf1-483f-a107-bea804c96384',
                       'persistentVolumeClaim': {'claimName': 'mlrun-kit-jupyter-pvc'}}]}}
> 2021-02-05 16:43:51,999 [info] remote deployment started
> 2021-02-05 16:43:51,999 [error] database connection is not configured
> 2021-02-05 16:43:51,999 [info] building image (.falkonryml/func-sk-project-gen-iris-latest)
FROM mlrun/mlrun:0.5.5-rc3
RUN pip install sklearn
RUN pip install pyarrow

> 2021-02-05 16:43:52,000 [info] using in-cluster config.
> 2021-02-05 16:43:52,019 [info] Pod mlrun-build-gen-iris-t2zh7 created
...
E0205 16:43:57.851534       1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
    For verbose messaging see aws.Config.CredentialsChainVerboseErrors
error checking push permissions -- make sure you entered the correct tag name, and that you are authenticated correctly, and try again: checking push permission for "index.docker.io/falkonryml/func-sk-project-gen-iris-latest": POST https://index.docker.io/v2/falkonryml/func-sk-project-gen-iris-latest/blobs/uploads/: UNAUTHORIZED: authentication required; [map[Action:pull Class: Name:falkonryml/func-sk-project-gen-iris-latest Type:repository] map[Action:push Class: Name:falkonryml/func-sk-project-gen-iris-latest Type:repository]]
> 2021-02-05 16:43:59,926 [error] pod exited with error
> 2021-02-05 16:43:59,927 [info] build completed with failed
deploy error,  build failed!

mlrun / mlrun

Getting error while running workflow on kubernetes #244

Issue 1:

Issue 2: