kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

failed to save outputs: Error response from daemon: No such container #1471

Closed PascalSchroederDE closed 5 years ago

PascalSchroederDE commented 5 years ago

While trying to set up my own Kubeflow pipeline, I ran into a problem when a step is finished and its outputs should be saved. After the step finishes, Kubeflow always throws an error with the message: This step is in Error state with this message: failed to save outputs: Error response from daemon: No such container: <container-id>

At first I thought I had made a mistake in my pipeline, but the same thing happens with the preexisting sample pipelines, e.g. for "[Sample] Basic - Conditional execution" I get this message after the first step (flip-coin) is finished.

The main container shows the following output:

heads

So it seems to have run successfully.

The wait container shows the following output:

time="2019-06-07T11:41:35Z" level=info msg="Creating a docker executor"
time="2019-06-07T11:41:35Z" level=info msg="Executor (version: v2.2.0, build_date: 2018-08-30T08:52:54Z) initialized with template:\narchiveLocation:\n  s3:\n    accessKeySecret:\n      key: accesskey\n      name: mlpipeline-minio-artifact\n    bucket: mlpipeline\n    endpoint: minio-service.kubeflow:9000\n    insecure: true\n    key: artifacts/conditional-execution-pipeline-vmdhx/conditional-execution-pipeline-vmdhx-2104306666\n    secretKeySecret:\n      key: secretkey\n      name: mlpipeline-minio-artifact\ncontainer:\n  args:\n  - python -c \"import random; result = 'heads' if random.randint(0,1) == 0 else 'tails';\n    print(result)\" | tee /tmp/output\n  command:\n  - sh\n  - -c\n  image: python:alpine3.6\n  name: \"\"\n  resources: {}\ninputs: {}\nmetadata: {}\nname: flip-coin\noutputs:\n  artifacts:\n  - name: mlpipeline-ui-metadata\n    path: /mlpipeline-ui-metadata.json\n  - name: mlpipeline-metrics\n    path: /mlpipeline-metrics.json\n  parameters:\n  - name: flip-coin-output\n    valueFrom:\n      path: /tmp/output\n"
time="2019-06-07T11:41:35Z" level=info msg="Waiting on main container"
time="2019-06-07T11:41:36Z" level=info msg="main container started with container ID: 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c"
time="2019-06-07T11:41:36Z" level=info msg="Starting annotations monitor"
time="2019-06-07T11:41:36Z" level=info msg="docker wait 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c"
time="2019-06-07T11:41:36Z" level=info msg="Starting deadline monitor"
time="2019-06-07T11:41:37Z" level=error msg="`docker wait 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c` failed: Error response from daemon: No such container: 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c\n"
time="2019-06-07T11:41:37Z" level=info msg="Main container completed"
time="2019-06-07T11:41:37Z" level=info msg="No sidecars"
time="2019-06-07T11:41:37Z" level=info msg="Saving output artifacts"
time="2019-06-07T11:41:37Z" level=info msg="Annotations monitor stopped"
time="2019-06-07T11:41:37Z" level=info msg="Saving artifact: mlpipeline-ui-metadata"
time="2019-06-07T11:41:37Z" level=info msg="Archiving 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c:/mlpipeline-ui-metadata.json to /argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2019-06-07T11:41:37Z" level=info msg="sh -c docker cp -a 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c:/mlpipeline-ui-metadata.json - | gzip > /argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2019-06-07T11:41:37Z" level=info msg="Archiving completed"
time="2019-06-07T11:41:37Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2019-06-07T11:41:37Z" level=info msg="Saving from /argo/outputs/artifacts/mlpipeline-ui-metadata.tgz to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/conditional-execution-pipeline-vmdhx/conditional-execution-pipeline-vmdhx-2104306666/mlpipeline-ui-metadata.tgz)"
time="2019-06-07T11:41:37Z" level=info msg="Successfully saved file: /argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2019-06-07T11:41:37Z" level=info msg="Saving artifact: mlpipeline-metrics"
time="2019-06-07T11:41:37Z" level=info msg="Archiving 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c:/mlpipeline-metrics.json to /argo/outputs/artifacts/mlpipeline-metrics.tgz"
time="2019-06-07T11:41:37Z" level=info msg="sh -c docker cp -a 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c:/mlpipeline-metrics.json - | gzip > /argo/outputs/artifacts/mlpipeline-metrics.tgz"
time="2019-06-07T11:41:37Z" level=info msg="Archiving completed"
time="2019-06-07T11:41:37Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2019-06-07T11:41:37Z" level=info msg="Saving from /argo/outputs/artifacts/mlpipeline-metrics.tgz to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/conditional-execution-pipeline-vmdhx/conditional-execution-pipeline-vmdhx-2104306666/mlpipeline-metrics.tgz)"
time="2019-06-07T11:41:37Z" level=info msg="Successfully saved file: /argo/outputs/artifacts/mlpipeline-metrics.tgz"
time="2019-06-07T11:41:37Z" level=info msg="Saving output parameters"
time="2019-06-07T11:41:37Z" level=info msg="Saving path output parameter: flip-coin-output"
time="2019-06-07T11:41:37Z" level=info msg="[sh -c docker cp -a 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c:/tmp/output - | tar -ax -O]"
time="2019-06-07T11:41:37Z" level=error msg="`[sh -c docker cp -a 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c:/tmp/output - | tar -ax -O]` stderr:\nError: No such container:path: 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c:/tmp/output\ntar: This does not look like a tar archive\ntar: Exiting with failure status due to previous errors\n"
time="2019-06-07T11:41:37Z" level=info msg="Alloc=4338 TotalAlloc=11911 Sys=10598 NumGC=4 Goroutines=11"
time="2019-06-07T11:41:37Z" level=fatal msg="exit status 2\ngithub.com/argoproj/argo/errors.Wrap\n\t/root/go/src/github.com/argoproj/argo/errors/errors.go:87\ngithub.com/argoproj/argo/errors.InternalWrapError\n\t/root/go/src/github.com/argoproj/argo/errors/errors.go:70\ngithub.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).GetFileContents\n\t/root/go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:40\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveParameters\n\t/root/go/src/github.com/argoproj/argo/workflow/executor/executor.go:343\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:49\ngithub.com/argoproj/argo/cmd/argoexec/commands.glob..func4\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:19\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).execute\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:766\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:852\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:800\nmain.main\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:198\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2361"

So it seems there is a problem with either Kubeflow or my Docker daemon. The output of kubectl describe pods for the created pod is the following:

Name:               conditional-execution-pipeline-vmdhx-2104306666
Namespace:          kubeflow
Priority:           0
PriorityClassName:  <none>
Node:               root-nuc8i5beh/9.233.5.90
Start Time:         Fri, 07 Jun 2019 13:41:29 +0200
Labels:             workflows.argoproj.io/completed=true
                    workflows.argoproj.io/workflow=conditional-execution-pipeline-vmdhx
Annotations:        workflows.argoproj.io/node-message:
                      Error response from daemon: No such container: 7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c
                    workflows.argoproj.io/node-name: conditional-execution-pipeline-vmdhx.flip-coin
                    workflows.argoproj.io/template:
                      {"name":"flip-coin","inputs":{},"outputs":{"parameters":[{"name":"flip-coin-output","valueFrom":{"path":"/tmp/output"}}],"artifacts":[{"na...
Status:             Failed
IP:                 10.1.1.30
Controlled By:      Workflow/conditional-execution-pipeline-vmdhx
Containers:
  main:
    Container ID:  containerd://7e3064415736db584cac5598a2b2a28728e11c03014ac67a05d008ad8119b13c
    Image:         python:alpine3.6
    Image ID:      docker.io/library/python@sha256:766a961bf699491995cc29e20958ef11fd63741ff41dcc70ec34355b39d52971
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      python -c "import random; result = 'heads' if random.randint(0,1) == 0 else 'tails'; print(result)" | tee /tmp/output
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 07 Jun 2019 13:41:35 +0200
      Finished:     Fri, 07 Jun 2019 13:41:35 +0200
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from pipeline-runner-token-xh2p7 (ro)
  wait:
    Container ID:  containerd://f0449dc70c0a651c09aeb883edda9ce0ec5e415fa15a5468fe5b360fb06637c2
    Image:         argoproj/argoexec:v2.2.0
    Image ID:      docker.io/argoproj/argoexec@sha256:eea81e0b0d8899a0b7f9815c9c7bd89afa73ab32e5238430de82342b3bb7674a
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
    Args:
      wait
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 07 Jun 2019 13:41:35 +0200
      Finished:     Fri, 07 Jun 2019 13:41:37 +0200
    Ready:          False
    Restart Count:  0
    Environment:
      ARGO_POD_NAME:  conditional-execution-pipeline-vmdhx-2104306666 (v1:metadata.name)
    Mounts:
      /argo/podmetadata from podmetadata (rw)
      /var/lib/docker from docker-lib (ro)
      /var/run/docker.sock from docker-sock (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from pipeline-runner-token-xh2p7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  podmetadata:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  docker-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker
    HostPathType:  Directory
  docker-sock:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  Socket
  pipeline-runner-token-xh2p7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  pipeline-runner-token-xh2p7
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From                     Message
  ----    ------     ----   ----                     -------
  Normal  Scheduled  8m1s   default-scheduler        Successfully assigned kubeflow/conditional-execution-pipeline-vmdhx-2104306666 to root-nuc8i5beh
  Normal  Pulling    8m1s   kubelet, root-nuc8i5beh  Pulling image "python:alpine3.6"
  Normal  Pulled     7m56s  kubelet, root-nuc8i5beh  Successfully pulled image "python:alpine3.6"
  Normal  Created    7m56s  kubelet, root-nuc8i5beh  Created container main
  Normal  Started    7m55s  kubelet, root-nuc8i5beh  Started container main
  Normal  Pulled     7m55s  kubelet, root-nuc8i5beh  Container image "argoproj/argoexec:v2.2.0" already present on machine
  Normal  Created    7m55s  kubelet, root-nuc8i5beh  Created container wait
  Normal  Started    7m55s  kubelet, root-nuc8i5beh  Started container wait

So is there perhaps a problem with the argoexec container image? I see it tries to mount /var/run/docker.sock. When I try to read this file with cat I get "No such device or address", even though I can see the file with ls /var/run. When I try to open it with vi it says permission denied, so I cannot look inside the file. Is this the usual behavior for this file, or does it look like there is a problem with it?
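For what it's worth, "No such device or address" from cat is the normal result of reading a Unix socket, so that by itself is not a problem. A quick way to check whether a Docker daemon is actually answering on the socket (assuming curl is installed; run this on the host, not inside a pod) is:

ls -l /var/run/docker.sock    # a working socket shows "s" as the first character of the mode
sudo curl --silent --unix-socket /var/run/docker.sock http://localhost/version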

I would really appreciate any help I can get! Thank you guys!

Ark-kun commented 5 years ago

What is your environment? Are you using GKE? Is it reproducible on your side? Can you try Argo's coin flip sample?
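For reference, the coin-flip sample can be submitted directly with the Argo CLI; the URL below assumes the examples directory of the argoproj/argo repository and that pipeline workflows run in the kubeflow namespace:

argo submit --watch -n kubeflow https://raw.githubusercontent.com/argoproj/argo/master/examples/coinflip.yaml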

magreenberg1 commented 5 years ago

I'm having the exact same problem with all of the Basic Samples. I'm running Kubeflow on top of microk8s on a local machine.

Every time I try to run one of the samples I get: This step is in Error state with this message: failed to save outputs: Error response from daemon: No such container

And my output of kubectl describe pods is the same as the one above.

Ark-kun commented 5 years ago

This is an upstream issue: https://github.com/kubeflow/kubeflow/issues/2347 https://github.com/ubuntu/microk8s/issues/434

PascalSchroederDE commented 5 years ago

Yes, I am running Kubeflow on top of microk8s as well. It doesn't work with the flip-coin example either, same error. So it's probably related to issue 2347 as you mentioned. However, the suggested "dirty fix" does not work for me, because there is no /var/snap/microk8s/current/docker.sock that I could link /var/run/docker.sock to (probably because they replaced the Docker daemon with containerd?). Any other ideas how to get it working? Or do I have to downgrade my microk8s?
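For reference, the "dirty fix" suggested there is, as I understand it, roughly the symlink below; it only applies when microk8s still ships a Docker socket, which the newer containerd-based releases do not:

sudo ln -s /var/snap/microk8s/current/docker.sock /var/run/docker.sock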

magreenberg1 commented 5 years ago

I'm finding that I don't have /var/snap/microk8s/current/docker.sock or /var/snap/microk8s/common/var/lib/docker.

I have noticed that when I begin a new run, a new snapshot is created under containerd with a docker.sock and a lib/docker.

Finding docker.sock:

sudo find /var/snap/microk8s -name "docker.sock"

returns:

/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2733/fs/run/docker.sock
/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2730/fs/run/docker.sock
/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2727/fs/run/docker.sock

Finding lib/docker:

sudo find /var/snap/microk8s -name "docker" -type d | grep "lib/docker"

returns:

/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2733/fs/var/lib/docker
/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2730/fs/var/lib/docker
/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2727/fs/var/lib/docker

PascalSchroederDE commented 5 years ago

@magreenberg1 Could you solve the issue?

magreenberg1 commented 5 years ago

@PascalSchroederDE I have not. I suspect the short-term fix for me will either involve downgrading microk8s (and seeing if that works) or trying out MiniKF.

PascalSchroederDE commented 5 years ago

Switching to Minikube and setting up kubeflow pipelines on that Minikube cluster worked for me.
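In case it helps others, a minimal standalone Kubeflow Pipelines install on Minikube looks roughly like this (the manifest paths follow the KFP standalone deployment docs and may differ for older releases; replace the release tag with the version you want):

minikube start
export PIPELINE_VERSION=<kfp-release-tag>
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"
kubectl -n kubeflow port-forward svc/ml-pipeline-ui 8080:80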

gnommer commented 5 years ago

Would downgrading microk8s solve this issue? I can see that the code in the container is executed, but it seems to have something to do with containerd handling the containers. I tried a single-container pipeline, i.e. only one job, and it ran but ended with a message similar to this issue. Maybe the containerd daemon is pointing somewhere else when it searches for containers?

Can anyone fill me in on this?

JasonTam commented 5 years ago

I know this is several months old but FWIW, with microk8s v1.15.3 and Kubeflow v0.6, I solved this issue by changing the kubelet container-runtime from remote to docker by editing /var/snap/microk8s/current/args/kubelet:

#--container-runtime=remote
#--container-runtime-endpoint=${SNAP_COMMON}/run/containerd.sock
--container-runtime=docker
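If anyone else tries this, the kubelet has to be restarted afterwards for the change to take effect; on microk8s something like the following should do it (the service name can vary between microk8s releases):

sudo systemctl restart snap.microk8s.daemon-kubelet
# or simply stop and start microk8s:
microk8s.stop && microk8s.start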
Ark-kun commented 5 years ago

I solved this issue by changing the kubelet container-runtime from remote to docker by editing /var/snap/microk8s/current/args/kubelet :

Switching Argo to a non-Docker executor is probably needed for non-Docker environments. There are several issues discussing it.

ipal0 commented 4 years ago

I know this is several months old but FWIW, with microk8s v1.15.3 and kubeflow v0.6 , I solved this issue by changing the kubelet container-runtime from remote to docker by editing /var/snap/microk8s/current/args/kubelet :


#--container-runtime=remote
#--container-runtime-endpoint=${SNAP_COMMON}/run/containerd.sock
--container-runtime=docker

Yes, absolutely. When I changed --container-runtime=docker (from remote), everything started working. Thanks for the suggestion.

robinvanschaik commented 3 years ago

Sorry to re-open the issue.

I am currently in the process of deploying a TensorFlow Extended (TFX) pipeline (v1 release candidate) on KFP 1.14 via the Google Cloud Platform marketplace.

Unfortunately, I am running into the same issue.

Can someone elaborate on how to tackle this in Kubeflow Pipelines on GCP?

Much appreciated!

TomomasaTakatori commented 3 years ago

I've encountered the same problem on GCP's AI Platform Pipelines as well.

The component process looked like it completed, but an error occurred during the "wait" process.

Below are the log details:

time="2021-05-28T03:19:52Z" level=info msg="Waiting on main container"
time="2021-05-28T03:19:53Z" level=info msg="main container started with container ID: b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750"
time="2021-05-28T03:19:53Z" level=info msg="Starting annotations monitor"
time="2021-05-28T03:19:53Z" level=info msg="docker wait b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750"
time="2021-05-28T03:19:53Z" level=info msg="Starting deadline monitor"
time="2021-05-28T03:19:53Z" level=error msg="`docker wait b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750` failed: Error response from daemon: No such container: b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750\n"
time="2021-05-28T03:19:53Z" level=warning msg="Failed to wait for container id 'b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750': Error response from daemon: No such container: b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750"
time="2021-05-28T03:19:53Z" level=error msg="executor error: Error response from daemon: No such container: b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.InternalError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:60\ngithub.com/argoproj/argo/workflow/common.RunCommand\n\t/go/src/github.com/argoproj/argo/workflow/common/util.go:406\ngithub.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:139\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait.func1\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:829\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.16.7-beta.0/pkg/util/wait/wait.go:292\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:828\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
time="2021-05-28T03:19:53Z" level=info msg="Saving logs"
time="2021-05-28T03:19:53Z" level=info msg="[docker logs b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750]"
time="2021-05-28T03:19:53Z" level=info msg="Annotations monitor stopped"
time="2021-05-28T03:19:53Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/example-pipeline-99ndk/example-pipeline-99ndk-1632878199/main.log"
time="2021-05-28T03:19:53Z" level=info msg="Creating minio client minio-service.default:9000 using static credentials"
time="2021-05-28T03:19:53Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.default:9000, bucket: mlpipeline, key: artifacts/example-pipeline-99ndk/example-pipeline-99ndk-1632878199/main.log)"
time="2021-05-28T03:19:53Z" level=info msg="No output parameters"
time="2021-05-28T03:19:53Z" level=info msg="Saving output artifacts"
time="2021-05-28T03:19:53Z" level=info msg="Staging artifact: mlpipeline-ui-metadata"
time="2021-05-28T03:19:53Z" level=info msg="Copying /tmp/outputs/MLPipeline_UI_metadata/data from container base image layer to /tmp/argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2021-05-28T03:19:53Z" level=info msg="Archiving b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750:/tmp/outputs/MLPipeline_UI_metadata/data to /tmp/argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2021-05-28T03:19:53Z" level=info msg="sh -c docker cp -a b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750:/tmp/outputs/MLPipeline_UI_metadata/data - | gzip > /tmp/argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2021-05-28T03:19:53Z" level=warning msg="path /tmp/outputs/MLPipeline_UI_metadata/data does not exist in archive /tmp/argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2021-05-28T03:19:53Z" level=warning msg="Ignoring optional artifact 'mlpipeline-ui-metadata' which does not exist in path '/tmp/outputs/MLPipeline_UI_metadata/data': path /tmp/outputs/MLPipeline_UI_metadata/data does not exist in archive /tmp/argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2021-05-28T03:19:53Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-05-28T03:19:53Z" level=info msg="Annotating pod with output"
time="2021-05-28T03:19:53Z" level=info msg="Killing sidecars"
time="2021-05-28T03:19:53Z" level=info msg="Alloc=5766 TotalAlloc=12930 Sys=70080 NumGC=4 Goroutines=13"
time="2021-05-28T03:19:53Z" level=fatal msg="Error response from daemon: No such container: b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.InternalError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:60\ngithub.com/argoproj/argo/workflow/common.RunCommand\n\t/go/src/github.com/argoproj/argo/workflow/common/util.go:406\ngithub.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:139\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait.func1\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:829\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.16.7-beta.0/pkg/util/wait/wait.go:292\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:828\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"

I would really appreciate some help. Thank you.

gabriellemadden commented 3 years ago

See the GKE release notes for 1.19.9 - they move away from the Docker runtime.

In our case, we were trying to run Argo Workflows, which uses the docker container runtime executor by default. We upgraded a cluster to Kubernetes 1.19.9 - which changes the default runtime to containerd - and suddenly none of our workflows would start, with our "wait" containers also complaining that they could not find containers. The solution for us was to explicitly tell Argo Workflows to use a container runtime executor other than docker (we switched to k8sapi). See the helm chart containerRuntimeExecutor value and the possible argo-workflow executor environment variables.
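For example, with the community argo-workflows helm chart the switch can be expressed roughly as below; the chart name and the controller.containerRuntimeExecutor value path are assumptions and depend on your chart version:

helm upgrade <release-name> argo/argo-workflows \
  --namespace <argo-namespace> \
  --reuse-values \
  --set controller.containerRuntimeExecutor=k8sapi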

prasunsultania commented 3 years ago

I had similar issues to @gabriellemadden's. I updated the ConfigMap to use k8sapi as below:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  config: |
    executor:
      env:
      - name: ARGO_CONTAINER_RUNTIME_EXECUTOR
        value: k8sapi

I was not sure whether this would take immediate effect or require restarting the containers, so I explicitly restarted the argo-server and workflow-controller pods.
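For reference, that restart can be done with kubectl; the namespace and deployment names below are the usual ones for a Kubeflow Pipelines install and may differ in yours:

kubectl -n kubeflow rollout restart deployment/workflow-controller
kubectl -n kubeflow rollout restart deployment/argo-server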

jiyongjung0 commented 3 years ago

FYI, I ran into the same issue, but I switched to pns because I faced the following error when I tried k8sapi.

failed to save outputs: CopyFile() is not implemented in the k8sapi executor.

When I tried to use pns via the environment variable as in the previous comment, it failed again with the following error: process namespace sharing is not enabled on pod

So I just set it directly in the ConfigMap like the following, and it worked.

apiVersion: v1
data:
  config: |
    {
    namespace: kubeflow,
    executorImage: gcr.io/cloud-marketplace/google-cloud-ai-platform/kubeflow-pipelines/argoexecutor:1.4.1,
    artifactRepository:
    {
        s3: {
            bucket: 'mlpipeline',
            keyPrefix: artifacts,
            endpoint: minio-service.kubeflow:9000,
            insecure: true,
            accessKeySecret: {
                name: mlpipeline-minio-artifact,
                key: accesskey
            },
            secretKeySecret: {
                name: mlpipeline-minio-artifact,
                key: secretkey
            }
        },
        archiveLogs: true
    },
    containerRuntimeExecutor: pns
    }
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: workflow-controller-configmap-pns
  namespace: kubeflow
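
One caveat if you try this: the workflow-controller only reads the ConfigMap that its --configmap argument points at (workflow-controller-configmap by default), so either patch that existing ConfigMap or point the controller at the new one, and then restart the controller. A rough sketch (the file name is just an example):

kubectl -n kubeflow apply -f workflow-controller-configmap-pns.yaml
kubectl -n kubeflow rollout restart deployment/workflow-controller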

See also https://github.com/kubeflow/pipelines/issues/1654 which contains interesting discussion on executors.

huangjiasingle commented 3 years ago

When I use only Argo Workflows, the wait container also ends with No such container: <containerID>. I read the wait code and looked at the wait container logs. There are two cases.

Case one: the main container was recreated and the old main container was deleted by the garbage collector, so docker wait on the old container ID returns "No such container"; however, pollContainerIDs updates the main container ID to the newly created one, so we can pick up the latest main container ID and wait on it (this needs a code update).

Case two: sometimes when docker wait <containerID> is executed, the Docker daemon returns "No such container: <containerID>" even though the container does exist on the node, perhaps because the OS load or disk I/O wait is very high; when the first docker wait fails, retrying a few times may succeed.
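A rough shell illustration of the retry idea for the second case (purely a sketch, not Argo's actual code):

CONTAINER_ID="$1"
# Retry docker wait a few times before giving up, in case the daemon is briefly overloaded.
for attempt in 1 2 3; do
  if docker wait "$CONTAINER_ID"; then
    exit 0
  fi
  echo "docker wait failed (attempt $attempt), retrying..." >&2
  sleep 2
done
exit 1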

nongmo677 commented 3 years ago

FYI, I ran into the same problem. But I switched to pns because I hit the following error when trying k8sapi:

failed to save outputs: CopyFile() is not implemented in the k8sapi executor.

When I tried to use pns via the environment variable as in the previous comment, it failed again with the following error: process namespace sharing is not enabled on pod

So I just set it directly like the following, and it worked.

apiVersion: v1
data:
  config: |
    {
    namespace: kubeflow,
    executorImage: gcr.io/cloud-marketplace/google-cloud-ai-platform/kubeflow-pipelines/argoexecutor:1.4.1,
    artifactRepository:
    {
        s3: {
            bucket: 'mlpipeline',
            keyPrefix: artifacts,
            endpoint: minio-service.kubeflow:9000,
            insecure: true,
            accessKeySecret: {
                name: mlpipeline-minio-artifact,
                key: accesskey
            },
            secretKeySecret: {
                name: mlpipeline-minio-artifact,
                key: secretkey
            }
        },
        archiveLogs: true
    },
    containerRuntimeExecutor: pns
    }
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: workflow-controller-configmap-pns
  namespace: kubeflow

See also #1654, which contains an interesting discussion on executors.

pns is the key; it works for me. Thanks!

bsikander commented 3 years ago

On GCP, if you are using AI Platform Pipelines and are having this issue, then you need to change your Kubernetes deployment and switch the node image type from containerd to docker. This worked fine for me. My k8s version was 1.19 and Kubeflow version was 1.4.1. (screenshot omitted)
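For anyone doing this from the command line rather than the console, changing a node pool back to a Docker-based node image was roughly the following at the time (COS was the Docker-based Container-Optimized OS image, COS_CONTAINERD the containerd one); the image type names and exact command may differ by GKE version, so check the GKE docs:

gcloud container clusters upgrade <cluster-name> \
  --node-pool <pool-name> \
  --image-type COS \
  --zone <zone>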

ghost commented 2 years ago

On GCP, if you are using AI Platform Pipelines and are having this issue, then you need to change your Kubernetes deployment and switch the node image type from containerd to docker. This worked fine for me. My k8s version was 1.19 and Kubeflow version was 1.4.1. (screenshot omitted)

This also works for me. k8s - 1.21.6 and Kubeflow (from the GCP marketplace) - 1.7.1.

jinisaweaklearner commented 1 year ago

FYI, I met the same issue. But I switched to pns because I faced the following error when I tried k8sapi.

failed to save outputs: CopyFile() is not implemented in the k8sapi executor.

When I tried to use pns using environment variable as in the previous comment, it failed again with the following error. process namespace sharing is not enabled on pod

So I just set it directly like following and it worked.

apiVersion: v1
data:
  config: |
    {
    namespace: kubeflow,
    executorImage: gcr.io/cloud-marketplace/google-cloud-ai-platform/kubeflow-pipelines/argoexecutor:1.4.1,
    artifactRepository:
    {
        s3: {
            bucket: 'mlpipeline',
            keyPrefix: artifacts,
            endpoint: minio-service.kubeflow:9000,
            insecure: true,
            accessKeySecret: {
                name: mlpipeline-minio-artifact,
                key: accesskey
            },
            secretKeySecret: {
                name: mlpipeline-minio-artifact,
                key: secretkey
            }
        },
        archiveLogs: true
    },
    containerRuntimeExecutor: pns
    }
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: workflow-controller-configmap-pns
  namespace: kubeflow

See also #1654 which contains interesting discussion on executors.

Thanks for this. I got the same error message after GKE was upgraded to 1.25, which no longer supports the Docker-based node image. To keep using the containerd image, changing containerRuntimeExecutor to pns works for me.