PascalSchroederDE closed this issue 5 years ago.
What is your environment? Are you using GKE? Is it reproducible on your side? Can you try Argo's coin flip sample?
I'm having the exact same problem with all of the basic samples. I'm running Kubeflow on top of microk8s on a local machine.
Every time I try to run one of the samples I get: "This step is in Error state with this message: failed to save outputs: Error response from daemon: No such container", and my output of kubectl describe pods is the same as the one above.
These are the upstream issues: https://github.com/kubeflow/kubeflow/issues/2347 and https://github.com/ubuntu/microk8s/issues/434
Yes, I am running Kubeflow on top of microk8s as well. It doesn't work with the Flip Coin example either; same error. So it's probably related to issue 2347, as you mentioned. However, the suggested "dirty fix" is not working for me, because there is no /var/snap/microk8s/current/docker.sock which I could link /var/run/docker.sock to (probably because they replaced the Docker daemon with containerd?). Any other ideas how to get it working? Or do I have to downgrade my microk8s?
I'm finding that I don't have /var/snap/microk8s/current/docker.sock or /var/snap/microk8s/common/var/lib/docker. I have noticed that when I begin a new run, a new snapshot is created under containerd with a docker.sock and a lib/docker.
Finding docker.sock:
sudo find /var/snap/microk8s -name "docker.sock"
returns:
/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2733/fs/run/docker.sock
/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2730/fs/run/docker.sock
/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2727/fs/run/docker.sock
Finding lib/docker:
sudo find /var/snap/microk8s -name "docker" -type d | grep "lib/docker"
returns:
/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2733/fs/var/lib/docker
/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2730/fs/var/lib/docker
/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2727/fs/var/lib/docker
@magreenberg1 Could you solve the issue?
@PascalSchroederDE I have not. I suspect the short-term fix for me will either involve downgrading microk8s (and seeing if that works) or trying out MiniKF.
Switching to Minikube and setting up kubeflow pipelines on that Minikube cluster worked for me.
Would downgrading microk8s solve this issue? I can see that the code in the container is executed, but it seems to be something to do with how containerd handles the containers. I tried a single-container pipeline, i.e. only one job, and it ran but ended with a message similar to this issue. Maybe the containerd daemon is pointing to some other place when it searches for containers? Can anyone fill me in on this?
I know this is several months old but FWIW, with microk8s v1.15.3 and Kubeflow v0.6, I solved this issue by changing the kubelet container-runtime from remote to docker by editing /var/snap/microk8s/current/args/kubelet:
#--container-runtime=remote
#--container-runtime-endpoint=${SNAP_COMMON}/run/containerd.sock
--container-runtime=docker
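In case it helps anyone, here is a rough sketch of applying that edit and restarting microk8s so the kubelet picks up the new runtime. This is only an illustration: it assumes the snap-packaged microk8s CLI of that era and that the args file contains the lines shown above.

```sh
# Comment out the containerd runtime flags and switch kubelet to the docker
# runtime, then restart microk8s so the change takes effect.
sudo sed -i \
  -e 's|^--container-runtime=remote|#--container-runtime=remote|' \
  -e 's|^--container-runtime-endpoint=|#--container-runtime-endpoint=|' \
  /var/snap/microk8s/current/args/kubelet
echo '--container-runtime=docker' | sudo tee -a /var/snap/microk8s/current/args/kubelet
sudo microk8s.stop && sudo microk8s.start
```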
Switching Argo to a non-Docker executor is probably needed for non-Docker environments. There are several issues discussing it.
Yes, absolutely: when I changed --container-runtime=docker (from remote), everything started working. Thanks for the suggestion.
Sorry to re-open the issue.
I am currently in the process of deploying a TensorFlow Extended pipeline (v1 release candidate) on KFP 1.14 via the Google Cloud Platform marketplace.
Unfortunately, I am running into the same issue.
Can someone elaborate on how to tackle this in Kubeflow Pipelines on GCP?
Much appreciated!
I've encountered the same problem on AI Platform Pipelines on GCP as well. The component process looked like it had completed, but an error occurred in the "wait" container. The logging detail is below.
time="2021-05-28T03:19:52Z" level=info msg="Waiting on main container"
time="2021-05-28T03:19:53Z" level=info msg="main container started with container ID: b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750"
time="2021-05-28T03:19:53Z" level=info msg="Starting annotations monitor"
time="2021-05-28T03:19:53Z" level=info msg="docker wait b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750"
time="2021-05-28T03:19:53Z" level=info msg="Starting deadline monitor"
time="2021-05-28T03:19:53Z" level=error msg="`docker wait b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750` failed: Error response from daemon: No such container: b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750\n"
time="2021-05-28T03:19:53Z" level=warning msg="Failed to wait for container id 'b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750': Error response from daemon: No such container: b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750"
time="2021-05-28T03:19:53Z" level=error msg="executor error: Error response from daemon: No such container: b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.InternalError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:60\ngithub.com/argoproj/argo/workflow/common.RunCommand\n\t/go/src/github.com/argoproj/argo/workflow/common/util.go:406\ngithub.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:139\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait.func1\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:829\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.16.7-beta.0/pkg/util/wait/wait.go:292\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:828\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
time="2021-05-28T03:19:53Z" level=info msg="Saving logs"
time="2021-05-28T03:19:53Z" level=info msg="[docker logs b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750]"
time="2021-05-28T03:19:53Z" level=info msg="Annotations monitor stopped"
time="2021-05-28T03:19:53Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/example-pipeline-99ndk/example-pipeline-99ndk-1632878199/main.log"
time="2021-05-28T03:19:53Z" level=info msg="Creating minio client minio-service.default:9000 using static credentials"
time="2021-05-28T03:19:53Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.default:9000, bucket: mlpipeline, key: artifacts/example-pipeline-99ndk/example-pipeline-99ndk-1632878199/main.log)"
time="2021-05-28T03:19:53Z" level=info msg="No output parameters"
time="2021-05-28T03:19:53Z" level=info msg="Saving output artifacts"
time="2021-05-28T03:19:53Z" level=info msg="Staging artifact: mlpipeline-ui-metadata"
time="2021-05-28T03:19:53Z" level=info msg="Copying /tmp/outputs/MLPipeline_UI_metadata/data from container base image layer to /tmp/argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2021-05-28T03:19:53Z" level=info msg="Archiving b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750:/tmp/outputs/MLPipeline_UI_metadata/data to /tmp/argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2021-05-28T03:19:53Z" level=info msg="sh -c docker cp -a b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750:/tmp/outputs/MLPipeline_UI_metadata/data - | gzip > /tmp/argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2021-05-28T03:19:53Z" level=warning msg="path /tmp/outputs/MLPipeline_UI_metadata/data does not exist in archive /tmp/argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2021-05-28T03:19:53Z" level=warning msg="Ignoring optional artifact 'mlpipeline-ui-metadata' which does not exist in path '/tmp/outputs/MLPipeline_UI_metadata/data': path /tmp/outputs/MLPipeline_UI_metadata/data does not exist in archive /tmp/argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2021-05-28T03:19:53Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-05-28T03:19:53Z" level=info msg="Annotating pod with output"
time="2021-05-28T03:19:53Z" level=info msg="Killing sidecars"
time="2021-05-28T03:19:53Z" level=info msg="Alloc=5766 TotalAlloc=12930 Sys=70080 NumGC=4 Goroutines=13"
time="2021-05-28T03:19:53Z" level=fatal msg="Error response from daemon: No such container: b7213cf0a5cb59583b78b8020d3dc8b01272a8417d300586aa26255cdf908750\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.InternalError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:60\ngithub.com/argoproj/argo/workflow/common.RunCommand\n\t/go/src/github.com/argoproj/argo/workflow/common/util.go:406\ngithub.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:139\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait.func1\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:829\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.16.7-beta.0/pkg/util/wait/wait.go:292\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:828\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
I would really appreciate some help. Thank you.
See the GKE release notes for 1.19.9: they move away from the Docker runtime.
In our case, we were trying to run Argo Workflows, which uses the Docker container runtime executor by default. We upgraded a cluster to Kubernetes 1.19.9, which changes the default runtime to containerd, and suddenly none of our workflows would start, with our "wait" containers also complaining that they could not find containers. The solution for us was to explicitly tell Argo Workflows to use a container runtime executor other than docker (we switched to k8sapi). See the Helm chart's containerRuntimeExecutor value and the possible Argo Workflows executor environment variables.
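For reference, here is a rough sketch of how that switch can be made through the argo-workflows Helm chart mentioned above. The release, repo, and namespace names are placeholders, and the containerRuntimeExecutor key may live elsewhere (or be absent) in other chart versions.

```sh
# Switch the workflow-controller's executor from docker to k8sapi via a Helm
# value (placeholder release/namespace; key name assumed from the chart docs).
helm upgrade argo-workflows argo/argo-workflows \
  --namespace argo \
  --reuse-values \
  --set controller.containerRuntimeExecutor=k8sapi
```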
I had issues similar to @gabriellemadden's. I updated the ConfigMap to use k8sapi as below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  config: |
    executor:
      env:
      - name: ARGO_CONTAINER_RUNTIME_EXECUTOR
        value: k8sapi
I was not sure whether this would take immediate effect or require restarting the containers, so I explicitly restarted the argo-server and workflow-controller pods.
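For anyone else trying this, a minimal sketch of applying the ConfigMap change and restarting those pods, assuming the default kubeflow namespace and that both components run as Deployments with the names below:

```sh
# Apply the edited workflow-controller ConfigMap, then restart the controller
# and argo-server so the new executor setting is re-read.
kubectl -n kubeflow apply -f workflow-controller-configmap.yaml
kubectl -n kubeflow rollout restart deployment/workflow-controller
kubectl -n kubeflow rollout restart deployment/argo-server
```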
FYI, I met the same issue, but I switched to pns because I faced the following error when I tried k8sapi:
failed to save outputs: CopyFile() is not implemented in the k8sapi executor.
When I tried to set pns using the environment variable as in the previous comment, it failed again with the following error:
process namespace sharing is not enabled on pod
So I just set it directly in the ConfigMap as follows, and it worked.
apiVersion: v1
data:
  config: |
    {
      namespace: kubeflow,
      executorImage: gcr.io/cloud-marketplace/google-cloud-ai-platform/kubeflow-pipelines/argoexecutor:1.4.1,
      artifactRepository:
      {
        s3: {
          bucket: 'mlpipeline',
          keyPrefix: artifacts,
          endpoint: minio-service.kubeflow:9000,
          insecure: true,
          accessKeySecret: {
            name: mlpipeline-minio-artifact,
            key: accesskey
          },
          secretKeySecret: {
            name: mlpipeline-minio-artifact,
            key: secretkey
          }
        },
        archiveLogs: true
      },
      containerRuntimeExecutor: pns
    }
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: workflow-controller-configmap-pns
  namespace: kubeflow
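One caveat worth checking: the workflow-controller only reads the ConfigMap named by its --configmap argument (typically workflow-controller-configmap), so a ConfigMap with a different name, like the one above, may need to be wired in explicitly. A hedged sketch of verifying and applying this, with namespace and deployment names assumed from a default Kubeflow Pipelines install:

```sh
# See which ConfigMap the controller is actually reading, then apply the pns
# config and restart the controller so it picks up the change.
kubectl -n kubeflow get deployment workflow-controller -o yaml | grep -- --configmap
kubectl -n kubeflow apply -f workflow-controller-configmap-pns.yaml
kubectl -n kubeflow rollout restart deployment/workflow-controller
```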
See also https://github.com/kubeflow/pipelines/issues/1654, which contains an interesting discussion on executors.
When I only use Argo Workflows, the wait container also results in "No such container: <containerID>". I read the wait code and viewed the wait container logs, and there are two cases.
Case one: the main container is recreated and the old main container is deleted by the garbage collector, so docker wait on the old container returns the error "No such container". But pollContainerIDs will update the main container ID to the newly created one, so we can get the latest main container ID and wait on it (update the code).
Case two: sometimes when executing docker wait <containerID>, the Docker daemon returns "No such container: <containerID>" even though the container ID exists on the node, maybe because the OS load is very high, the disk io-wait is very big, and so on. The first docker wait <containerID> returns an error, but if we retry a few times, there may be no error.
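To illustrate the retry idea in case two, a tiny hedged sketch; the container ID is a placeholder, and Argo's actual fix would live in the executor's Go code rather than in shell:

```sh
# Retry `docker wait` a few times before treating "No such container" as
# fatal, since the daemon can transiently fail to find a container under load.
cid="b7213cf0a5cb"   # placeholder container ID
for attempt in 1 2 3; do
  if docker wait "$cid"; then
    break
  fi
  sleep $((attempt * 2))
done
```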
pns is the key. It worked for me too, thanks!
On GCP, if you are using AI Platform Pipelines and are having this issue, you need to change your Kubernetes deployment so the node image type is Docker instead of containerd. This worked fine for me. My k8s version was 1.19 and the Kubeflow version was 1.4.1.
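For completeness, a rough sketch of what switching the node image type can look like with gcloud. Cluster, zone, and pool names are placeholders; on GKE 1.19 the Docker-based image type was still available as COS (versus COS_CONTAINERD).

```sh
# Create a node pool that uses the Docker-based COS image instead of the
# containerd one (placeholder names; drain and migrate workloads from the
# old pool afterwards as appropriate).
gcloud container node-pools create docker-pool \
  --cluster=my-kfp-cluster \
  --zone=us-central1-a \
  --image-type=COS \
  --num-nodes=3
```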
This also works for me: k8s 1.21.6 and Kubeflow (from the GCP marketplace) 1.7.1.
Thanks for the pns suggestion above. I got the same error message when GKE was upgraded to 1.25, which no longer supports the Docker-based node image. With the containerd image, changing containerRuntimeExecutor to pns works for me.
While trying to set up my own Kubeflow pipeline, I ran into a problem when one step is finished and the outputs should be saved. After finishing the step, Kubeflow always throws an error with the message:
This step is in Error state with this message: failed to save outputs: Error response from daemon: No such container: <container-id>
First I thought I had made a mistake with my pipeline, but it's the same with the preexisting example pipelines, e.g. for "[Sample] Basic - Conditional execution" I get this message after the first step (flip-coin) is finished.
The main container shows the following output:
So it seems to have run successfully.
The wait container shows the following output:
So it seems that there is a problem with either Kubeflow or my Docker daemon. The output of kubectl describe pods for the created pod is the following:
So probably there is a problem with the argoexec container image? I see it tries to mount /var/run/docker.sock. When I try to read this file with cat I get "No such device or address", even though I can see the file with ls /var/run. When I try to open it with vi it says the permissions were denied, so I cannot see inside the file. Is this the usual behavior for this file, or does it seem like there are any problems with it?
I would really appreciate any help I can get! Thank you guys!