kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

Official artifact passing example fails with PNS executor #4257

Closed. juliusvonkohout closed this issue 3 years ago.

juliusvonkohout commented 4 years ago

What steps did you take:

I am using the official example https://github.com/argoproj/argo/blob/master/examples/artifact-passing.yaml that runs fine out of the box with the Argo docker executor.

Then I changed the executor to PNS:

apiVersion: v1
data:
  config: |
    {
    namespace: kubeflow,
    containerRuntimeExecutor: pns,
    executorImage: gcr.io/ml-pipeline/argoexec:v2.7.5-license-compliance,
    ...

What happened:

Every pipeline that passes outputs (including the official example) is now failing. The problem seems to be that the main container exits normally and the wait container can then no longer chroot into it:

"executor error: could not chroot into main for artifact collection: container may have exited too quickly"

The docker executor works around this by abusing docker.sock to copy the outputs from the terminated main container, which is obviously completely infeasible in production.

The funny thing is that you can manually mount an emptyDir under /tmp/outputs and add the proper output paths (e.g. tmp/outputs/numbers/data) to op.output_artifact_paths:

def add_emptydir(op):
    from kubernetes import client as k8s_client
    op.add_volume(k8s_client.V1Volume(name='outputs', empty_dir=k8s_client.V1EmptyDirVolumeSource()))
    op.container.add_volume_mount(k8s_client.V1VolumeMount(name='outputs', mount_path='tmp/outputs'))
    op.output_artifact_paths={
        'mlpipeline-ui-metadata': 'tmp/outputs/mlpipeline-ui-metadata.json',
        'mlpipeline-metrics': 'tmp/outputs/mlpipeline-metrics.json',
        'extract-as-artifact': 'tmp/outputs/numbers/data',
    }
    return op

Then the output file (tmp/outputs/numbers/data) is successfully extracted via the mirrored mounts functionality, but extracting the same file with chroot fails.

What did you expect to happen:

I expect PNS to extract the output successfully.

Environment:

I tried Kubeflow Pipelines on Kubernetes 1.17 (Azure) and 1.18 (minikube) clusters with Docker as the container engine.

How did you deploy Kubeflow Pipelines (KFP)?

Download and extract https://github.com/kubeflow/pipelines/archive/1.0.0.zip, then install with:

kubectl apply -k '/home/julius/Schreibtisch/kubeflow/pipelines-1.0.0/manifests/kustomize/cluster-scoped-resources'
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k '/home/julius/Schreibtisch/kubeflow/pipelines-1.0.0/manifests/kustomize/env/dev'

KFP version: I am using the 1.0.0 release https://github.com/kubeflow/pipelines/releases/tag/1.0.0.

KFP SDK version:

[julius@julius-asus ~]$ pip list | grep kfp
kfp               1.0.0
kfp-server-api    1.0.0

Anything else you would like to add:

I experimented with op.file_outputs, with emptyDir plus the k8sapi executor, and with newer Argo workflow-controller and exec images (2.8.3 and 2.9.3 in deployment/workflow-controller), all without success.

So I am wondering why PNS works for others.

In addition to the official examples I am also using some very simple pipelines:

from kfp.components import func_to_container_op, OutputPath

@func_to_container_op
def write_numbers_1(numbers_path: OutputPath(str), start: int = 0, count: int = 10):
    '''Write numbers to file'''
    import time, datetime
    time.sleep(30)  # should not be necessary with newer versions of argo
    print('numbers_path:', numbers_path)
    with open(numbers_path, 'w') as writer:
        for i in range(start, count):
            writer.write(str(i) + '\n')
    print('finished', datetime.datetime.now())

which work perfectly fine with the docker executor and fail miserably with PNS.
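
For reference, this is roughly how that component can be wired into a pipeline and submitted with the kfp 1.0 SDK (a minimal sketch; the pipeline name, output file, and client host below are illustrative assumptions, not taken from this report):

import kfp
from kfp import dsl, compiler

@dsl.pipeline(name='write-numbers', description='Minimal data-passing repro')
def write_numbers_pipeline(count: int = 10):
    write_numbers_1(count=count)  # the decorated component factory from above

# Compile to an Argo workflow and submit it; adjust the host to your
# ml-pipeline API endpoint (e.g. a port-forward).
compiler.Compiler().compile(write_numbers_pipeline, 'write_numbers_pipeline.yaml')
kfp.Client(host='http://localhost:8888').create_run_from_pipeline_func(
    write_numbers_pipeline, arguments={'count': 10})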

See also https://github.com/kubeflow/pipelines/issues/1654#issuecomment-661997829


/kind bug

Bobgy commented 4 years ago

I've been trying the PNS executor in https://github.com/kubeflow/pipelines/issues/4256 on Kind and k3c environments, and it seems to work properly for the built-in basic data passing pipeline.

/cc @Ark-kun do you have any suggestions?

alfsuse commented 4 years ago

@juliusvonkohout could you please try swapping the lines

containerRuntimeExecutor: pns,
executorImage: gcr.io/ml-pipeline/argoexec:v2.7.5-license-compliance,

to

executorImage: gcr.io/ml-pipeline/argoexec:v2.7.5-license-compliance,
containerRuntimeExecutor: pns,

and re-try? Also, did you recreate the workflow-controller pod after you updated the executor?

juliusvonkohout commented 4 years ago

@alfsuse

I changed it with kubectl edit configmap workflow-controller-configmap -n kubeflow to

apiVersion: v1
data:
  config: |
    {
    namespace: kubeflow,
    executorImage: gcr.io/ml-pipeline/argoexec:v2.7.5-license-compliance,
    containerRuntimeExecutor: pns,
    artifactRepository:
...

and deleted the pod with kubectl delete pod workflow-controller-xxx -n kubeflow. I also checked manually that the pod restarted.

It still fails at chroot.

juliusvonkohout commented 4 years ago

@alfsuse I tried it from scratch with k3s (containerd) and PNS, and it worked.

Interestingly, it fails with minikube, which uses Docker internally.

alfsuse commented 4 years ago

Hmm, by chance did you have any PSP (PodSecurityPolicy) in the previous deployment? I noticed that PSPs without SYS_PTRACE throw that error.

Ark-kun commented 4 years ago

I am using the official example https://github.com/argoproj/argo/blob/master/examples/artifact-passing.yaml that runs fine out of the box with the Argo docker executor.

Then I changed the executor to PNS

It might be useful to also report this to Argo, although AFAIK this is a known issue. See https://github.com/argoproj/argo/issues/1256#issuecomment-481471276. Maybe with https://github.com/argoproj/argo/issues/2679 fixed, PNS can work more reliably.

The docker executor works around this by abusing docker.sock to copy the outputs from the terminated main container which is obviously completely infeasible in production.

Can you give more details about the infeasibility?

juliusvonkohout commented 4 years ago

This is my current status

curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && chmod +x minikube
./minikube version # 1.12.1 kubernetes 1.18.3 Docker 19.03.2 CRI-O 1.17.3
 ./minikube delete && rm -rf ~/.kube ~/.minikube  # make sure there is no old cluster
./minikube start --memory 8192 --cpus 4 --container-runtime=cri-o --driver=docker --extra-config=apiserver.enable-admission-plugins=PodSecurityPolicy --addons=pod-security-policy
# --driver=kvm2 did not work

# Container runtimes:  Docker, rkt, CRI-O and containerd
# between tests: ./minikube delete && rm -rf ~/.kube ~/.minikube 
# ./minikube start --container-runtime=containerd --driver=docker works with pns
# ./minikube start --container-runtime=docker --driver=docker works with docker and pns
# ./minikube start --container-runtime=CRI-O --driver=docker works with pns
# k3s(containerd) works with pns and is very fast

The only cluster not working is on Azure. I will have to check with the maintainer of the Azure cluster and report back here. By the way, the Azure cluster is a bit outdated compared to k3s and minikube:

  Container Runtime Version:  docker://18.9.1
  Kubelet Version:            v1.17.7

"Humm by chance did you have any psp in the previous deployment? I noticed that psp without SYS_PTRACE throw that error" You will get an error on pod startup if SYS_Ptrace is not enabled. This is the PSP that works for minikube and pns

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: kubeflow
spec:
  allowPrivilegeEscalation: true
  allowedCapabilities:
  - 'SYS_PTRACE'
  fsGroup:
    rule: RunAsAny
  hostIPC: false
  hostNetwork: false
  hostPID: false
  privileged: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim

Maybe it helps to add CAP_SYS_CHROOT? Maybe then allowPrivilegeEscalation: true becomes unnecessary.
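
Untested sketch of what adding that capability could look like, patching the PSP above with the kubernetes Python client (assumes a client version that still exposes the policy/v1beta1 PSP API; 'kubeflow' is the PSP name from the manifest above):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

# Merge-patch the existing PSP so both capabilities PNS needs are allowed.
patch = {'spec': {'allowedCapabilities': ['SYS_PTRACE', 'SYS_CHROOT']}}
client.PolicyV1beta1Api().patch_pod_security_policy(name='kubeflow', body=patch)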

"Can you give more details about the infeasibility?" Well hostpath and docker.sock access is a security issue. You cannot expect anyone to manage your cluster with that security hole.

alfsuse commented 4 years ago

@juliusvonkohout one thing for sure is that with that PSP you should have issues with Jupyter notebook spawning; if I remember correctly you also need NET_RAW and NET_ADMIN. So I would say that having one or more PSPs deployed could be the root cause here, along with PNS.

"Maybe it helps to add CAP_SYS_CHROOT" I would go for a test like try '*' as first approach..if everything works you may work in reverse and start to restrict more and more.

Double-check on Azure whether there are also PSPs there that could affect you in the same way.

juliusvonkohout commented 4 years ago

@alfsuse I am only using the pipelines at the moment.

I still get:

[julius@julius-asus ~]$ kubectl --kubeconfig=/home/julius/Downloads/kubeflow-poc-kubeconfig logs conditional-execution-pipeline-with-exit-handler-fkdzb-3942090038 -c wait -n kubeflow
time="2020-07-23T14:38:19Z" level=info msg="Starting Workflow Executor" version=v2.7.5+ede163e.dirty
time="2020-07-23T14:38:19Z" level=info msg="Creating PNS executor (namespace: kubeflow, pod: conditional-execution-pipeline-with-exit-handler-fkdzb-3942090038, pid: 8, hasOutputs: true)"
time="2020-07-23T14:38:19Z" level=info msg="Executor (version: v2.7.5+ede163e.dirty, build_date: 2020-04-21T01:12:08Z) initialized (pod: kubeflow/conditional-execution-pipeline-with-exit-handler-fkdzb-3942090038) with template:\n{\"name\":\"flip-coin-op\",\"arguments\":{},\"inputs\":{},\"outputs\":{\"parameters\":[{\"name\":\"flip-coin-op-Output\",\"valueFrom\":{\"path\":\"/tmp/outputs/Output/data\"}}],\"artifacts\":[{\"name\":\"flip-coin-op-Output\",\"path\":\"/tmp/outputs/Output/data\"}]},\"metadata\":{\"annotations\":{\"pipelines.kubeflow.org/component_ref\":\"{}\",\"pipelines.kubeflow.org/component_spec\":\"{\\\"description\\\": \\\"Flip a coin and output heads or tails randomly.\\\", \\\"implementation\\\": {\\\"container\\\": {\\\"args\\\": [\\\"----output-paths\\\", {\\\"outputPath\\\": \\\"Output\\\"}], \\\"command\\\": [\\\"python3\\\", \\\"-u\\\", \\\"-c\\\", \\\"def flip_coin_op():\\\\n    \\\\\\\"\\\\\\\"\\\\\\\"Flip a coin and output heads or tails randomly.\\\\\\\"\\\\\\\"\\\\\\\"\\\\n    import random\\\\n    result = random.choice(['heads', 'tails'])\\\\n    print(result)\\\\n    return result\\\\n\\\\ndef _serialize_str(str_value: str) -\\u003e str:\\\\n    if not isinstance(str_value, str):\\\\n        raise TypeError('Value \\\\\\\"{}\\\\\\\" has type \\\\\\\"{}\\\\\\\" instead of str.'.format(str(str_value), str(type(str_value))))\\\\n    return str_value\\\\n\\\\nimport argparse\\\\n_parser = argparse.ArgumentParser(prog='Flip coin op', description='Flip a coin and output heads or tails randomly.')\\\\n_parser.add_argument(\\\\\\\"----output-paths\\\\\\\", dest=\\\\\\\"_output_paths\\\\\\\", type=str, nargs=1)\\\\n_parsed_args = vars(_parser.parse_args())\\\\n_output_files = _parsed_args.pop(\\\\\\\"_output_paths\\\\\\\", [])\\\\n\\\\n_outputs = flip_coin_op(**_parsed_args)\\\\n\\\\n_outputs = [_outputs]\\\\n\\\\n_output_serializers = [\\\\n    _serialize_str,\\\\n\\\\n]\\\\n\\\\nimport os\\\\nfor idx, output_file in enumerate(_output_files):\\\\n    try:\\\\n        os.makedirs(os.path.dirname(output_file))\\\\n    except OSError:\\\\n        pass\\\\n    with open(output_file, 'w') as f:\\\\n        f.write(_output_serializers[idx](_outputs[idx]))\\\\n\\\"], \\\"image\\\": \\\"python:3.7\\\"}}, \\\"name\\\": \\\"Flip coin op\\\", \\\"outputs\\\": [{\\\"name\\\": \\\"Output\\\", \\\"type\\\": \\\"String\\\"}]}\",\"sidecar.istio.io/inject\":\"false\"},\"labels\":{\"pipelines.kubeflow.org/cache_enabled\":\"true\"}},\"container\":{\"name\":\"\",\"image\":\"python:3.7\",\"command\":[\"python3\",\"-u\",\"-c\",\"def flip_coin_op():\\n    \\\"\\\"\\\"Flip a coin and output heads or tails randomly.\\\"\\\"\\\"\\n    import random\\n    result = random.choice(['heads', 'tails'])\\n    print(result)\\n    return result\\n\\ndef _serialize_str(str_value: str) -\\u003e str:\\n    if not isinstance(str_value, str):\\n        raise TypeError('Value \\\"{}\\\" has type \\\"{}\\\" instead of str.'.format(str(str_value), str(type(str_value))))\\n    return str_value\\n\\nimport argparse\\n_parser = argparse.ArgumentParser(prog='Flip coin op', description='Flip a coin and output heads or tails randomly.')\\n_parser.add_argument(\\\"----output-paths\\\", dest=\\\"_output_paths\\\", type=str, nargs=1)\\n_parsed_args = vars(_parser.parse_args())\\n_output_files = _parsed_args.pop(\\\"_output_paths\\\", [])\\n\\n_outputs = flip_coin_op(**_parsed_args)\\n\\n_outputs = [_outputs]\\n\\n_output_serializers = [\\n    _serialize_str,\\n\\n]\\n\\nimport os\\nfor idx, output_file in 
enumerate(_output_files):\\n    try:\\n        os.makedirs(os.path.dirname(output_file))\\n    except OSError:\\n        pass\\n    with open(output_file, 'w') as f:\\n        f.write(_output_serializers[idx](_outputs[idx]))\\n\"],\"args\":[\"----output-paths\",\"/tmp/outputs/Output/data\"],\"resources\":{}},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/conditional-execution-pipeline-with-exit-handler-fkdzb/conditional-execution-pipeline-with-exit-handler-fkdzb-3942090038\"}}}"
time="2020-07-23T14:38:19Z" level=info msg="Waiting on main container"
time="2020-07-23T14:38:19Z" level=warning msg="Polling root processes (1m0s)"
time="2020-07-23T14:38:19Z" level=info msg="pid 24: &{root 224 2147484013 {682287200 63713278224 0x2d10800} {64772 128 17 16749 0 0 0 0 224 4096 0 {1595451931 150311700} {1577681424 682287200} {1577681424 682287200} [0 0 0]}}"
time="2020-07-23T14:38:19Z" level=info msg="Secured filehandle on /proc/24/root"
time="2020-07-23T14:38:19Z" level=info msg="containerID docker-b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f mapped to pid 24"
time="2020-07-23T14:38:19Z" level=info msg="pid 24: &{root 17 2147484141 {797490624 63731111899 0x2d10800} {174 137951787 1 16877 0 0 0 0 17 4096 0 {1595515099 797490624} {1595515099 797490624} {1595515099 959490216} [0 0 0]}}"
time="2020-07-23T14:38:19Z" level=info msg="Secured filehandle on /proc/24/root"
time="2020-07-23T14:38:20Z" level=info msg="pid 24: &{root 17 2147484141 {797490624 63731111899 0x2d10800} {174 137951787 1 16877 0 0 0 0 17 4096 0 {1595515099 797490624} {1595515099 797490624} {1595515099 959490216} [0 0 0]}}"
time="2020-07-23T14:38:20Z" level=info msg="pid 24: &{root 17 2147484141 {797490624 63731111899 0x2d10800} {174 137951787 1 16877 0 0 0 0 17 4096 0 {1595515099 797490624} {1595515099 797490624} {1595515099 959490216} [0 0 0]}}"
time="2020-07-23T14:38:20Z" level=info msg="pid 24: &{root 28 2147484141 {797490624 63731111899 0x2d10800} {174 137951787 1 16877 0 0 0 0 28 4096 0 {1595515100 136489770} {1595515099 797490624} {1595515100 95489873} [0 0 0]}}"
time="2020-07-23T14:38:20Z" level=info msg="pid 24: &{root 28 2147484141 {797490624 63731111899 0x2d10800} {174 137951787 1 16877 0 0 0 0 28 4096 0 {1595515100 136489770} {1595515099 797490624} {1595515100 95489873} [0 0 0]}}"
time="2020-07-23T14:38:20Z" level=info msg="main container started with container ID: b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f"
time="2020-07-23T14:38:20Z" level=info msg="Starting annotations monitor"
time="2020-07-23T14:38:20Z" level=info msg="Starting deadline monitor"
time="2020-07-23T14:38:20Z" level=warning msg="Failed to wait for container id 'b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f': Failed to determine pid for containerID b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f: container may have exited too quickly"
time="2020-07-23T14:38:20Z" level=error msg="executor error: Failed to determine pid for containerID b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f: container may have exited too quickly\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:55\ngithub.com/argoproj/argo/errors.InternalErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:65\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).getContainerPID\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:289\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:162\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait.func1\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:829\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.16.7-beta.0/pkg/util/wait/wait.go:292\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:828\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"

This also happens with longer-running main containers (30+ seconds). The code path goes from https://github.com/argoproj/argo/blob/13b1d3c19e94047ae97a071e4468b1050b8e292b/workflow/executor/pns/pns.go#L292 to https://github.com/argoproj/argo/blob/13b1d3c19e94047ae97a071e4468b1050b8e292b/workflow/executor/pns/pns.go#L189, which goes to https://github.com/argoproj/pkg/blob/23e74cb168fe1186222bf7c758fe62e59260dd53/exec/exec.go#L139. The error text comes from https://github.com/argoproj/argo/blob/13b1d3c19e94047ae97a071e4468b1050b8e292b/workflow/executor/pns/pns.go#L302.

  Kernel Version:             4.18.0-193.1.2.el8_2.x86_64
  OS Image:                   Red Hat Enterprise Linux 8.2 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://18.9.1
  Kubelet Version:            v1.17.7
  Kube-Proxy Version:         v1.17.7

Here is the output from the minikube cluster

[julius@julius-asus ~]$ kubectl logs file-passing-pipelines-r74xd-3654244453  -c wait -n kubeflow
time="2020-07-23T15:00:11Z" level=info msg="Starting Workflow Executor" version=v2.7.5+ede163e.dirty
time="2020-07-23T15:00:11Z" level=info msg="Creating PNS executor (namespace: kubeflow, pod: file-passing-pipelines-r74xd-3654244453, pid: 23, hasOutputs: true)"
time="2020-07-23T15:00:11Z" level=info msg="Executor (version: v2.7.5+ede163e.dirty, build_date: 2020-04-21T01:12:08Z) initialized (pod: kubeflow/file-passing-pipelines-r74xd-3654244453) with template:\n{\"name\":\"split-text-lines\",\"arguments\":{},\"inputs\":{\"artifacts\":[{\"name\":\"source\",\"path\":\"/tmp/inputs/source/data\",\"raw\":{\"data\":\"one\\ntwo\\nthree\\nfour\\nfive\\nsix\\nseven\\neight\\nnine\\nten\"}}]},\"outputs\":{\"artifacts\":[{\"name\":\"split-text-lines-even_lines\",\"path\":\"/tmp/outputs/even_lines/data\"},{\"name\":\"split-text-lines-odd_lines\",\"path\":\"/tmp/outputs/odd_lines/data\"}]},\"metadata\":{\"annotations\":{\"pipelines.kubeflow.org/component_ref\":\"{}\",\"pipelines.kubeflow.org/component_spec\":\"{\\\"implementation\\\": {\\\"container\\\": {\\\"args\\\": [\\\"--source\\\", {\\\"inputPath\\\": \\\"source\\\"}, \\\"--odd-lines\\\", {\\\"outputPath\\\": \\\"odd_lines\\\"}, \\\"--even-lines\\\", {\\\"outputPath\\\": \\\"even_lines\\\"}], \\\"command\\\": [\\\"python3\\\", \\\"-u\\\", \\\"-c\\\", \\\"def _make_parent_dirs_and_return_path(file_path: str):\\\\n    import os\\\\n    os.makedirs(os.path.dirname(file_path), exist_ok=True)\\\\n    return file_path\\\\n\\\\ndef split_text_lines(source_path, odd_lines_path, even_lines_path):\\\\n    with open(source_path, 'r') as reader:\\\\n        with open(odd_lines_path, 'w') as odd_writer:\\\\n            with open(even_lines_path, 'w') as even_writer:\\\\n                while True:\\\\n                    line = reader.readline()\\\\n                    if line == \\\\\\\"\\\\\\\":\\\\n                        break\\\\n                    odd_writer.write(line)\\\\n                    line = reader.readline()\\\\n                    if line == \\\\\\\"\\\\\\\":\\\\n                        break\\\\n                    even_writer.write(line)\\\\n\\\\nimport argparse\\\\n_parser = argparse.ArgumentParser(prog='Split text lines', description='')\\\\n_parser.add_argument(\\\\\\\"--source\\\\\\\", dest=\\\\\\\"source_path\\\\\\\", type=str, required=True, default=argparse.SUPPRESS)\\\\n_parser.add_argument(\\\\\\\"--odd-lines\\\\\\\", dest=\\\\\\\"odd_lines_path\\\\\\\", type=_make_parent_dirs_and_return_path, required=True, default=argparse.SUPPRESS)\\\\n_parser.add_argument(\\\\\\\"--even-lines\\\\\\\", dest=\\\\\\\"even_lines_path\\\\\\\", type=_make_parent_dirs_and_return_path, required=True, default=argparse.SUPPRESS)\\\\n_parsed_args = vars(_parser.parse_args())\\\\n\\\\n_outputs = split_text_lines(**_parsed_args)\\\\n\\\"], \\\"image\\\": \\\"python:3.7\\\"}}, \\\"inputs\\\": [{\\\"name\\\": \\\"source\\\", \\\"type\\\": \\\"String\\\"}], \\\"name\\\": \\\"Split text lines\\\", \\\"outputs\\\": [{\\\"name\\\": \\\"odd_lines\\\", \\\"type\\\": \\\"String\\\"}, {\\\"name\\\": \\\"even_lines\\\", \\\"type\\\": \\\"String\\\"}]}\",\"sidecar.istio.io/inject\":\"false\"},\"labels\":{\"pipelines.kubeflow.org/cache_enabled\":\"true\"}},\"container\":{\"name\":\"\",\"image\":\"python:3.7\",\"command\":[\"python3\",\"-u\",\"-c\",\"def _make_parent_dirs_and_return_path(file_path: str):\\n    import os\\n    os.makedirs(os.path.dirname(file_path), exist_ok=True)\\n    return file_path\\n\\ndef split_text_lines(source_path, odd_lines_path, even_lines_path):\\n    with open(source_path, 'r') as reader:\\n        with open(odd_lines_path, 'w') as odd_writer:\\n            with open(even_lines_path, 'w') as even_writer:\\n                while True:\\n                    line = reader.readline()\\n                 
   if line == \\\"\\\":\\n                        break\\n                    odd_writer.write(line)\\n                    line = reader.readline()\\n                    if line == \\\"\\\":\\n                        break\\n                    even_writer.write(line)\\n\\nimport argparse\\n_parser = argparse.ArgumentParser(prog='Split text lines', description='')\\n_parser.add_argument(\\\"--source\\\", dest=\\\"source_path\\\", type=str, required=True, default=argparse.SUPPRESS)\\n_parser.add_argument(\\\"--odd-lines\\\", dest=\\\"odd_lines_path\\\", type=_make_parent_dirs_and_return_path, required=True, default=argparse.SUPPRESS)\\n_parser.add_argument(\\\"--even-lines\\\", dest=\\\"even_lines_path\\\", type=_make_parent_dirs_and_return_path, required=True, default=argparse.SUPPRESS)\\n_parsed_args = vars(_parser.parse_args())\\n\\n_outputs = split_text_lines(**_parsed_args)\\n\"],\"args\":[\"--source\",\"/tmp/inputs/source/data\",\"--odd-lines\",\"/tmp/outputs/odd_lines/data\",\"--even-lines\",\"/tmp/outputs/even_lines/data\"],\"resources\":{}},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/file-passing-pipelines-r74xd/file-passing-pipelines-r74xd-3654244453\"}}}"
time="2020-07-23T15:00:11Z" level=info msg="Waiting on main container"
time="2020-07-23T15:00:11Z" level=warning msg="Polling root processes (1m0s)"
time="2020-07-23T15:00:11Z" level=info msg="pid 41: &{root 4096 2147484141 {984089037 63731108484 0x2d10800} {65 17847261 1 16877 0 0 0 0 4096 4096 16 {1595511683 105074911} {1595511684 984089037} {1595511793 806924664} [0 0 0]}}"
time="2020-07-23T15:00:11Z" level=info msg="Secured filehandle on /proc/41/root"
time="2020-07-23T15:00:11Z" level=info msg="containerID 2aa2f45135eedd0e33503a8bad76a37bdacf1e58ae6b15aadbfc640d78ea4a2c mapped to pid 41"
time="2020-07-23T15:00:11Z" level=info msg="pid 41: &{root 4096 2147484141 {337868407 63731113211 0x2d10800} {2097192 13768451 1 16877 0 0 0 0 4096 4096 16 {1595516411 439869256} {1595516411 337868407} {1595516411 493869706} [0 0 0]}}"
time="2020-07-23T15:00:11Z" level=info msg="Secured filehandle on /proc/41/root"
time="2020-07-23T15:00:11Z" level=info msg="pid 41: &{root 4096 2147484141 {337868407 63731113211 0x2d10800} {2097192 13768451 1 16877 0 0 0 0 4096 4096 16 {1595516411 439869256} {1595516411 337868407} {1595516411 544870131} [0 0 0]}}"
time="2020-07-23T15:00:11Z" level=info msg="pid 41: &{root 4096 2147484141 {337868407 63731113211 0x2d10800} {2097192 13768451 1 16877 0 0 0 0 4096 4096 16 {1595516411 439869256} {1595516411 337868407} {1595516411 544870131} [0 0 0]}}"
time="2020-07-23T15:00:11Z" level=info msg="pid 41: &{root 4096 2147484141 {337868407 63731113211 0x2d10800} {2097192 13768451 1 16877 0 0 0 0 4096 4096 16 {1595516411 439869256} {1595516411 337868407} {1595516411 544870131} [0 0 0]}}"
time="2020-07-23T15:00:11Z" level=info msg="pid 41: &{root 4096 2147484141 {337868407 63731113211 0x2d10800} {2097192 13768451 1 16877 0 0 0 0 4096 4096 16 {1595516411 439869256} {1595516411 337868407} {1595516411 544870131} [0 0 0]}}"
time="2020-07-23T15:00:12Z" level=info msg="main container started with container ID: 2aa2f45135eedd0e33503a8bad76a37bdacf1e58ae6b15aadbfc640d78ea4a2c"
time="2020-07-23T15:00:12Z" level=info msg="Starting annotations monitor"
time="2020-07-23T15:00:12Z" level=info msg="Main pid identified as 41"
time="2020-07-23T15:00:12Z" level=info msg="Successfully secured file handle on main container root filesystem"
time="2020-07-23T15:00:12Z" level=info msg="Waiting for main pid 41 to complete"
time="2020-07-23T15:00:12Z" level=info msg="Starting deadline monitor"
time="2020-07-23T15:00:12Z" level=info msg="Stopped root processes polling due to successful securing of main root fs"
time="2020-07-23T15:00:13Z" level=info msg="Main pid 41 completed"
time="2020-07-23T15:00:13Z" level=info msg="Main container completed"
time="2020-07-23T15:00:13Z" level=info msg="Saving logs"
time="2020-07-23T15:00:13Z" level=info msg="Annotations monitor stopped"
time="2020-07-23T15:00:13Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/file-passing-pipelines-r74xd/file-passing-pipelines-r74xd-3654244453/main.log"
time="2020-07-23T15:00:13Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2020-07-23T15:00:13Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/file-passing-pipelines-r74xd/file-passing-pipelines-r74xd-3654244453/main.log)"
time="2020-07-23T15:00:13Z" level=info msg="No output parameters"
time="2020-07-23T15:00:13Z" level=info msg="Saving output artifacts"
time="2020-07-23T15:00:13Z" level=info msg="Staging artifact: split-text-lines-even_lines"
time="2020-07-23T15:00:13Z" level=info msg="Copying /tmp/outputs/even_lines/data from container base image layer to /tmp/argo/outputs/artifacts/split-text-lines-even_lines.tgz"
time="2020-07-23T15:00:13Z" level=info msg="Taring /tmp/outputs/even_lines/data"
time="2020-07-23T15:00:13Z" level=info msg="S3 Save path: /tmp/argo/outputs/artifacts/split-text-lines-even_lines.tgz, key: artifacts/file-passing-pipelines-r74xd/file-passing-pipelines-r74xd-3654244453/split-text-lines-even_lines.tgz"
  Kernel Version:             5.7.9-200.fc32.x86_64
  OS Image:                   Ubuntu 19.10
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.17.3
  Kubelet Version:            v1.18.3
  Kube-Proxy Version:         v1.18.3

Bobgy commented 4 years ago

I managed to reproduce "Failed to wait for container id 'b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f': Failed to determine pid for containerID b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f: container may have exited too quickly" error once with Docker Desktop + pns executor, but it fails randomly instead of always failing with this.

Ark-kun commented 4 years ago

I managed to reproduce "Failed to wait for container id 'b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f': Failed to determine pid for containerID b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f: container may have exited too quickly" error once with Docker Desktop + pns executor, but it fails randomly instead of always failing with this.

Can you try mounting an emptyDir: {} volume at the output artifact directory (e.g. /tmp/outputs)? I wonder whether this fixes the issue.

juliusvonkohout commented 4 years ago

I already tried this before, but just to be sure I ran it again. The wait container errors out while the main container is still running...

def write_numbers_1(numbers_path: OutputPath(str), start: int = 0, count: int = 10):
    '''Write numbers to file'''
    import time, datetime
    time.sleep(30)
    print('numbers_path:', numbers_path)
    with open(numbers_path, 'w') as writer:
        for i in range(start, count):
            writer.write(str(i) + '\n')
    print('finished', datetime.datetime.now())

def add_emptydir(op):
    from kubernetes import client as k8s_client
    op.add_volume(k8s_client.V1Volume(name='outputs', empty_dir=k8s_client.V1EmptyDirVolumeSource()))
    op.container.add_volume_mount(k8s_client.V1VolumeMount(name='outputs', mount_path='tmp/outputs'))
    return op

def sum_pipeline(count: int = 100000):
    write_numbers_component = add_emptydir(func_to_container_op(func=write_numbers_1)(count=count))

Logs on the Azure cluster:

[julius@julius-asus ~]$ kubectl --kubeconfig=/home/julius/Downloads/kubeflow-poc-kubeconfig logs sum-pipeline-z9894-1392252490 -c wait -n kubeflow
time="2020-07-24T09:42:58Z" level=info msg="Starting Workflow Executor" version=v2.7.5+ede163e.dirty
time="2020-07-24T09:42:58Z" level=info msg="Creating PNS executor (namespace: kubeflow, pod: sum-pipeline-z9894-1392252490, pid: 8, hasOutputs: true)"
time="2020-07-24T09:42:58Z" level=info msg="Executor (version: v2.7.5+ede163e.dirty, build_date: 2020-04-21T01:12:08Z) initialized (pod: kubeflow/sum-pipeline-z9894-1392252490) with template:\n{\"name\":\"write-numbers-1\",\"arguments\":{},\"inputs\":{\"parameters\":[{\"name\":\"count\",\"value\":\"100000\"}]},\"outputs\":{\"artifacts\":[{\"name\":\"write-numbers-1-numbers\",\"path\":\"/tmp/outputs/numbers/data\"}]},\"metadata\":{\"annotations\":{\"pipelines.kubeflow.org/component_ref\":\"{}\",\"pipelines.kubeflow.org/component_spec\":\"{\\\"implementation\\\": {\\\"container\\\": {\\\"args\\\": [{\\\"if\\\": {\\\"cond\\\": {\\\"isPresent\\\": \\\"start\\\"}, \\\"then\\\": [\\\"--start\\\", {\\\"inputValue\\\": \\\"start\\\"}]}}, {\\\"if\\\": {\\\"cond\\\": {\\\"isPresent\\\": \\\"count\\\"}, \\\"then\\\": [\\\"--count\\\", {\\\"inputValue\\\": \\\"count\\\"}]}}, \\\"--numbers\\\", {\\\"outputPath\\\": \\\"numbers\\\"}], \\\"command\\\": [\\\"python3\\\", \\\"-u\\\", \\\"-c\\\", \\\"def _make_parent_dirs_and_return_path(file_path: str):\\\\n    import os\\\\n    os.makedirs(os.path.dirname(file_path), exist_ok=True)\\\\n    return file_path\\\\n\\\\ndef write_numbers_1(numbers_path, start = 0, count = 10):\\\\n    import time, datetime\\\\n    time.sleep(30)\\\\n    '''Write numbers to file'''\\\\n    print('numbers_path:', numbers_path)\\\\n    with open(numbers_path, 'w') as writer:\\\\n        for i in range(start, count):\\\\n            writer.write(str(i) + '\\\\\\\\n')\\\\n    print('finished', datetime.datetime.now())\\\\n\\\\nimport argparse\\\\n_parser = argparse.ArgumentParser(prog='Write numbers 1', description='')\\\\n_parser.add_argument(\\\\\\\"--start\\\\\\\", dest=\\\\\\\"start\\\\\\\", type=int, required=False, default=argparse.SUPPRESS)\\\\n_parser.add_argument(\\\\\\\"--count\\\\\\\", dest=\\\\\\\"count\\\\\\\", type=int, required=False, default=argparse.SUPPRESS)\\\\n_parser.add_argument(\\\\\\\"--numbers\\\\\\\", dest=\\\\\\\"numbers_path\\\\\\\", type=_make_parent_dirs_and_return_path, required=True, default=argparse.SUPPRESS)\\\\n_parsed_args = vars(_parser.parse_args())\\\\n\\\\n_outputs = write_numbers_1(**_parsed_args)\\\\n\\\"], \\\"image\\\": \\\"python:3.7\\\"}}, \\\"inputs\\\": [{\\\"default\\\": \\\"0\\\", \\\"name\\\": \\\"start\\\", \\\"optional\\\": true, \\\"type\\\": \\\"Integer\\\"}, {\\\"default\\\": \\\"10\\\", \\\"name\\\": \\\"count\\\", \\\"optional\\\": true, \\\"type\\\": \\\"Integer\\\"}], \\\"name\\\": \\\"Write numbers 1\\\", \\\"outputs\\\": [{\\\"name\\\": \\\"numbers\\\", \\\"type\\\": \\\"String\\\"}]}\",\"sidecar.istio.io/inject\":\"false\"},\"labels\":{\"pipelines.kubeflow.org/cache_enabled\":\"true\"}},\"container\":{\"name\":\"\",\"image\":\"python:3.7\",\"command\":[\"python3\",\"-u\",\"-c\",\"def _make_parent_dirs_and_return_path(file_path: str):\\n    import os\\n    os.makedirs(os.path.dirname(file_path), exist_ok=True)\\n    return file_path\\n\\ndef write_numbers_1(numbers_path, start = 0, count = 10):\\n    import time, datetime\\n    time.sleep(30)\\n    '''Write numbers to file'''\\n    print('numbers_path:', numbers_path)\\n    with open(numbers_path, 'w') as writer:\\n        for i in range(start, count):\\n            writer.write(str(i) + '\\\\n')\\n    print('finished', datetime.datetime.now())\\n\\nimport argparse\\n_parser = argparse.ArgumentParser(prog='Write numbers 1', description='')\\n_parser.add_argument(\\\"--start\\\", dest=\\\"start\\\", type=int, required=False, 
default=argparse.SUPPRESS)\\n_parser.add_argument(\\\"--count\\\", dest=\\\"count\\\", type=int, required=False, default=argparse.SUPPRESS)\\n_parser.add_argument(\\\"--numbers\\\", dest=\\\"numbers_path\\\", type=_make_parent_dirs_and_return_path, required=True, default=argparse.SUPPRESS)\\n_parsed_args = vars(_parser.parse_args())\\n\\n_outputs = write_numbers_1(**_parsed_args)\\n\"],\"args\":[\"--count\",\"100000\",\"--numbers\",\"/tmp/outputs/numbers/data\"],\"resources\":{},\"volumeMounts\":[{\"name\":\"outputs\",\"mountPath\":\"tmp/outputs\"}]},\"volumes\":[{\"name\":\"outputs\",\"emptyDir\":{}}],\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/sum-pipeline-z9894/sum-pipeline-z9894-1392252490\"}}}"
time="2020-07-24T09:42:58Z" level=info msg="Waiting on main container"
time="2020-07-24T09:42:58Z" level=warning msg="Polling root processes (1m0s)"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 28 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 28 4096 0 {1595583778 206541979} {1595583778 206541979} {1595583778 344543904} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="Secured filehandle on /proc/24/root"
time="2020-07-24T09:42:58Z" level=info msg="containerID docker-84b0f3b2495718f5c5e00b7fbd8ebbb77600f265742ac1e5f95d1eaba22bf1c6 mapped to pid 24"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 28 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 28 4096 0 {1595583778 206541979} {1595583778 206541979} {1595583778 344543904} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 28 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 28 4096 0 {1595583778 206541979} {1595583778 206541979} {1595583778 344543904} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 206541979} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:58Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:59Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:59Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:59Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:59Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:59Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:59Z" level=info msg="main container started with container ID: 84b0f3b2495718f5c5e00b7fbd8ebbb77600f265742ac1e5f95d1eaba22bf1c6"
time="2020-07-24T09:42:59Z" level=info msg="Starting annotations monitor"
time="2020-07-24T09:42:59Z" level=info msg="Starting deadline monitor"
time="2020-07-24T09:42:59Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:59Z" level=warning msg="Failed to wait for container id '84b0f3b2495718f5c5e00b7fbd8ebbb77600f265742ac1e5f95d1eaba22bf1c6': Failed to determine pid for containerID 84b0f3b2495718f5c5e00b7fbd8ebbb77600f265742ac1e5f95d1eaba22bf1c6: container may have exited too quickly"
time="2020-07-24T09:42:59Z" level=error msg="executor error: Failed to determine pid for containerID 84b0f3b2495718f5c5e00b7fbd8ebbb77600f265742ac1e5f95d1eaba22bf1c6: container may have exited too quickly\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:55\ngithub.com/argoproj/argo/errors.InternalErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:65\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).getContainerPID\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:289\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:162\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait.func1\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:829\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.16.7-beta.0/pkg/util/wait/wait.go:292\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:828\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
time="2020-07-24T09:42:59Z" level=info msg="Saving logs"
time="2020-07-24T09:42:59Z" level=info msg="Annotations monitor stopped"
time="2020-07-24T09:42:59Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:59Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:42:59Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"

...

time="2020-07-24T09:43:00Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:43:00Z" level=info msg="Deadline monitor stopped"
time="2020-07-24T09:43:00Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"

...

time="2020-07-24T09:43:26Z" level=info msg="pid 24: &{root 39 2147484141 {206541979 63731180578 0x2d10800} {221 3073358 1 16877 0 0 0 0 39 4096 0 {1595583778 517546319} {1595583778 206541979} {1595583778 479545788} [0 0 0]}}"
time="2020-07-24T09:43:29Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/sum-pipeline-z9894/sum-pipeline-z9894-1392252490/main.log"
time="2020-07-24T09:43:29Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2020-07-24T09:43:29Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/sum-pipeline-z9894/sum-pipeline-z9894-1392252490/main.log)"
time="2020-07-24T09:43:30Z" level=info msg="No output parameters"
time="2020-07-24T09:43:30Z" level=info msg="Saving output artifacts"
time="2020-07-24T09:43:30Z" level=info msg="Staging artifact: write-numbers-1-numbers"
time="2020-07-24T09:43:30Z" level=info msg="Copying /tmp/outputs/numbers/data from container base image layer to /tmp/argo/outputs/artifacts/write-numbers-1-numbers.tgz"
time="2020-07-24T09:43:30Z" level=error msg="executor error: could not chroot into main for artifact collection: container may have exited too quickly\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:55\ngithub.com/argoproj/argo/errors.InternalErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:65\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).enterChroot\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:94\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:136\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:347\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:240\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveArtifacts\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:226\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:59\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
time="2020-07-24T09:43:30Z" level=info msg="Killing sidecars"
time="2020-07-24T09:43:30Z" level=info msg="Alloc=5635 TotalAlloc=35428 Sys=70592 NumGC=11 Goroutines=10"
time="2020-07-24T09:43:30Z" level=fatal msg="could not chroot into main for artifact collection: container may have exited too quickly\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:55\ngithub.com/argoproj/argo/errors.InternalErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:65\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).enterChroot\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:94\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:136\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:347\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:240\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveArtifacts\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:226\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:59\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
[julius@julius-asus ~]$ kubectl --kubeconfig=/home/julius/Downloads/kubeflow-poc-kubeconfig logs sum-pipeline-z9894-1392252490 -c main -n kubeflow
numbers_path: /tmp/outputs/numbers/data
finished 2020-07-24 09:43:28.690058
[julius@julius-asus ~]$ 

ggogel commented 4 years ago

I'm also experiencing this issue. I'm running K3OS, which comes with the CRI-O runtime. The PNS executor was the only one that I got working at all.

juliusvonkohout commented 4 years ago

I got feedback from the Azure cluster maintainer, and he acknowledged that the cluster is partially broken. I also tested PNS successfully on GCP with PodSecurityPolicies. I will report back here if I get access to the new Azure cluster.

@ggogel I had success with minikube and CRI-O. Maybe try that first.

ggogel commented 4 years ago

@juliusvonkohout I somewhat got it working with the k8sapi executor, but when I submit the workflow it fails and only succeeds after a retry. I switched to a VolumeClaim and it works fine. Artifact passing on executors other than the docker executor seems to be immature.
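
For reference, a minimal sketch of what PVC-based data passing can look like with the kfp SDK (the volume name, size, and mount path are illustrative assumptions, not the exact setup used here):

from kfp import dsl

@dsl.pipeline(name='volume-passing')
def volume_passing_pipeline():
    # Create a PVC and mount it into both steps instead of relying on
    # executor-side artifact extraction.
    vop = dsl.VolumeOp(
        name='create-output-volume',
        resource_name='pipeline-outputs',
        size='1Gi',
        modes=dsl.VOLUME_MODE_RWO,
    )
    producer = dsl.ContainerOp(
        name='write-numbers',
        image='python:3.7',
        command=['sh', '-c', 'seq 0 9 > /data/numbers.txt'],
        pvolumes={'/data': vop.volume},
    )
    consumer = dsl.ContainerOp(
        name='read-numbers',
        image='python:3.7',
        command=['sh', '-c', 'cat /data/numbers.txt'],
        pvolumes={'/data': producer.pvolume},
    )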

juliusvonkohout commented 4 years ago

@alfsuse @alexec @jessesuen

I enabled debugging and noticed in the logs that there is a "docker-" prefix for container IDs on that specific cluster, e.g.

"containerID docker-be1b9699927f457b755fd8a2038a3e2003618c6e5fdf23f3bd203eb771d92896 mapped to pid 20"

In https://github.com/argoproj/argo/blob/79217bc89e892ee82bdd5018b2bba65425924d36/workflow/executor/pns/pns.go#L380 there is trimming code for prefixes, but only for CRI-O, not Docker:

// for crio we need to get rid of "crio-" prefix and ".scope" suffix
// e.g. crio-7a92a067289f6197148912be1c15f20f0330c7f3c541473d3b9c4043ca137b42.scope

A bit later it fails with:

time="2020-08-25T12:40:38Z" level=debug msg="{main {nil nil &ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-08-25 12:40:38 +0000 UTC,FinishedAt:2020-08-25 12:40:38 +0000 UTC,ContainerID:docker://be1b9699927f457b755fd8a2038a3e2003618c6e5fdf23f3bd203eb771d92896,}} {nil nil nil} false 0 python:3.7 docker-pullable://python@sha256:0bb57c218e452ace5a4d55ee3c56e3c57e29c7f43cf55192acf7289becff4b0e docker://be1b9699927f457b755fd8a2038a3e2003618c6e5fdf23f3bd203eb771d92896 0xc0001ce32a}"
time="2020-08-25T12:40:38Z" level=info msg="main container started with container ID: be1b9699927f457b755fd8a2038a3e2003618c6e5fdf23f3bd203eb771d92896"
time="2020-08-25T12:40:38Z" level=info msg="Starting annotations monitor"
I0825 12:40:38.926250       6 round_trippers.go:438] GET https://192.168.0.1:443/api/v1/namespaces/kubeflow/pods/conditional-execution-pipeline-with-exit-handler-c576r-3228191410 200 OK in 8 milliseconds
time="2020-08-25T12:40:38Z" level=warning msg="Failed to wait for container id 'be1b9699927f457b755fd8a2038a3e2003618c6e5fdf23f3bd203eb771d92896': Failed to determine pid for containerID be1b9699927f457b755fd8a2038a3e2003618c6e5fdf23f3bd203eb771d92896: container may have exited too quickly"

Attachments: debug_log.txt, describe.txt

The permissions also look OK, AFAIK (screenshot from 2020-08-25 15:34:51 attached).
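
For illustration only (the executor is Go code, this is just a sketch of the idea in Python): the missing piece appears to be normalizing the runtime prefix before comparing container IDs, similar to what the linked CRI-O handling already does:

def normalize_container_id(raw_id: str) -> str:
    # Hypothetical helper: strip runtime-specific prefixes/suffixes so the ID
    # derived from /proc matches the ID reported in the pod status.
    for prefix in ('docker-', 'crio-', 'containerd-'):
        if raw_id.startswith(prefix):
            raw_id = raw_id[len(prefix):]
            break
    if raw_id.endswith('.scope'):
        raw_id = raw_id[:-len('.scope')]
    return raw_id

# e.g. 'docker-be1b9699927f45...' would then compare equal to 'be1b9699927f45...'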

Ark-kun commented 4 years ago

@juliusvonkohout Can you please report this issue to Argo? They're pretty responsive now and have first-hand experience with different executors. Please link the issue here so that we can upvote it.

juliusvonkohout commented 3 years ago

See https://github.com/argoproj/argo/issues/4186#issuecomment-713019571, so rootless operation (with just the PTRACE and CHROOT capabilities) is possible.