kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] Pipeline first step stuck in running state even after completing #7132

Closed by RobinKa 2 years ago

RobinKa commented 2 years ago

Hey, hope this is the right place to post this issue. I'm new to Kubeflow and Kubernetes, so please let me know what else would be useful to know.

Environment

Steps to reproduce

  1. Install Kubeflow on Kubernetes 1.19 (K3s) using the manifests; the full setup script is under Materials and Reference below
  2. Go to the Kubeflow dashboard
  3. Start a Pipeline run for [Tutorial] DSL - Control structures
  4. The first step completes successfully (e.g. it logs "tails") but stays stuck in the running state

Terminating the run does nothing. I also tried running other pipelines and the result is the same.


Expected result

The pipeline step should complete and run the rest of the pipeline.

Materials and Reference

Setup on Ubuntu 20.04 server from scratch

sudo apt update -y && sudo apt upgrade -y

# Install docker
sudo apt install ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io

# Install k3s 1.19 (I tried 1.20 too which had the same issue, but 1.21 is too new for manifests)
export INSTALL_K3S_VERSION="v1.19.16%2Bk3s1"
curl -sfL https://get.k3s.io | sh -

# Get Kustomize 3.2.0
cd /opt/
wget https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
chmod +x kustomize_3.2.0_linux_amd64
ln -s /opt/kustomize_3.2.0_linux_amd64 /usr/bin/kustomize

# Install Kubeflow using manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

# Portforward Kubeflow dashboard in new tmux session
tmux new -d -s kubeflow-dashboard-portforward "kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80"

kubectl get pods output

[screenshot of kubectl get pods output]

kubectl logs conditional-execution-pipeline-with-exit-handler-scjtr-3243716801 -c wait -n kubeflow-user-example-com

...
time="2022-01-01T15:31:50.462Z" level=info msg="listed containers" containers="map[]"
time="2022-01-01T15:31:51.462Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow-user-example-com --filter=label=io.kubernetes.pod.name=conditional-execution-pipeline-with-exit-handler-scjtr-3243716801"
time="2022-01-01T15:31:51.498Z" level=info msg="listed containers" containers="map[]"
time="2022-01-01T15:31:52.498Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow-user-example-com --filter=label=io.kubernetes.pod.name=conditional-execution-pipeline-with-exit-handler-scjtr-3243716801"
time="2022-01-01T15:31:52.525Z" level=info msg="listed containers" containers="map[]"
time="2022-01-01T15:31:53.525Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow-user-example-com --filter=label=io.kubernetes.pod.name=conditional-execution-pipeline-with-exit-handler-scjtr-3243716801"
... (keeps going)

Step Events tab

kind: EventList
apiVersion: v1
metadata:
  selfLink: /api/v1/namespaces/kubeflow-user-example-com/events
  resourceVersion: '27545'
items:
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd37b36b69a
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd37b36b69a
      uid: 76728849-0449-4142-a4a1-bf839192d0a2
      resourceVersion: '6986'
      creationTimestamp: '2022-01-01T15:05:00Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: events.k8s.io/v1
          time: '2022-01-01T15:05:00Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:action': {}
            'f:eventTime': {}
            'f:note': {}
            'f:reason': {}
            'f:regarding':
              'f:apiVersion': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:reportingController': {}
            'f:reportingInstance': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6983'
    reason: Scheduled
    message: >-
      Successfully assigned
      kubeflow-user-example-com/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      to ubuntu-2gb-fsn1-2
    source: {}
    firstTimestamp: null
    lastTimestamp: null
    type: Normal
    eventTime: '2022-01-01T15:05:00.551652Z'
    action: Binding
    reportingComponent: default-scheduler
    reportingInstance: default-scheduler-ubuntu-2gb-fsn1-2
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd39bb346f0
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd39bb346f0
      uid: 72582b0d-a333-4284-8f04-d90e11547f28
      resourceVersion: '6997'
      creationTimestamp: '2022-01-01T15:05:01Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:01Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{wait}'
    reason: Pulling
    message: >-
      Pulling image
      "gcr.io/ml-pipeline/argoexec:v3.1.6-patch-license-compliance"
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:01Z'
    lastTimestamp: '2022-01-01T15:05:01Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4e0da1c08
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4e0da1c08
      uid: 7ea0cdf6-001a-4432-8e74-bd5a1e80bfcb
      resourceVersion: '7122'
      creationTimestamp: '2022-01-01T15:05:06Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:06Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{wait}'
    reason: Pulled
    message: >-
      Successfully pulled image
      "gcr.io/ml-pipeline/argoexec:v3.1.6-patch-license-compliance" in
      5.455115342s
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:06Z'
    lastTimestamp: '2022-01-01T15:05:06Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f2b7ba3b
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f2b7ba3b
      uid: b52af29f-96d3-4da0-900b-73fce909d9fe
      resourceVersion: '7123'
      creationTimestamp: '2022-01-01T15:05:06Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:06Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{wait}'
    reason: Created
    message: Created container wait
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:06Z'
    lastTimestamp: '2022-01-01T15:05:06Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f7efebb2
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f7efebb2
      uid: 260352e6-ab20-4ae7-b522-0f00614d0e6b
      resourceVersion: '7128'
      creationTimestamp: '2022-01-01T15:05:06Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:06Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{wait}'
    reason: Started
    message: Started container wait
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:06Z'
    lastTimestamp: '2022-01-01T15:05:06Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f83995c4
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f83995c4
      uid: eb424731-c7df-43c5-a08c-5da8ace8a81f
      resourceVersion: '7129'
      creationTimestamp: '2022-01-01T15:05:06Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:06Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{main}'
    reason: Pulled
    message: 'Container image "python:3.7" already present on machine'
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:06Z'
    lastTimestamp: '2022-01-01T15:05:06Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4fad8f088
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4fad8f088
      uid: f22d88e1-0d43-4a99-b44e-f3a1d44cf524
      resourceVersion: '7130'
      creationTimestamp: '2022-01-01T15:05:06Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:06Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{main}'
    reason: Created
    message: Created container main
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:06Z'
    lastTimestamp: '2022-01-01T15:05:06Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4ff8dd0c4
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4ff8dd0c4
      uid: 798f6d2a-d3eb-4d2d-b449-7eeb6d7c285c
      resourceVersion: '7133'
      creationTimestamp: '2022-01-01T15:05:07Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:07Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{main}'
    reason: Started
    message: Started container main
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:07Z'
    lastTimestamp: '2022-01-01T15:05:07Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''


RobinKa commented 2 years ago

Using the pns executor instead makes everything work, as described here:

kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user-pns | kubectl apply -f -

So I assume I made a mistake in my Docker setup? Then again, the manifests README does not say much about Docker.
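
For anyone hitting the same symptom: K3s uses containerd as its container runtime by default, so installing Docker on the host does not by itself make the pods run under Docker. That would be consistent with the `wait` container's `docker ps` calls returning `containers="map[]"` in the logs above, though this thread does not confirm it as the root cause. A minimal sketch for checking which runtime the nodes actually report is below; the function name `runtime_kind` and the sample version strings in the comments are illustrative assumptions, not taken from this issue.

```shell
#!/usr/bin/env bash
# Classify a containerRuntimeVersion string from a node's status,
# e.g. "containerd://1.4.11-k3s1" or "docker://20.10.12" (sample values).
runtime_kind() {
  case "$1" in
    docker://*)     echo "docker" ;;
    containerd://*) echo "containerd" ;;
    *)              echo "other" ;;
  esac
}

# Query the cluster only when kubectl is installed; ignore errors if it
# cannot reach a cluster.
if command -v kubectl >/dev/null 2>&1; then
  runtimes="$(kubectl get nodes \
    -o jsonpath='{.items[*].status.nodeInfo.containerRuntimeVersion}' 2>/dev/null || true)"
  for rt in $runtimes; do
    echo "$rt -> $(runtime_kind "$rt")"
  done
fi
```

If the nodes report `containerd`, an executor that talks to the Docker daemon would not see the pod's containers, which may explain why the pns executor behaves differently.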

zijianjoy commented 2 years ago

Hello @RobinKa, you can switch to the emissary executor, since it is going to be the default executor going forward: https://github.com/kubeflow/pipelines/issues/5714
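
A rough sketch of what that switch might look like with the manifests checkout used earlier in this thread. The pns overlay path is the one from the comment above; the emissary overlay path is an assumption that follows the same naming pattern and may not exist in every manifests release, so check the `apps/pipeline/upstream/env/` directory first. The helper names are hypothetical, and the script only prints the command rather than applying anything.

```shell
#!/usr/bin/env bash
# Map an executor name to a Kubeflow Pipelines overlay directory inside a
# kubeflow/manifests checkout.
# pns: path taken from this thread. emissary: ASSUMED path, verify it exists.
overlay_for_executor() {
  case "$1" in
    pns)      echo "apps/pipeline/upstream/env/platform-agnostic-multi-user-pns" ;;
    emissary) echo "apps/pipeline/upstream/env/platform-agnostic-multi-user-emissary" ;;
    *)        echo "unknown executor: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the command that would apply the chosen overlay.
print_apply_command() {
  local overlay
  overlay="$(overlay_for_executor "$1")" || return 1
  echo "kustomize build ${overlay} | kubectl apply -f -"
}

print_apply_command emissary
```

Printing the command instead of running it keeps the sketch safe to try; run the printed line from the root of the manifests checkout once the overlay directory has been confirmed.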

ketangangal commented 1 year ago

Even with the emissary executor set up properly, a component will sometimes get stuck.