kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] "cannot save parameter" for cached steps #10729

Open hbelmiro opened 6 months ago

hbelmiro commented 6 months ago

Running a simple V2 pipeline more than once produces the following errors:

time="2024-04-23T12:22:21.218Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2024-04-23T12:22:21.218Z" level=info msg="/tmp/outputs/cached-decision -> /var/run/argo/outputs/parameters//tmp/outputs/cached-decision" argo=true
time="2024-04-23T12:22:21.218Z" level=error msg="cannot save parameter /tmp/outputs/condition" argo=true error="open /tmp/outputs/condition: no such file or directory"

Pipeline sample:

# PIPELINE DEFINITION
# Name: hello-pipeline
# Inputs:
#    recipient: str
# Outputs:
#    Output: str
components:
  comp-say-hello:
    executorLabel: exec-say-hello
    inputDefinitions:
      parameters:
        name:
          parameterType: STRING
    outputDefinitions:
      parameters:
        Output:
          parameterType: STRING
deploymentSpec:
  executors:
    exec-say-hello:
      container:
        args:
        - --executor_input
        - '{{$}}'
        - --function_to_execute
        - say_hello
        command:
        - sh
        - -c
        - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip ||\
          \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
          \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.7.0'\
          \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\
          $0\" \"$@\"\n"
        - sh
        - -ec
        - 'program_path=$(mktemp -d)

          printf "%s" "$0" > "$program_path/ephemeral_component.py"

          _KFP_RUNTIME=true python3 -m kfp.dsl.executor_main                         --component_module_path                         "$program_path/ephemeral_component.py"                         "$@"

          '
        - "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\
          \ *\n\ndef say_hello(name: str) -> str:\n    hello_text = f'Hello, {name}!'\n\
          \    print(hello_text)\n    return hello_text\n\n"
        image: python:3.7
pipelineInfo:
  name: hello-pipeline
root:
  dag:
    outputs:
      parameters:
        Output:
          valueFromParameter:
            outputParameterKey: Output
            producerSubtask: say-hello
    tasks:
      say-hello:
        cachingOptions:
          enableCache: true
        componentRef:
          name: comp-say-hello
        inputs:
          parameters:
            name:
              componentInputParameter: recipient
        taskInfo:
          name: say-hello
  inputDefinitions:
    parameters:
      recipient:
        parameterType: STRING
  outputDefinitions:
    parameters:
      Output:
        parameterType: STRING
schemaVersion: 2.1.0
sdkVersion: kfp-2.7.0

This is related to https://github.com/kubeflow/pipelines/issues/9678#issuecomment-2071361425.

Impacted by this bug? Give it a 👍.

hbelmiro commented 6 months ago

/assign @hbelmiro

leanaha commented 5 months ago

Hi @hbelmiro, any update on this?

I bumped my company's pipelines to make them compliant with KFP v2, and they are throwing these errors:

time="2024-06-07T17:29:06.435Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2024-06-07T17:29:06.436Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2024-06-07T17:29:06.436Z" level=error msg="cannot save parameter /tmp/outputs/cached-decision" argo=true error="open /tmp/outputs/cached-decision: no such file or directory"
time="2024-06-07T17:29:06.436Z" level=error msg="cannot save parameter /tmp/outputs/condition" argo=true error="open /tmp/outputs/condition: no such file or directory"

hbelmiro commented 5 months ago

Hi @leanaha. I still haven't had time to work on it. Feel free to send a PR if you know how to fix it. I can help with the review.

/unassign @hbelmiro

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

AndersBennedsgaard commented 3 months ago

Still relevant

hbelmiro commented 3 months ago

/lifecycle frozen
/remove-lifecycle stale

lost-io commented 2 months ago

(Potential solution; it may not be relevant.)

We had a similar issue in our cluster, which is based on Rancher Kubernetes Engine 2. The issue was not Kubeflow Pipelines itself, but the pipeline container being unable to communicate with the ml-pipeline controller because of network policies.

We applied something like this to the given Kubeflow profile namespace:

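apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-to-ml-pipeline-controller
  namespace: profile-namespace  # replace with your Kubeflow profile namespace
spec:
  podSelector: {}  # apply the policy to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - ports:
      - port: 8887
        protocol: TCP
      to:
      # Separate entries under "to" are ORed: any pod in the kubeflow namespace,
      # or pods with the labels below in this namespace.
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kubeflow
      - podSelector:
          matchLabels:
            app: ml-pipeline
            app.kubernetes.io/name: kubeflow-pipelines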

This may not be fine-grained enough, but you get the idea.
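To check whether a network policy is what blocks the pipeline pod, a quick TCP probe from inside the profile namespace can help. The following is a minimal sketch, not part of KFP: the host name `ml-pipeline.kubeflow.svc.cluster.local` is an assumption about how the service is exposed in a default install, and `can_connect` is a hypothetical helper. Only port 8887 comes from the policy above.

```python
import socket

# Assumption: in-cluster DNS name of the ml-pipeline service; adjust to your install.
ML_PIPELINE_HOST = "ml-pipeline.kubeflow.svc.cluster.local"
# Port taken from the NetworkPolicy in this comment.
ML_PIPELINE_PORT = 8887


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only shows that the port is reachable at the network level;
    it does not prove that the service behind it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Running `can_connect(ML_PIPELINE_HOST, ML_PIPELINE_PORT)` from a pod in the profile namespace before and after applying the policy should show whether egress to the controller is the problem.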


Running a recurring pipeline with the say-hello example:

Without the NetworkPolicy:

time="2024-08-16T10:38:06.360Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2024-08-16T10:38:06.360Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2024-08-16T10:38:06.360Z" level=error msg="cannot save parameter /tmp/outputs/cached-decision" argo=true error="open /tmp/outputs/cached-decision: no such file or directory"
time="2024-08-16T10:38:06.360Z" level=info msg="/tmp/outputs/condition -> /var/run/argo/outputs/parameters//tmp/outputs/condition" argo=true
Error: exit status 1

With the NetworkPolicy:

time="2024-08-16T10:39:46.856Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2024-08-16T10:39:46.856Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2024-08-16T10:39:46.856Z" level=info msg="/tmp/outputs/cached-decision -> /var/run/argo/outputs/parameters//tmp/outputs/cached-decision" argo=true
time="2024-08-16T10:39:46.856Z" level=info msg="/tmp/outputs/condition -> /var/run/argo/outputs/parameters//tmp/outputs/condition" argo=true

Hope this solves the issue for others.
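For anyone comparing runs the same way, here is a rough sketch that summarizes which output parameters were saved and which failed. It assumes the log lines have exactly the two shapes shown in the excerpts above ("cannot save parameter ..." and "... -> ..."). `summarize` is a hypothetical helper name, not something from Argo or KFP.

```python
import re

# Error line shape: level=error msg="cannot save parameter <path>"
FAILED = re.compile(r'level=error msg="cannot save parameter (?P<path>\S+)"')
# Success line shape: level=info msg="<path> -> <destination>"
SAVED = re.compile(r'level=info msg="(?P<path>\S+) -> \S+"')


def summarize(log_text: str) -> dict:
    """Group output-parameter paths by whether the log says they were saved."""
    result = {"saved": [], "failed": []}
    for line in log_text.splitlines():
        failed = FAILED.search(line)
        if failed:
            result["failed"].append(failed.group("path"))
            continue
        saved = SAVED.search(line)
        if saved:
            result["saved"].append(saved.group("path"))
    return result
```

Passing each excerpt to `summarize` makes the difference easier to see: without the policy, `cached-decision` is in the failed list, and with it, `cached-decision` moves to the saved list. `pod-spec-patch` fails in both runs.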

hbelmiro commented 2 months ago

/assign