kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] This step is in Error state with this message: task X errored: failed to resolve Y #10877

Open · tomaszstachera opened this issue 3 months ago

tomaszstachera commented 3 months ago

Environment

Steps to reproduce

Run the pipeline below (kfp==1.8.21):

import kfp
from kubernetes.client.models.v1_toleration import V1Toleration

# toleration so the pods can be scheduled onto a tainted node
toleration = V1Toleration(effect='NoSchedule', key='ComputeResources', value='reservedFor')

def sample_op(in_var: int) -> int:
    print(in_var)
    return in_var

sample_comp = kfp.components.func_to_container_op(
    func=sample_op,
    base_image='python:3.10-slim-buster',
)

@kfp.dsl.pipeline(
    name='ppln-from-vsc',
    description='A pipeline'
)
def ppln_from_vsc():
    ret = (
        sample_comp(1234)
        .set_memory_request('25Mi')
        .set_memory_limit('100Mi')
        .set_cpu_request('25m')
        .set_cpu_limit('50m')
        .add_toleration(toleration)
    )
    (
        sample_comp(ret.output)
        .set_memory_request('25Mi')
        .set_memory_limit('100Mi')
        .set_cpu_request('25m')
        .set_cpu_limit('50m')
        .add_toleration(toleration)
    )

client = kfp.Client()
resp = client.create_run_from_pipeline_func(
    ppln_from_vsc,
    arguments={},
    # namespace='tomasz'
)
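
To see exactly what the SDK generates (i.e. which Argo template placeholders end up in the submitted workflow), the same pipeline can also be compiled locally to YAML and inspected. A minimal sketch using the kfp 1.8.x compiler; the output file name is arbitrary:

from kfp import compiler

# Compile the pipeline function to the Argo Workflow YAML that the client submits.
compiler.Compiler().compile(ppln_from_vsc, 'ppln_from_vsc.yaml')

# Print every line that still contains an Argo template expression ({{...}}),
# to compare against the "failed to resolve" message in the run.
with open('ppln_from_vsc.yaml') as f:
    for line_no, line in enumerate(f, start=1):
        if '{{' in line:
            print(line_no, line.rstrip())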

Expected result

The pipeline should run and succeed. This error suddenly started appearing; the same pipeline worked before.

Materials and Reference

(screenshot: the step shown in Error state in the KFP UI)

Logs:

ml-pipeline

I0607 14:08:09.856967       7 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I0607 14:08:10.955394       7 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I0607 14:08:10.964429       7 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I0607 14:08:14.543407       7 interceptor.go:29] /api.RunService/ListRuns handler starting
I0607 14:08:14.543553       7 error.go:259] Invalid input error: ListRuns must filter by resource reference in multi-user mode.
github.com/kubeflow/pipelines/backend/src/common/util.NewInvalidInputError
        /go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:174
github.com/kubeflow/pipelines/backend/src/apiserver/server.(*RunServer).ListRuns
        /go/src/github.com/kubeflow/pipelines/backend/src/apiserver/server/run_server.go:181
github.com/kubeflow/pipelines/backend/api/go_client._RunService_ListRuns_Handler.func1
        /go/src/github.com/kubeflow/pipelines/backend/api/go_client/run.pb.go:2198
main.apiServerInterceptor
        /go/src/github.com/kubeflow/pipelines/backend/src/apiserver/interceptor.go:30
github.com/kubeflow/pipelines/backend/api/go_client._RunService_ListRuns_Handler
        /go/src/github.com/kubeflow/pipelines/backend/api/go_client/run.pb.go:2200
google.golang.org/grpc.(*Server).processUnaryRPC
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
google.golang.org/grpc.(*Server).handleStream
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
google.golang.org/grpc.(*Server).serveStreams.func1.2
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581
/api.RunService/ListRuns call failed
github.com/kubeflow/pipelines/backend/src/common/util.(*UserError).wrapf
        /go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:247
github.com/kubeflow/pipelines/backend/src/common/util.Wrapf
        /go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:272
main.apiServerInterceptor
        /go/src/github.com/kubeflow/pipelines/backend/src/apiserver/interceptor.go:32
github.com/kubeflow/pipelines/backend/api/go_client._RunService_ListRuns_Handler
        /go/src/github.com/kubeflow/pipelines/backend/api/go_client/run.pb.go:2200
google.golang.org/grpc.(*Server).processUnaryRPC
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
google.golang.org/grpc.(*Server).handleStream
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
google.golang.org/grpc.(*Server).serveStreams.func1.2
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581
I0607 14:08:14.660473       7 interceptor.go:29] /api.RunService/ListRuns handler starting
I0607 14:08:14.660605       7 util.go:360] Getting user identity...
I0607 14:08:14.660659       7 util.go:370] User: tomasz@siemens-energy.com, ResourceAttributes: &ResourceAttributes{Namespace:tomasz,Verb:list,Group:pipelines.kubeflow.org,Version:v1beta1,Resource:runs,Subresource:,Name:,}
I0607 14:08:14.660708       7 util.go:371] Authorizing request...
I0607 14:08:14.664698       7 util.go:378] Authorized user 'tomasz@siemens-energy.com': &ResourceAttributes{Namespace:tomasz,Verb:list,Group:pipelines.kubeflow.org,Version:v1beta1,Resource:runs,Subresource:,Name:,}
I0607 14:08:14.675650       7 interceptor.go:37] /api.RunService/ListRuns handler finished
I0607 14:08:14.748592       7 interceptor.go:29] /api.PipelineService/GetPipelineVersion handler starting
I0607 14:08:14.754037       7 interceptor.go:37] /api.PipelineService/GetPipelineVersion handler finished
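
The first ListRuns failure above ("ListRuns must filter by resource reference in multi-user mode") is raised whenever runs are listed without a namespace or experiment resource reference; the follow-up request that carries the namespace is authorized and finishes. If the same error ever shows up from the SDK side, passing the namespace explicitly avoids it. A minimal sketch, assuming a kfp 1.8.x client and the 'tomasz' profile namespace seen in these logs:

import kfp

client = kfp.Client()  # same client setup as in the reproduction above

# In multi-user mode the apiserver requires a resource reference, so list runs
# with the profile namespace set explicitly.
runs = client.list_runs(namespace='tomasz', page_size=10)
print([run.name for run in (runs.runs or [])])

# The run itself can also be created with the namespace passed explicitly,
# instead of relying on the client's default context.
resp = client.create_run_from_pipeline_func(
    ppln_from_vsc,  # pipeline function from the reproduction above
    arguments={},
    namespace='tomasz',
)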

ml-pipeline-ui

GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /pipeline/
GET /pipeline/static/css/main.58abdd36.css
GET /pipeline/static/js/main.0ba5be1b.js
GET /pipeline/apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
Proxied request:  /apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
GET /pipeline/system/cluster-name
GET /pipeline/system/project-id
(node:1) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/project/project-id failed, reason: getaddrinfo ENOTFOUND metadata
    at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (events.js:400:28)
    at Socket.socketErrorListener (_http_client.js:475:9)
    at Socket.emit (events.js:400:28)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:82:21)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 41)
(node:1) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo ENOTFOUND metadata
    at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (events.js:400:28)
    at Socket.socketErrorListener (_http_client.js:475:9)
    at Socket.emit (events.js:400:28)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:82:21)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 42)
GET /pipeline/apis/v1beta1/healthz
GET /pipeline/apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=tomasz&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
Proxied request:  /apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=tomasz&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/runs/f26acc4b-0b26-4f9e-8da7-b65c53fbceef
Proxied request:  /apis/v1beta1/runs/f26acc4b-0b26-4f9e-8da7-b65c53fbceef
GET /pipeline/visualizations/allowed
GET /pipeline/apis/v1beta1/runs/f26acc4b-0b26-4f9e-8da7-b65c53fbceef
Proxied request:  /apis/v1beta1/runs/f26acc4b-0b26-4f9e-8da7-b65c53fbceef
GET /pipeline/apis/v1beta1/experiments/b5294163-5bba-4186-9ac2-41d05beb2661
Proxied request:  /apis/v1beta1/experiments/b5294163-5bba-4186-9ac2-41d05beb2661
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /pipeline/k8s/pod?podname=ppln-from-vsc-wx98j-1984707146&podnamespace=tomasz
Could not get pod ppln-from-vsc-wx98j-1984707146 in namespace tomasz: pods "ppln-from-vsc-wx98j-1984707146" not found {
  kind: 'Status',
  apiVersion: 'v1',
  metadata: {},
  status: 'Failure',
  message: 'pods "ppln-from-vsc-wx98j-1984707146" not found',
  reason: 'NotFound',
  details: { name: 'ppln-from-vsc-wx98j-1984707146', kind: 'pods' },
  code: 404
}
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /pipeline/k8s/pod/events?podname=ppln-from-vsc-wx98j-1984707146&podnamespace=tomasz
GET /pipeline/k8s/pod/logs?podname=ppln-from-vsc-wx98j-1984707146&runid=f26acc4b-0b26-4f9e-8da7-b65c53fbceef&podnamespace=tomasz
GET /apis/v1beta1/healthz

Impacted by this bug? Give it a 👍.

tomaszstachera commented 3 months ago

Any update here?

tomaszstachera commented 1 month ago

Which component is responsible for the resolve operation mentioned in the error shown in the UI?

tomaszstachera commented 1 month ago

The root cause may lie in Argo, as the workflow-controller logs contain the same message shown in the UI:

workflow-controller time="2024-08-27T12:38:03.175Z" level=info msg="Processing workflow" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.176Z" level=info msg="Task-result reconciliation" namespace=tomasz numObjs=0 workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.176Z" level=info msg="All of node ppln-from-vsc-xkhhr.sample-op dependencies [] completed" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.177Z" level=info msg="Pod node ppln-from-vsc-xkhhr-1425665423 initialized Pending" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=warning msg="Non-transient error: failed to resolve {{`ppln-from-vsc-xkhhr`}}"
workflow-controller time="2024-08-27T12:38:03.178Z" level=error msg="Mark error node" error="failed to resolve {{`ppln-from-vsc-xkhhr`}}" namespace=tomasz nodeName=ppln-from-vsc-xkhhr.sampl
e-op workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 phase Pending -> Error" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 message: failed to resolve {{`ppln-from-vsc-xkhhr`}}" namespace=tomasz workflow=ppln-
from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 finished: 2024-08-27 12:38:03.178501556 +0000 UTC" namespace=tomasz workflow=ppln-fro
m-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=error msg="Mark error node" error="task 'ppln-from-vsc-xkhhr.sample-op' errored: failed to resolve {{`ppln-from-vsc-xkhhr`}}" names
pace=tomasz nodeName=ppln-from-vsc-xkhhr.sample-op workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 message: task 'ppln-from-vsc-xkhhr.sample-op' errored: failed to resolve {{`ppln-from
-vsc-xkhhr`}}" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Skipped node ppln-from-vsc-xkhhr-184939484 initialized Omitted (message: omitted: depends condition not met)" namespace=t
omasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Outbound nodes of ppln-from-vsc-xkhhr set to [ppln-from-vsc-xkhhr-184939484]" namespace=tomasz workflow=ppln-from-vsc-xkh
hr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr phase Running -> Error" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr finished: 2024-08-27 12:38:03.178728796 +0000 UTC" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Checking daemoned children of ppln-from-vsc-xkhhr" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="TaskSet Reconciliation" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg=reconcileAgentPod namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Updated phase Running -> Error" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Marking workflow completed" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Checking daemoned children of " namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Workflow to be dehydrated" Workflow Size=9953
workflow-controller time="2024-08-27T12:38:03.184Z" level=info msg="cleaning up pod" action=deletePod key=tomasz/ppln-from-vsc-xkhhr-1340600742-agent/deletePod
workflow-controller time="2024-08-27T12:38:03.188Z" level=info msg="Update workflows 200"
workflow-controller time="2024-08-27T12:38:03.189Z" level=info msg="Workflow update successful" namespace=tomasz phase=Error resourceVersion=724918529 workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.189Z" level=info msg="Queueing Error workflow tomasz/ppln-from-vsc-xkhhr for delete in 168h0m0s due to TTL"
workflow-controller time="2024-08-27T12:38:03.195Z" level=info msg="Delete pods 404"
workflow-controller time="2024-08-27T12:38:03.196Z" level=info msg="DeleteCollection workflowtaskresults 200"
workflow-controller time="2024-08-27T12:38:03.197Z" level=info msg="Patch events 200"