kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] This step is in Error state with this message: task X errored: failed to resolve Y #10877

Open tomaszstachera opened 5 months ago

tomaszstachera commented 5 months ago

Environment

Steps to reproduce

Run the pipeline below (kfp==1.8.21):

import kfp
from kubernetes.client.models.v1_toleration import V1Toleration

# tolerate the ComputeResources=reservedFor:NoSchedule taint so pods can schedule on tainted nodes
toleration = V1Toleration(effect='NoSchedule', key='ComputeResources', value='reservedFor')

def sample_op(in_var: int) -> int:
    print(in_var)
    return in_var

sample_comp = kfp.components.func_to_container_op(
    func=sample_op,
    base_image='python:3.10-slim-buster',
)

@kfp.dsl.pipeline(
    name='ppln-from-vsc',
    description='A pipeline'
)
def ppln_from_vsc():
    ret = (
        sample_comp(1234)
        .set_memory_request('25Mi')
        .set_memory_limit('100Mi')
        .set_cpu_request('25m')
        .set_cpu_limit('50m')
        .add_toleration(toleration)
    )
    (
        sample_comp(ret.output)
        .set_memory_request('25Mi')
        .set_memory_limit('100Mi')
        .set_cpu_request('25m')
        .set_cpu_limit('50m')
        .add_toleration(toleration)
    )

client = kfp.Client()
resp = client.create_run_from_pipeline_func(
    ppln_from_vsc,
    arguments={},
    # namespace='tomasz'
)
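One way to narrow this down is to compile the pipeline locally (e.g. with `kfp.compiler.Compiler().compile(ppln_from_vsc, 'ppln.yaml')` in kfp 1.x) and scan the generated Argo manifest for `{{...}}` template tags, since the error message quotes an unresolved tag. This is a diagnostic sketch; the `find_template_tags` helper and the sample manifest text are illustrative, not part of KFP:

```python
import re

def find_template_tags(manifest_text: str) -> list:
    """Return all Argo-style {{...}} template tags found in a manifest string."""
    return re.findall(r"\{\{[^{}]+\}\}", manifest_text)

# With kfp 1.x installed, manifest_text could be read back from the YAML
# produced by kfp.compiler.Compiler().compile(...). A small sample here:
sample = "args: ['--in-var', '{{inputs.parameters.in_var}}', '{{`ppln-from-vsc`}}']"
print(find_template_tags(sample))
# ['{{inputs.parameters.in_var}}', '{{`ppln-from-vsc`}}']
```

A tag that does not look like a normal `workflow.*`, `inputs.*`, or `tasks.*` reference (such as the backtick-quoted one in the error) would be a strong hint that the compiled spec itself is malformed.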

Expected result

The pipeline should run and succeed. This suddenly started happening; it worked before.

Materials and Reference

[screenshot: failed run in the Pipelines UI]

Logs:

ml-pipeline

I0607 14:08:09.856967       7 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I0607 14:08:10.955394       7 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I0607 14:08:10.964429       7 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I0607 14:08:14.543407       7 interceptor.go:29] /api.RunService/ListRuns handler starting
I0607 14:08:14.543553       7 error.go:259] Invalid input error: ListRuns must filter by resource reference in multi-user mode.
github.com/kubeflow/pipelines/backend/src/common/util.NewInvalidInputError
        /go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:174
github.com/kubeflow/pipelines/backend/src/apiserver/server.(*RunServer).ListRuns
        /go/src/github.com/kubeflow/pipelines/backend/src/apiserver/server/run_server.go:181
github.com/kubeflow/pipelines/backend/api/go_client._RunService_ListRuns_Handler.func1
        /go/src/github.com/kubeflow/pipelines/backend/api/go_client/run.pb.go:2198
main.apiServerInterceptor
        /go/src/github.com/kubeflow/pipelines/backend/src/apiserver/interceptor.go:30
github.com/kubeflow/pipelines/backend/api/go_client._RunService_ListRuns_Handler
        /go/src/github.com/kubeflow/pipelines/backend/api/go_client/run.pb.go:2200
google.golang.org/grpc.(*Server).processUnaryRPC
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
google.golang.org/grpc.(*Server).handleStream
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
google.golang.org/grpc.(*Server).serveStreams.func1.2
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581
/api.RunService/ListRuns call failed
github.com/kubeflow/pipelines/backend/src/common/util.(*UserError).wrapf
        /go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:247
github.com/kubeflow/pipelines/backend/src/common/util.Wrapf
        /go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:272
main.apiServerInterceptor
        /go/src/github.com/kubeflow/pipelines/backend/src/apiserver/interceptor.go:32
github.com/kubeflow/pipelines/backend/api/go_client._RunService_ListRuns_Handler
        /go/src/github.com/kubeflow/pipelines/backend/api/go_client/run.pb.go:2200
google.golang.org/grpc.(*Server).processUnaryRPC
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
google.golang.org/grpc.(*Server).handleStream
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
google.golang.org/grpc.(*Server).serveStreams.func1.2
        /go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581
I0607 14:08:14.660473       7 interceptor.go:29] /api.RunService/ListRuns handler starting
I0607 14:08:14.660605       7 util.go:360] Getting user identity...
I0607 14:08:14.660659       7 util.go:370] User: tomasz@siemens-energy.com, ResourceAttributes: &ResourceAttributes{Namespace:tomasz,Verb:list,Group:pipelines.kubeflow.org,Version:v1beta1,Resource:runs,Subresource:,Name:,}
I0607 14:08:14.660708       7 util.go:371] Authorizing request...
I0607 14:08:14.664698       7 util.go:378] Authorized user 'tomasz@siemens-energy.com': &ResourceAttributes{Namespace:tomasz,Verb:list,Group:pipelines.kubeflow.org,Version:v1beta1,Resource:runs,Subresource:,Name:,}
I0607 14:08:14.675650       7 interceptor.go:37] /api.RunService/ListRuns handler finished
I0607 14:08:14.748592       7 interceptor.go:29] /api.PipelineService/GetPipelineVersion handler starting
I0607 14:08:14.754037       7 interceptor.go:37] /api.PipelineService/GetPipelineVersion handler finished

ml-pipeline-ui

GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /pipeline/
GET /pipeline/static/css/main.58abdd36.css
GET /pipeline/static/js/main.0ba5be1b.js
GET /pipeline/apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
Proxied request:  /apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
GET /pipeline/system/cluster-name
GET /pipeline/system/project-id
(node:1) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/project/project-id failed, reason: getaddrinfo ENOTFOUND metadata
    at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (events.js:400:28)
    at Socket.socketErrorListener (_http_client.js:475:9)
    at Socket.emit (events.js:400:28)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:82:21)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 41)
(node:1) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo ENOTFOUND metadata
    at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (events.js:400:28)
    at Socket.socketErrorListener (_http_client.js:475:9)
    at Socket.emit (events.js:400:28)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:82:21)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 42)
GET /pipeline/apis/v1beta1/healthz
GET /pipeline/apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=tomasz&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
Proxied request:  /apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=tomasz&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
Proxied request:  /apis/v1beta1/pipeline_versions/5ff01b86-c9b9-45f1-8854-541f851c769c
GET /pipeline/apis/v1beta1/runs/f26acc4b-0b26-4f9e-8da7-b65c53fbceef
Proxied request:  /apis/v1beta1/runs/f26acc4b-0b26-4f9e-8da7-b65c53fbceef
GET /pipeline/visualizations/allowed
GET /pipeline/apis/v1beta1/runs/f26acc4b-0b26-4f9e-8da7-b65c53fbceef
Proxied request:  /apis/v1beta1/runs/f26acc4b-0b26-4f9e-8da7-b65c53fbceef
GET /pipeline/apis/v1beta1/experiments/b5294163-5bba-4186-9ac2-41d05beb2661
Proxied request:  /apis/v1beta1/experiments/b5294163-5bba-4186-9ac2-41d05beb2661
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /pipeline/k8s/pod?podname=ppln-from-vsc-wx98j-1984707146&podnamespace=tomasz
Could not get pod ppln-from-vsc-wx98j-1984707146 in namespace tomasz: pods "ppln-from-vsc-wx98j-1984707146" not found {
  kind: 'Status',
  apiVersion: 'v1',
  metadata: {},
  status: 'Failure',
  message: 'pods "ppln-from-vsc-wx98j-1984707146" not found',
  reason: 'NotFound',
  details: { name: 'ppln-from-vsc-wx98j-1984707146', kind: 'pods' },
  code: 404
}
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /pipeline/k8s/pod/events?podname=ppln-from-vsc-wx98j-1984707146&podnamespace=tomasz
GET /pipeline/k8s/pod/logs?podname=ppln-from-vsc-wx98j-1984707146&runid=f26acc4b-0b26-4f9e-8da7-b65c53fbceef&podnamespace=tomasz
GET /apis/v1beta1/healthz

Impacted by this bug? Give it a 👍.

tomaszstachera commented 4 months ago

Any update here?

tomaszstachera commented 2 months ago

Which component is responsible for the "resolve" operation referenced by the error shown in the UI?

tomaszstachera commented 2 months ago

The root cause may lie in Argo, as the workflow-controller logs contain the same message shown in the UI:

workflow-controller time="2024-08-27T12:38:03.175Z" level=info msg="Processing workflow" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.176Z" level=info msg="Task-result reconciliation" namespace=tomasz numObjs=0 workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.176Z" level=info msg="All of node ppln-from-vsc-xkhhr.sample-op dependencies [] completed" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.177Z" level=info msg="Pod node ppln-from-vsc-xkhhr-1425665423 initialized Pending" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=warning msg="Non-transient error: failed to resolve {{`ppln-from-vsc-xkhhr`}}"
workflow-controller time="2024-08-27T12:38:03.178Z" level=error msg="Mark error node" error="failed to resolve {{`ppln-from-vsc-xkhhr`}}" namespace=tomasz nodeName=ppln-from-vsc-xkhhr.sample-op workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 phase Pending -> Error" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 message: failed to resolve {{`ppln-from-vsc-xkhhr`}}" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 finished: 2024-08-27 12:38:03.178501556 +0000 UTC" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=error msg="Mark error node" error="task 'ppln-from-vsc-xkhhr.sample-op' errored: failed to resolve {{`ppln-from-vsc-xkhhr`}}" namespace=tomasz nodeName=ppln-from-vsc-xkhhr.sample-op workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 message: task 'ppln-from-vsc-xkhhr.sample-op' errored: failed to resolve {{`ppln-from-vsc-xkhhr`}}" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Skipped node ppln-from-vsc-xkhhr-184939484 initialized Omitted (message: omitted: depends condition not met)" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Outbound nodes of ppln-from-vsc-xkhhr set to [ppln-from-vsc-xkhhr-184939484]" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr phase Running -> Error" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr finished: 2024-08-27 12:38:03.178728796 +0000 UTC" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Checking daemoned children of ppln-from-vsc-xkhhr" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="TaskSet Reconciliation" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg=reconcileAgentPod namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Updated phase Running -> Error" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Marking workflow completed" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Checking daemoned children of " namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Workflow to be dehydrated" Workflow Size=9953
workflow-controller time="2024-08-27T12:38:03.184Z" level=info msg="cleaning up pod" action=deletePod key=tomasz/ppln-from-vsc-xkhhr-1340600742-agent/deletePod
workflow-controller time="2024-08-27T12:38:03.188Z" level=info msg="Update workflows 200"
workflow-controller time="2024-08-27T12:38:03.189Z" level=info msg="Workflow update successful" namespace=tomasz phase=Error resourceVersion=724918529 workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.189Z" level=info msg="Queueing Error workflow tomasz/ppln-from-vsc-xkhhr for delete in 168h0m0s due to TTL"
workflow-controller time="2024-08-27T12:38:03.195Z" level=info msg="Delete pods 404"
workflow-controller time="2024-08-27T12:38:03.196Z" level=info msg="DeleteCollection workflowtaskresults 200"
workflow-controller time="2024-08-27T12:38:03.197Z" level=info msg="Patch events 200"
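For context on the "failed to resolve" wording: Argo's controller substitutes `{{...}}` tags from a scope of known parameters and marks the node as errored when a tag has no entry in that scope. A rough sketch of that behavior, assuming a simplified scope model (the `resolve` helper and the parameter names are illustrative only, not Argo's actual implementation):

```python
import re

def resolve(template: str, scope: dict) -> str:
    """Substitute {{tag}} occurrences from scope; raise if a tag is unknown,
    mirroring the controller's 'failed to resolve' error."""
    def sub(match):
        tag = match.group(1).strip()
        if tag not in scope:
            raise ValueError(f"failed to resolve {{{{{tag}}}}}")
        return str(scope[tag])
    return re.sub(r"\{\{([^{}]+)\}\}", sub, template)

scope = {"workflow.name": "ppln-from-vsc-xkhhr"}
print(resolve("run {{workflow.name}}", scope))  # run ppln-from-vsc-xkhhr
try:
    # A backtick-quoted tag like the one in the logs has no entry in any scope.
    resolve("run {{`ppln-from-vsc-xkhhr`}}", scope)
except ValueError as e:
    print(e)  # failed to resolve {{`ppln-from-vsc-xkhhr`}}
```

Under this reading, the question becomes why the compiled workflow spec contains a backtick-quoted workflow name as a template tag in the first place, which would point back at the KFP compiler or API server rather than Argo itself.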
github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.