kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] Panic while connection to default cache endpoint ml-pipeline.kubeflow:8887 #9702

Open andre-lx opened 1 year ago

andre-lx commented 1 year ago

Environment

Steps to reproduce

Hello, we are trying to migrate from pipelines 1.8.5 to 2.0.0, but after applying the new manifests we are having some issues.

Running the "hello world" example from JupyterLab:

import kfp
from kfp import dsl

@dsl.component
def say_hello(name: str) -> str:
    hello_text = f'Hello, {name}!'
    print(hello_text)
    return hello_text

@dsl.pipeline
def hello_pipeline(recipient: str) -> str:
    hello_task = say_hello(name=recipient)
    return hello_task.output

from kfp import compiler

compiler.Compiler().compile(hello_pipeline, 'pipeline.yaml')

from kfp.client import Client

client = Client()
run = client.create_run_from_pipeline_package(
    'pipeline.yaml',
    arguments={
        'recipient': 'World',
    },
)

Whether we run it this way or upload the generated pipeline.yaml directly through the UI, we always get the following error on the third pod that is started:

time="2023-07-05T14:19:23.912Z" level=info msg="capturing logs" argo=true
time="2023-07-05T14:19:23.945Z" level=info msg="capturing logs" argo=true
I0705 14:19:23.966873      51 launcher_v2.go:90] input ComponentSpec:{
  "inputDefinitions": {
    "parameters": {
      "name": {
        "parameterType": "STRING"
      }
    }
  },
  "outputDefinitions": {
    "parameters": {
      "Output": {
        "parameterType": "STRING"
      }
    }
  },
  "executorLabel": "exec-say-hello"
}
I0705 14:19:23.967498      51 cache.go:139] Cannot detect ml-pipeline in the same namespace, default to ml-pipeline.kubeflow:8887 as KFP endpoint.
I0705 14:19:23.967512      51 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x941c29]

goroutine 1 [running]:
github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000b29920, {0x20a4878, 0xc000058040}, 0x0, 0x0, {0x0, 0x0, 0xc000b60000?}, 0x4)
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).publish(0x1d3c167?, {0x20a4878?, 0xc000058040?}, 0x1?, 0x1?, {0x0?, 0x1a51660?, 0xc0006a63a0?}, 0xc73bb0?)
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:266 +0x9b
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute.func2()
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:144 +0x65
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute(0xc00028e540, {0x20a4878, 0xc000058040})
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:156 +0x91e
main.run()
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:98 +0x3ed
main.main()
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:47 +0x19
time="2023-07-05T14:19:24.950Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
time="2023-07-05T14:19:25.918Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2

The service ml-pipeline.kubeflow:8887 exists.
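That the Service object exists does not by itself show that the endpoint is reachable from the failing pod. A minimal sketch of a direct TCP check using only the Python standard library follows; `can_connect` is a hypothetical helper name, and the host and port are the ones from the log above. It opens a plain socket and ignores any proxy environment variables, so a successful result only shows that the direct network path works.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


# Example, run from a pod in the namespace where the pipeline pods start:
#   print(can_connect("ml-pipeline.kubeflow", 8887))
```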

Everything works great on version 1.8.5.

If you need the logs from the other two pods, please let me know. I also checked the logs of all the kubeflow services and couldn't find any issue.

Impacted by this bug? Give it a 👍.

zijianjoy commented 1 year ago

/assign @Linchin

Linchin commented 1 year ago

Hi @andre-lx, thank you for bringing up this issue. I tried the same pipeline on a newly deployed 2.0.0 cluster, and the run finished without issue. Looking at the log you provided, we have

github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000b29920, {0x20a4878, 0xc000058040}, 0x0, 0x0, {0x0, 0x0, 0xc000b60000?}, 0x4) /go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69

The metadata client seems to come from version 2.0.0-rc.2 instead of version 2.0.0. Could you double-check that you applied the manifest of version 2.0.0? Try applying the manifest again (here) and see if the issue persists.

Linchin commented 1 year ago

Also, could you let me know which way you used to deploy KFP, standalone or via kubeflow?

andre-lx commented 1 year ago

Hi @Linchin, I just checked and we are using the following image: https://github.com/kubeflow/pipelines/blob/e03e31219387b587b700ba3e31a02df486aa364f/manifests/kustomize/base/metadata/base/kustomization.yaml#L10-L12

The deployment was done using the following file: https://github.com/kubeflow/pipelines/blob/2.0.0/manifests/kustomize/env/platform-agnostic-multi-user/kustomization.yaml

Thanks

nithin8702 commented 1 year ago

Hi @andre-lx @Linchin Same issue we are also facing. Did you get a chance to fix it?

andre-lx commented 1 year ago

Hi @andre-lx @Linchin Same issue we are also facing. Did you get a chance to fix it?

I had to revert it to 1.8.5 for now.

nithin8702 commented 1 year ago

halilagin commented 1 year ago

I have the same error. Here are the details.

  1. Running in standalone mode
  2. Running in virtual cluster (everything is working but cannot run pipelines)
  3. All pods are working
  4. I can upload and run pipelines on UI, but the pod is failing
  5. Using the pipelines version 2.0.0
  6. Generating the pipeline with the command below:

     kfp dsl compile --py v2/hello_world.py --output hello_world.pipeline.json

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

pffijt commented 12 months ago

I also have this issue in my Kubeflow 1.8 environment. Kubeflow 1.8 is using the pipelines backend 2.0.3

I released my environment with the kubeflow manifest 1.8.

Can someone fix this issue?

taiynlee commented 7 months ago

the same issue on kubeflow 1.8

svn123 commented 6 months ago

I have faced a similar issue. I have a full Kubeflow 1.8 environment installed, and the pipeline backend metadata envoy is version 2.0.3. Is this issue resolved?

umka1332 commented 6 months ago

I've faced a similar issue, and it was caused by the proxy settings on the pod/step. After removing the proxy settings, the issue was gone.

pschoen-itsc commented 5 months ago

@umka1332 This solved the problem for me as well. But do you know how I can still set proxy env vars to connect to the internet?

pschoen-itsc commented 5 months ago

Just tested successfully that setting NO_PROXY to '*.kubeflow,*.local' seems to work together with http(s)_proxy. It makes sense that the connection to ml-pipeline fails without NO_PROXY because then all traffic will be routed through the given proxy. It is just strange that it has seemed to work before updating kubeflow.
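To illustrate why those NO_PROXY entries keep in-cluster traffic off the proxy, here is a simplified sketch of NO_PROXY-style suffix matching. It is not the launcher's actual implementation, and real HTTP/gRPC clients differ in details (for example, whether a leading "*." is accepted, or how IPv6 literals are handled); `bypasses_proxy` is a hypothetical name used only for this illustration.

```python
def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if `host` matches an entry of a NO_PROXY-style list.

    Simplified illustration: entries are treated as domain suffixes,
    and "*.kubeflow" and ".kubeflow" are handled the same way.
    """
    hostname = host.rsplit(":", 1)[0].lower()  # drop an optional :port
    for entry in no_proxy.split(","):
        entry = entry.strip().lower()
        if not entry:
            continue
        if entry == "*":
            # A lone "*" conventionally disables the proxy for every host.
            return True
        suffix = entry.lstrip("*").lstrip(".")
        if hostname == suffix or hostname.endswith("." + suffix):
            return True
    return False


# In-cluster service names end in ".kubeflow", so they skip the proxy;
# external hosts such as a package index still go through it.
print(bypasses_proxy("ml-pipeline.kubeflow:8887", "*.kubeflow,*.local"))
print(bypasses_proxy("pypi.org:443", "*.kubeflow,*.local"))
```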

gregsheremeta commented 3 months ago

If anyone following this can reliably reproduce this issue...

we always get the following error on the third pod that is started

I also need to see the log on the second pod (driver) that is started. Thanks.

suanshs commented 3 months ago

@umka1332 This solved the problem for me also. But do you know a way how I can still set proxy env vars to connect to the internet?

Just tested successfully that setting NO_PROXY to '*.kubeflow,*.local' seems to work together with http(s)_proxy. It makes sense that the connection to ml-pipeline fails without NO_PROXY because then all traffic will be routed through the given proxy. It is just strange that it has seemed to work before updating kubeflow.

How did you solve this? I tried to set the no_proxy environment variables but it did not work for me. @umka1332

pschoen-itsc commented 3 months ago

@umka1332 This solved the problem for me also. But do you know a way how I can still set proxy env vars to connect to the internet?

Just tested successfully that setting NO_PROXY to '*.kubeflow,*.local' seems to work together with http(s)_proxy. It makes sense that the connection to ml-pipeline fails without NO_PROXY because then all traffic will be routed through the given proxy. It is just strange that it has seemed to work before updating kubeflow.

How did you solve this? I tried to set the no_proxy environment variables but it did not work for me. @umka1332

It is important to set NO_PROXY in all uppercase. I also had to add the kube api-server IP to NO_PROXY.

stevenkitter commented 3 months ago

1.8.1 kubeflow has the same problem....

stevenkitter commented 3 months ago

I solved this problem by removing the proxy settings. You have to remove the proxy; if your components need extra packages, build an image that already contains them and use that.

suanshs commented 3 months ago

from kfp import dsl
from kfp import compiler

@dsl.component()
def say_hello() :
    import time
    time.sleep(1900)
    hello_text = f'Hello!'
    print(hello_text)

@dsl.pipeline
def hello_pipeline():
    hello_task = say_hello()
    hello_task.set_env_variable(name='NO_PROXY', value='*.kubeflow,*.local')
    hello_task.set_env_variable(name='no_proxy', value='*.kubeflow,*.local')
    hello_task.set_caching_options(False)

compiler.Compiler().compile(hello_pipeline, package_path='pipeline.yaml')

I tried running this but it did not work for me. Is there something I am missing here? @pschoen-itsc @umka1332

pschoen-itsc commented 3 months ago

@suanshs Seems like you are having a different problem. If you don't have any proxies set to begin with, then you also should not need the NO_PROXY settings. Can you provide logs of all the containers of the failing pod?

suanshs commented 3 months ago

@pschoen-itsc Following are the logs from main container of the failing pod

time="2024-08-28T14:19:16.866Z" level=info msg="capturing logs" argo=true
time="2024-08-28T14:19:16.900Z" level=info msg="capturing logs" argo=true
I0828 14:19:16.922099      53 launcher_v2.go:90] input ComponentSpec:{
  "executorLabel": "exec-say-hello"
}
I0828 14:19:16.922671      53 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x941c29]

goroutine 1 [running]:
github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000afc720, {0x20a4878, 0xc000196000}, 0x0, 0x0, {0x0, 0x0, 0xc0004dc000?}, 0x4)
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).publish(0x467387?, {0x20a4878?, 0xc000196000?}, 0x1?, 0x1?, {0x0?, 0x1a51660?, 0xc0004c6060?}, 0xbbfbb0?)
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:266 +0x9b
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute.func2()
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:144 +0x65
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute(0xc000306460, {0x20a4878, 0xc000196000})
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:156 +0x91e
main.run()
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:98 +0x3ed
main.main()
    /go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:47 +0x19
time="2024-08-28T14:19:17.903Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
time="2024-08-28T14:19:18.871Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2

Following are the logs from wait container

time="2024-08-28T14:19:16.138Z" level=info msg="Starting Workflow Executor" executorType=emissary version=v3.3.10
time="2024-08-28T14:19:16.141Z" level=info msg="Creating a emissary executor"
time="2024-08-28T14:19:16.141Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-08-28T14:19:16.141Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=kubeflow podName=hello-pipeline-2clrb-1334336905 template="{\"name\":\"system-container-impl\",\"inputs\":{\"parameters\":[{\"name\":\"pod-spec-patch\",\"value\":\"{\\\"containers\\\":[{\\\"name\\\":\\\"main\\\",\\\"image\\\":\\\"docker-dev-artifactory.workday.com/ml/kubeflow/python-3.7:latest\\\",\\\"command\\\":[\\\"/var/run/argo/argoexec\\\",\\\"emissary\\\",\\\"--\\\",\\\"/kfp-launcher/launch\\\",\\\"--pipeline_name\\\",\\\"hello-pipeline\\\",\\\"--run_id\\\",\\\"5610709d-50b9-4833-8e2d-7e72a19a97ec\\\",\\\"--execution_id\\\",\\\"91\\\",\\\"--executor_input\\\",\\\"{\\\\\\\"inputs\\\\\\\":{},\\\\\\\"outputs\\\\\\\":{\\\\\\\"outputFile\\\\\\\":\\\\\\\"/tmp/kfp_outputs/output_metadata.json\\\\\\\"}}\\\",\\\"--component_spec\\\",\\\"{\\\\\\\"executorLabel\\\\\\\":\\\\\\\"exec-say-hello\\\\\\\"}\\\",\\\"--pod_name\\\",\\\"$(KFP_POD_NAME)\\\",\\\"--pod_uid\\\",\\\"$(KFP_POD_UID)\\\",\\\"--mlmd_server_address\\\",\\\"$(METADATA_GRPC_SERVICE_HOST)\\\",\\\"--mlmd_server_port\\\",\\\"tcp://10.100.242.77:8080\\\",\\\"--\\\"],\\\"args\\\":[\\\"sh\\\",\\\"-c\\\",\\\"\\\\nif ! 
[ -x \\\\\\\"$(command -v pip)\\\\\\\" ]; then\\\\n    python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\\\\nfi\\\\n\\\\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet     --no-warn-script-location 'kfp==2.0.1' \\\\u0026\\\\u0026 \\\\\\\"$0\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"sh\\\",\\\"-ec\\\",\\\"program_path=$(mktemp -d)\\\\nprintf \\\\\\\"%s\\\\\\\" \\\\\\\"$0\\\\\\\" \\\\u003e \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"\\\\npython3 -m kfp.components.executor_main                         --component_module_path                         \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"                         \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"\\\\nimport kfp\\\\nfrom kfp import dsl\\\\nfrom kfp.dsl import *\\\\nfrom typing import *\\\\n\\\\ndef say_hello() :\\\\n    import time\\\\n    time.sleep(1900)\\\\n    hello_text = f'Hello, Suansh!'\\\\n    print(hello_text)\\\\n\\\\n\\\",\\\"--executor_input\\\",\\\"{{$}}\\\",\\\"--function_to_execute\\\",\\\"say_hello\\\"],\\\"env\\\":[{\\\"name\\\":\\\"NO_PROXY\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"},{\\\"name\\\":\\\"no_proxy\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"}],\\\"resources\\\":{}}]}\"}]},\"outputs\":{},\"metadata\":{\"annotations\":{\"sidecar.istio.io/inject\":\"false\"}},\"container\":{\"name\":\"\",\"image\":\"gcr.io/ml-pipeline/should-be-overridden-during-runtime\",\"command\":[\"should-be-overridden-during-runtime\"],\"envFrom\":[{\"configMapRef\":{\"name\":\"metadata-grpc-configmap\",\"optional\":true}}],\"env\":[{\"name\":\"KFP_POD_NAME\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.name\"}}},{\"name\":\"KFP_POD_UID\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.uid\"}}}],\"resources\":{},\"volumeMounts\":[{\"name\":\"kfp-launcher\",\"mountPath\":\"/kfp-launcher\"}]},\"volumes\":[{\"name\":\"kfp-launcher\",\"emptyDir\":{}}],\"initContainers\":[{\"name\":\"kfp-launcher\",\"image\":\"gcr.io/ml
-pipeline/kfp-launcher@sha256:80cf120abd125db84fa547640fd6386c4b2a26936e0c2b04a7d3634991a850a4\",\"command\":[\"launcher-v2\",\"--copy\",\"/kfp-launcher/launch\"],\"resources\":{\"limits\":{\"cpu\":\"500m\",\"memory\":\"128Mi\"},\"requests\":{\"cpu\":\"100m\"}},\"volumeMounts\":[{\"name\":\"kfp-launcher\",\"mountPath\":\"/kfp-launcher\"}]}],\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905\"}},\"podSpecPatch\":\"{\\\"containers\\\":[{\\\"name\\\":\\\"main\\\",\\\"image\\\":\\\"docker-dev-artifactory.workday.com/ml/kubeflow/python-3.7:latest\\\",\\\"command\\\":[\\\"/var/run/argo/argoexec\\\",\\\"emissary\\\",\\\"--\\\",\\\"/kfp-launcher/launch\\\",\\\"--pipeline_name\\\",\\\"hello-pipeline\\\",\\\"--run_id\\\",\\\"5610709d-50b9-4833-8e2d-7e72a19a97ec\\\",\\\"--execution_id\\\",\\\"91\\\",\\\"--executor_input\\\",\\\"{\\\\\\\"inputs\\\\\\\":{},\\\\\\\"outputs\\\\\\\":{\\\\\\\"outputFile\\\\\\\":\\\\\\\"/tmp/kfp_outputs/output_metadata.json\\\\\\\"}}\\\",\\\"--component_spec\\\",\\\"{\\\\\\\"executorLabel\\\\\\\":\\\\\\\"exec-say-hello\\\\\\\"}\\\",\\\"--pod_name\\\",\\\"$(KFP_POD_NAME)\\\",\\\"--pod_uid\\\",\\\"$(KFP_POD_UID)\\\",\\\"--mlmd_server_address\\\",\\\"$(METADATA_GRPC_SERVICE_HOST)\\\",\\\"--mlmd_server_port\\\",\\\"tcp://10.100.242.77:8080\\\",\\\"--\\\"],\\\"args\\\":[\\\"sh\\\",\\\"-c\\\",\\\"\\\\nif ! 
[ -x \\\\\\\"$(command -v pip)\\\\\\\" ]; then\\\\n    python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\\\\nfi\\\\n\\\\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet     --no-warn-script-location 'kfp==2.0.1' \\\\u0026\\\\u0026 \\\\\\\"$0\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"sh\\\",\\\"-ec\\\",\\\"program_path=$(mktemp -d)\\\\nprintf \\\\\\\"%s\\\\\\\" \\\\\\\"$0\\\\\\\" \\\\u003e \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"\\\\npython3 -m kfp.components.executor_main                         --component_module_path                         \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"                         \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"\\\\nimport kfp\\\\nfrom kfp import dsl\\\\nfrom kfp.dsl import *\\\\nfrom typing import *\\\\n\\\\ndef say_hello() :\\\\n    import time\\\\n    time.sleep(1900)\\\\n    hello_text = f'Hello, Suansh!'\\\\n    print(hello_text)\\\\n\\\\n\\\",\\\"--executor_input\\\",\\\"{{$}}\\\",\\\"--function_to_execute\\\",\\\"say_hello\\\"],\\\"env\\\":[{\\\"name\\\":\\\"NO_PROXY\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"},{\\\"name\\\":\\\"no_proxy\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"}],\\\"resources\\\":{}}]}\"}" version="&Version{Version:v3.3.10,BuildDate:2022-11-29T18:18:30Z,GitCommit:b19870d737a14b21d86f6267642a63dd14e5acd5,GitTag:v3.3.10,GitTreeState:clean,GoVersion:go1.17.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-08-28T14:19:16.141Z" level=info msg="Starting deadline monitor"
time="2024-08-28T14:19:18.142Z" level=info msg="Main container completed"
time="2024-08-28T14:19:18.142Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-08-28T14:19:18.142Z" level=info msg="Saving logs"
time="2024-08-28T14:19:18.142Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905/main.log"
time="2024-08-28T14:19:18.142Z" level=info msg="Creating minio client using static credentials" endpoint="minio.kubeflow:9000"
time="2024-08-28T14:19:18.142Z" level=info msg="Saving file to s3" bucket=mlpipeline endpoint="minio.kubeflow:9000" key=artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-08-28T14:19:18.151Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-08-28T14:19:18.151Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-08-28T14:19:18.151Z" level=info msg="No output parameters"
time="2024-08-28T14:19:18.151Z" level=info msg="No output artifacts"
time="2024-08-28T14:19:18.168Z" level=info msg="Create workflowtaskresults 201"
time="2024-08-28T14:19:18.169Z" level=info msg="Killing sidecars []"
time="2024-08-28T14:19:18.169Z" level=info msg="Alloc=6749 TotalAlloc=12722 Sys=24786 NumGC=4 Goroutines=9"

Following are the logs from

pschoen-itsc commented 3 months ago

@suanshs Do you also have logs of the istio sidecar or do you have no istio deployed?

mmazurekgda commented 2 months ago

Just tested successfully that setting NO_PROXY to '*.kubeflow,*.local' seems to work together with http(s)_proxy. It makes sense that the connection to ml-pipeline fails without NO_PROXY because then all traffic will be routed through the given proxy. It is just strange that it has seemed to work before updating kubeflow.

Thanks! This helped me a lot!

cybernagle commented 3 weeks ago

@suanshs Do you also have logs of the istio sidecar or do you have no istio deployed?

Hi, I'm facing the same issue with the istio-proxy sidecar injected, and setting the NO_PROXY environment variable does not fix it. :(

cybernagle commented 2 weeks ago

Hi Folks,

I was able to resolve the issue. The root cause was that I was using Istio sidecar injection for the workflow pods. However, during the init container stage, the kfp-launcher attempts to connect to the endpoint metadata-grpc-service.kubeflow:8080 before the istio-proxy is ready.

I found a related issue here: https://github.com/istio/istio/issues/23802. As suggested, adding the following annotation to the pod resolved the issue:

traffic.sidecar.istio.io/excludeOutboundPorts: "8080"

umka1332 commented 1 week ago

Sorry for the late response (fortunately I've gathered more knowledge about the topic since). There are multiple issues with proxies, and the fix depends on what you are trying to do.

One way is to not set a proxy at all, but then you need a custom base_image for your components that already has the kfp SDK installed, and you need to explicitly tell the components not to install the kfp SDK. By default a plain python:3.7 image is used and the kfp SDK is installed at runtime.

The other way is to set the proxy together with a no_proxy value appropriate for your cluster, but you also need to include ,.kubeflow and probably ,.kubeflow,local there. Note that in this case kubeflow is the namespace where kubeflow and/or ml-pipelines are installed. Also, whenever you set a proxy, always set both the uppercase and lowercase variants of the http_proxy, https_proxy and no_proxy env vars, just to be sure.
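The "set both cases" advice, together with the earlier suggestion to add the kube api-server IP, could be sketched roughly as below. `proxy_env` and `apply_env` are hypothetical helper names, and the proxy URL and IP in the usage comment are placeholders. The only KFP call assumed is `set_env_variable(name=..., value=...)`, as used in the pipeline snippet earlier in this thread. `KUBERNETES_SERVICE_HOST` is the variable Kubernetes injects into pods; it is read from the environment where this code runs, so pass `api_server_host` explicitly when compiling outside the cluster.

```python
import os


def proxy_env(proxy_url, no_proxy_entries, api_server_host=None):
    """Build lowercase and uppercase proxy variables with the same values."""
    if api_server_host is None:
        # Only set when this code itself runs inside a Kubernetes pod.
        api_server_host = os.environ.get("KUBERNETES_SERVICE_HOST")
    entries = list(no_proxy_entries)
    if api_server_host and api_server_host not in entries:
        entries.append(api_server_host)
    values = {
        "http_proxy": proxy_url,
        "https_proxy": proxy_url,
        "no_proxy": ",".join(entries),
    }
    env = {}
    for name, value in values.items():
        env[name] = value          # lowercase variant
        env[name.upper()] = value  # uppercase variant
    return env


def apply_env(task, env):
    """Set every variable on a task object exposing set_env_variable()."""
    for name, value in env.items():
        task.set_env_variable(name=name, value=value)
    return task


# Inside a @dsl.pipeline function (placeholder proxy URL and IP):
#   apply_env(hello_task, proxy_env("http://proxy.example:3128",
#                                   ["*.kubeflow", "*.local"],
#                                   api_server_host="10.0.0.1"))
```

Whether the "*." prefix or a bare leading dot is the right form for the entries depends on the client doing the proxy lookup, so the entry list is left to the caller.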