Open andre-lx opened 1 year ago
/assign @Linchin
Hi @andre-lx, thank you for bringing up this issue. I tried the same pipeline on a newly deployed 2.0.0 cluster, and the run finished without issue. Looking at the log you provided, we have:
github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000b29920, {0x20a4878, 0xc000058040}, 0x0, 0x0, {0x0, 0x0, 0xc000b60000?}, 0x4) /go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69
The metadata client seems to come from version 2.0.0-rc.2 instead of version 2.0.0. Could you double-check that you applied the manifest of version 2.0.0? Try applying the manifest again (here) and see if the issue persists.
Also, could you let me know how you deployed KFP: standalone or via Kubeflow?
Hi @Linchin, I just checked and we are using the following image: https://github.com/kubeflow/pipelines/blob/e03e31219387b587b700ba3e31a02df486aa364f/manifests/kustomize/base/metadata/base/kustomization.yaml#L10-L12
The deployment was done using the following file: https://github.com/kubeflow/pipelines/blob/2.0.0/manifests/kustomize/env/platform-agnostic-multi-user/kustomization.yaml
Thanks
Hi @andre-lx @Linchin, we are facing the same issue. Did you get a chance to fix it?
I had to revert it to 1.8.5 for now.
I have the same error. Here are the details.
I also have this issue in my Kubeflow 1.8 environment, which uses Pipelines backend 2.0.3.
I deployed my environment with the Kubeflow 1.8 manifests.
Can someone fix this issue?
I see the same issue on Kubeflow 1.8.
I have faced a similar issue. I have a full Kubeflow 1.8 environment installed, and the pipeline backend metadata envoy is version 2.0.3. Is this issue resolved?
I faced a similar issue, and it was caused by the proxy settings on the pod/step. After removing the proxy settings, the issue was gone.
@umka1332 This solved the problem for me as well. But do you know how I can still set the proxy env vars to connect to the internet?
Just tested successfully: setting NO_PROXY to '*.kubeflow,*.local' seems to work together with http(s)_proxy. It makes sense that the connection to ml-pipeline fails without NO_PROXY, because then all traffic is routed through the given proxy. It is just strange that it seemed to work before updating Kubeflow.
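To make the reasoning above concrete, here is a minimal sketch of suffix-based NO_PROXY matching. It is a simplified illustration, not the exact logic of the KFP launcher or of any particular HTTP client (clients differ in how they treat a leading `*.` or `.`), and the function name `bypasses_proxy` is made up for this example.

```python
def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if `host` should skip the proxy under a simplified NO_PROXY rule.

    Assumption: "*.kubeflow", ".kubeflow" and "kubeflow" are all treated as the
    same domain suffix. Real clients are not guaranteed to agree on this.
    """
    hostname = host.rsplit(":", 1)[0].lower()  # drop an optional :port
    for entry in no_proxy.split(","):
        entry = entry.strip().lower()
        if not entry:
            continue
        if entry == "*":
            return True  # a lone "*" conventionally means "never use the proxy"
        suffix = entry.lstrip("*").lstrip(".")
        if hostname == suffix or hostname.endswith("." + suffix):
            return True
    return False


if __name__ == "__main__":
    for target in ("ml-pipeline.kubeflow:8887", "pypi.org"):
        print(target, bypasses_proxy(target, "*.kubeflow,*.local"))
```

Under this rule, in-cluster names such as ml-pipeline.kubeflow skip the proxy while external hosts still go through it, which matches the behaviour described in the comment.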
If anyone following this can reliably reproduce this issue...
we always get the following error on the third pod that is started
I also need to see the log on the second pod (driver) that is started. Thanks.
How did you solve this? I tried to set the no_proxy environment variables but it did not work for me. @umka1332
It is important to set NO_PROXY (all uppercase). I also had to add the kube-apiserver IP to NO_PROXY.
Kubeflow 1.8.1 has the same problem.
I solved this problem by removing the proxy. You have to remove the proxy; if you need extra packages, build an image that already contains them and use that.
from kfp import dsl
from kfp import compiler

@dsl.component()
def say_hello():
    import time
    time.sleep(1900)
    hello_text = f'Hello!'
    print(hello_text)

@dsl.pipeline
def hello_pipeline():
    hello_task = say_hello()
    # Set both spellings of the variable on the task.
    hello_task.set_env_variable(name='NO_PROXY', value='*.kubeflow,*.local')
    hello_task.set_env_variable(name='no_proxy', value='*.kubeflow,*.local')
    hello_task.set_caching_options(False)

compiler.Compiler().compile(hello_pipeline, package_path='pipeline.yaml')
I tried running this, but it did not work for me. Is there something I am missing here? @pschoen-itsc @umka1332
@suanshs Seems like you are having a different problem. If you don't have any proxies set to begin with, then you also should not need the NO_PROXY settings. Can you provide logs of all the containers of the failing pod?
@pschoen-itsc Following are the logs from main container of the failing pod
time="2024-08-28T14:19:16.866Z" level=info msg="capturing logs" argo=true
time="2024-08-28T14:19:16.900Z" level=info msg="capturing logs" argo=true
I0828 14:19:16.922099 53 launcher_v2.go:90] input ComponentSpec:{
"executorLabel": "exec-say-hello"
}
I0828 14:19:16.922671 53 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x941c29]
goroutine 1 [running]:
github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000afc720, {0x20a4878, 0xc000196000}, 0x0, 0x0, {0x0, 0x0, 0xc0004dc000?}, 0x4)
/go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).publish(0x467387?, {0x20a4878?, 0xc000196000?}, 0x1?, 0x1?, {0x0?, 0x1a51660?, 0xc0004c6060?}, 0xbbfbb0?)
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:266 +0x9b
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute.func2()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:144 +0x65
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute(0xc000306460, {0x20a4878, 0xc000196000})
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:156 +0x91e
main.run()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:98 +0x3ed
main.main()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:47 +0x19
time="2024-08-28T14:19:17.903Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
time="2024-08-28T14:19:18.871Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
Following are the logs from wait container
time="2024-08-28T14:19:16.138Z" level=info msg="Starting Workflow Executor" executorType=emissary version=v3.3.10
time="2024-08-28T14:19:16.141Z" level=info msg="Creating a emissary executor"
time="2024-08-28T14:19:16.141Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-08-28T14:19:16.141Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=kubeflow podName=hello-pipeline-2clrb-1334336905 template="{\"name\":\"system-container-impl\",\"inputs\":{\"parameters\":[{\"name\":\"pod-spec-patch\",\"value\":\"{\\\"containers\\\":[{\\\"name\\\":\\\"main\\\",\\\"image\\\":\\\"docker-dev-artifactory.workday.com/ml/kubeflow/python-3.7:latest\\\",\\\"command\\\":[\\\"/var/run/argo/argoexec\\\",\\\"emissary\\\",\\\"--\\\",\\\"/kfp-launcher/launch\\\",\\\"--pipeline_name\\\",\\\"hello-pipeline\\\",\\\"--run_id\\\",\\\"5610709d-50b9-4833-8e2d-7e72a19a97ec\\\",\\\"--execution_id\\\",\\\"91\\\",\\\"--executor_input\\\",\\\"{\\\\\\\"inputs\\\\\\\":{},\\\\\\\"outputs\\\\\\\":{\\\\\\\"outputFile\\\\\\\":\\\\\\\"/tmp/kfp_outputs/output_metadata.json\\\\\\\"}}\\\",\\\"--component_spec\\\",\\\"{\\\\\\\"executorLabel\\\\\\\":\\\\\\\"exec-say-hello\\\\\\\"}\\\",\\\"--pod_name\\\",\\\"$(KFP_POD_NAME)\\\",\\\"--pod_uid\\\",\\\"$(KFP_POD_UID)\\\",\\\"--mlmd_server_address\\\",\\\"$(METADATA_GRPC_SERVICE_HOST)\\\",\\\"--mlmd_server_port\\\",\\\"tcp://10.100.242.77:8080\\\",\\\"--\\\"],\\\"args\\\":[\\\"sh\\\",\\\"-c\\\",\\\"\\\\nif ! 
[ -x \\\\\\\"$(command -v pip)\\\\\\\" ]; then\\\\n python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\\\\nfi\\\\n\\\\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'kfp==2.0.1' \\\\u0026\\\\u0026 \\\\\\\"$0\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"sh\\\",\\\"-ec\\\",\\\"program_path=$(mktemp -d)\\\\nprintf \\\\\\\"%s\\\\\\\" \\\\\\\"$0\\\\\\\" \\\\u003e \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"\\\\npython3 -m kfp.components.executor_main --component_module_path \\\\\\\"$program_path/ephemeral_component.py\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"\\\\nimport kfp\\\\nfrom kfp import dsl\\\\nfrom kfp.dsl import *\\\\nfrom typing import *\\\\n\\\\ndef say_hello() :\\\\n import time\\\\n time.sleep(1900)\\\\n hello_text = f'Hello, Suansh!'\\\\n print(hello_text)\\\\n\\\\n\\\",\\\"--executor_input\\\",\\\"{{$}}\\\",\\\"--function_to_execute\\\",\\\"say_hello\\\"],\\\"env\\\":[{\\\"name\\\":\\\"NO_PROXY\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"},{\\\"name\\\":\\\"no_proxy\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"}],\\\"resources\\\":{}}]}\"}]},\"outputs\":{},\"metadata\":{\"annotations\":{\"sidecar.istio.io/inject\":\"false\"}},\"container\":{\"name\":\"\",\"image\":\"gcr.io/ml-pipeline/should-be-overridden-during-runtime\",\"command\":[\"should-be-overridden-during-runtime\"],\"envFrom\":[{\"configMapRef\":{\"name\":\"metadata-grpc-configmap\",\"optional\":true}}],\"env\":[{\"name\":\"KFP_POD_NAME\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.name\"}}},{\"name\":\"KFP_POD_UID\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.uid\"}}}],\"resources\":{},\"volumeMounts\":[{\"name\":\"kfp-launcher\",\"mountPath\":\"/kfp-launcher\"}]},\"volumes\":[{\"name\":\"kfp-launcher\",\"emptyDir\":{}}],\"initContainers\":[{\"name\":\"kfp-launcher\",\"image\":\"gcr.io/ml-pipeline/kfp-launcher@sha256:80cf120abd125db84fa547640fd6386c4b2a26936e0c2b04a7d3634991a85
0a4\",\"command\":[\"launcher-v2\",\"--copy\",\"/kfp-launcher/launch\"],\"resources\":{\"limits\":{\"cpu\":\"500m\",\"memory\":\"128Mi\"},\"requests\":{\"cpu\":\"100m\"}},\"volumeMounts\":[{\"name\":\"kfp-launcher\",\"mountPath\":\"/kfp-launcher\"}]}],\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905\"}},\"podSpecPatch\":\"{\\\"containers\\\":[{\\\"name\\\":\\\"main\\\",\\\"image\\\":\\\"docker-dev-artifactory.workday.com/ml/kubeflow/python-3.7:latest\\\",\\\"command\\\":[\\\"/var/run/argo/argoexec\\\",\\\"emissary\\\",\\\"--\\\",\\\"/kfp-launcher/launch\\\",\\\"--pipeline_name\\\",\\\"hello-pipeline\\\",\\\"--run_id\\\",\\\"5610709d-50b9-4833-8e2d-7e72a19a97ec\\\",\\\"--execution_id\\\",\\\"91\\\",\\\"--executor_input\\\",\\\"{\\\\\\\"inputs\\\\\\\":{},\\\\\\\"outputs\\\\\\\":{\\\\\\\"outputFile\\\\\\\":\\\\\\\"/tmp/kfp_outputs/output_metadata.json\\\\\\\"}}\\\",\\\"--component_spec\\\",\\\"{\\\\\\\"executorLabel\\\\\\\":\\\\\\\"exec-say-hello\\\\\\\"}\\\",\\\"--pod_name\\\",\\\"$(KFP_POD_NAME)\\\",\\\"--pod_uid\\\",\\\"$(KFP_POD_UID)\\\",\\\"--mlmd_server_address\\\",\\\"$(METADATA_GRPC_SERVICE_HOST)\\\",\\\"--mlmd_server_port\\\",\\\"tcp://10.100.242.77:8080\\\",\\\"--\\\"],\\\"args\\\":[\\\"sh\\\",\\\"-c\\\",\\\"\\\\nif ! 
[ -x \\\\\\\"$(command -v pip)\\\\\\\" ]; then\\\\n python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\\\\nfi\\\\n\\\\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'kfp==2.0.1' \\\\u0026\\\\u0026 \\\\\\\"$0\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"sh\\\",\\\"-ec\\\",\\\"program_path=$(mktemp -d)\\\\nprintf \\\\\\\"%s\\\\\\\" \\\\\\\"$0\\\\\\\" \\\\u003e \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"\\\\npython3 -m kfp.components.executor_main --component_module_path \\\\\\\"$program_path/ephemeral_component.py\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"\\\\nimport kfp\\\\nfrom kfp import dsl\\\\nfrom kfp.dsl import *\\\\nfrom typing import *\\\\n\\\\ndef say_hello() :\\\\n import time\\\\n time.sleep(1900)\\\\n hello_text = f'Hello, Suansh!'\\\\n print(hello_text)\\\\n\\\\n\\\",\\\"--executor_input\\\",\\\"{{$}}\\\",\\\"--function_to_execute\\\",\\\"say_hello\\\"],\\\"env\\\":[{\\\"name\\\":\\\"NO_PROXY\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"},{\\\"name\\\":\\\"no_proxy\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"}],\\\"resources\\\":{}}]}\"}" version="&Version{Version:v3.3.10,BuildDate:2022-11-29T18:18:30Z,GitCommit:b19870d737a14b21d86f6267642a63dd14e5acd5,GitTag:v3.3.10,GitTreeState:clean,GoVersion:go1.17.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-08-28T14:19:16.141Z" level=info msg="Starting deadline monitor"
time="2024-08-28T14:19:18.142Z" level=info msg="Main container completed"
time="2024-08-28T14:19:18.142Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-08-28T14:19:18.142Z" level=info msg="Saving logs"
time="2024-08-28T14:19:18.142Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905/main.log"
time="2024-08-28T14:19:18.142Z" level=info msg="Creating minio client using static credentials" endpoint="minio.kubeflow:9000"
time="2024-08-28T14:19:18.142Z" level=info msg="Saving file to s3" bucket=mlpipeline endpoint="minio.kubeflow:9000" key=artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-08-28T14:19:18.151Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-08-28T14:19:18.151Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-08-28T14:19:18.151Z" level=info msg="No output parameters"
time="2024-08-28T14:19:18.151Z" level=info msg="No output artifacts"
time="2024-08-28T14:19:18.168Z" level=info msg="Create workflowtaskresults 201"
time="2024-08-28T14:19:18.169Z" level=info msg="Killing sidecars []"
time="2024-08-28T14:19:18.169Z" level=info msg="Alloc=6749 TotalAlloc=12722 Sys=24786 NumGC=4 Goroutines=9"
Following are the logs from
@suanshs Do you also have logs of the istio sidecar or do you have no istio deployed?
Setting NO_PROXY to '*.kubeflow,*.local' together with http(s)_proxy, as described earlier in this thread, helped me a lot. Thanks!
Hi, I'm facing the same issue with the istio-proxy sidecar injected, and setting the NO_PROXY environment variable did not fix it. :(
Hi Folks,
I was able to resolve the issue. The root cause was that I was using Istio sidecar injection for the workflow pods. During the init container stage, the kfp-launcher attempts to connect to the endpoint metadata-grpc-service.kubeflow:8080 before the istio-proxy is ready.
I found a related issue here: https://github.com/istio/istio/issues/23802. As suggested there, adding the following annotation to the pod resolved the issue:
traffic.sidecar.istio.io/excludeOutboundPorts: "8080"
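For readers unsure where this key goes: Istio reads it from the pod's metadata.annotations, and its value is a comma-separated list of ports. The sketch below only shows that placement on a plain dictionary version of a pod manifest. The helper name `exclude_outbound_port` and the example manifest are hypothetical, and how the annotation actually gets onto KFP workflow pods depends on your setup.

```python
import copy

ANNOTATION = "traffic.sidecar.istio.io/excludeOutboundPorts"


def exclude_outbound_port(pod: dict, port: int) -> dict:
    """Return a copy of a pod manifest with `port` added to the exclusion annotation."""
    patched = copy.deepcopy(pod)  # leave the caller's manifest untouched
    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    # Keep any ports that are already excluded and avoid duplicates.
    ports = [p for p in annotations.get(ANNOTATION, "").split(",") if p]
    if str(port) not in ports:
        ports.append(str(port))
    annotations[ANNOTATION] = ",".join(ports)
    return patched


if __name__ == "__main__":
    example_pod = {"metadata": {"name": "example-step"}, "spec": {}}  # hypothetical
    print(exclude_outbound_port(example_pod, 8080)["metadata"])
```

Port 8080 is the metadata-grpc-service port mentioned above; other ports are only excluded if you add them explicitly.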
Sorry for the late response (fortunately, I've gathered more knowledge about the topic now).
There are multiple issues with proxies, and it depends on what you are trying to do.
One way is to not set a proxy, but then you need to use a custom base_image for components that already includes the kfp SDK, and to explicitly tell components not to install the kfp SDK. By default, a plain python:3.7 image is used and the kfp SDK is installed at runtime.
The other way is to set the proxy and a no_proxy value appropriate for your cluster, but you also need to include ,.kubeflow and probably ,.kubeflow,local there. Please note that in this case kubeflow is the namespace where Kubeflow and/or ml-pipelines are installed.
Also, whenever you set a proxy, always set both the uppercase and lowercase variants of the http_proxy, https_proxy and no_proxy env vars, just to be sure.
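The last point can be sketched as a small helper that emits every proxy variable in both spellings. This is only an illustration: the helper name `proxy_env_vars` and the proxy URL are placeholders, and the no_proxy entries must be adapted to your cluster.

```python
def proxy_env_vars(http_proxy: str, https_proxy: str, no_proxy_entries: list) -> dict:
    """Build proxy environment variables in lowercase and uppercase variants."""
    base = {
        "http_proxy": http_proxy,
        "https_proxy": https_proxy,
        "no_proxy": ",".join(no_proxy_entries),
    }
    env = {}
    for name, value in base.items():
        env[name] = value          # lowercase variant
        env[name.upper()] = value  # uppercase variant
    return env


if __name__ == "__main__":
    env = proxy_env_vars(
        "http://proxy.example.com:3128",  # placeholder proxy URL
        "http://proxy.example.com:3128",
        [".kubeflow", ".local"],
    )
    for name, value in sorted(env.items()):
        print(f"{name}={value}")
```

Each resulting pair could then be applied to a task with set_env_variable(name=..., value=...), as in the pipeline snippet posted earlier in this thread.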
Environment
Steps to reproduce
Hello, we are trying to migrate from Pipelines 1.8.5 to 2.0.0, but after applying the manifests we are having some issues.
Running the "hello world" example from JupyterLab:
Or running the generated pipeline.yaml from the result directly through the UI, we always get the following error on the third pod that is started:
The service ml-pipeline.kubeflow:8887 exists.
Everything works great on version 1.8.5.
If you need the logs from the other two pods, please let me know. I also checked the logs of all the Kubeflow services and couldn't find any issue.
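Since the service existing does not by itself prove that the step pod can reach it, a plain TCP reachability check run from inside a pod in the same namespace can help narrow things down. The sketch below is an assumption about how one might test this, not an official KFP diagnostic, and `can_connect` is a made-up helper name. A raw socket ignores the proxy environment variables, so it tests the direct network path only.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a direct TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

Inside the pod, calling can_connect("ml-pipeline.kubeflow", 8887) or can_connect("metadata-grpc-service.kubeflow", 8080) indicates whether the endpoints named in this thread are reachable without a proxy. If they are, but the launcher still fails, the proxy or sidecar configuration becomes the more likely suspect.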
Impacted by this bug? Give it a 👍.