kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

Kubeflow Pipelines timing out on Azure deployment #4407

Closed maganaluis closed 2 years ago

maganaluis commented 3 years ago

What steps did you take:

  1. Run an in-house Kubeflow notebook template.
  2. Get a timeout when running the pipeline.

What happened:

When submitting a pipeline I'm getting a timeout; this has been happening sporadically.

# Run the pipeline on Kubeflow cluster
pipeline_run = (
    kfp
    .Client(host=f'{host}/pipeline', cookies=cookies)
    .create_run_from_pipeline_func(
        pipeline,
        arguments={},
        experiment_name=experiment_name,
        namespace=namespace,
        run_name=pipeline_name
    )
)
/opt/conda/lib/python3.7/site-packages/kfp_server_api/rest.py in request(self, method, url, query_params, headers, body, post_params, _preload_content, _request_timeout)
    236 
    237         if not 200 <= r.status <= 299:
--> 238             raise ApiException(http_resp=r)
    239 
    240         return r

ApiException: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'content-length': '24', 'content-type': 'text/plain', 'date': 'Mon, 24 Aug 2020 20:24:57 GMT', 'server': 'envoy', 'x-envoy-upstream-service-time': '300028'})
HTTP response body: upstream request timeout

What did you expect to happen:

Consistent behavior from the Kubeflow Pipelines API, i.e. runs created without gateway timeouts.

Environment:

Azure AKS

How did you deploy Kubeflow Pipelines (KFP)?

KFP version: 1.0.0

KFP SDK version: 1.0.0

Anything else you would like to add:

I'm looking for a way to tune the TCP keep-alive on Kubeflow Pipelines; it's hard to tell whether this error comes from Kubeflow Pipelines or Argo. On the Kubeflow Pipelines API these calls hung for a while and never seemed to release:

I0824 20:20:00.099206       6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072
I0824 20:21:58.067818       6 interceptor.go:29] /api.RunService/CreateRun handler starting
I0824 20:21:58.753968       6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072
I0824 20:23:58.117694       6 interceptor.go:29] /api.RunService/CreateRun handler starting
I0824 20:23:58.798816       6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072

We know the Azure platform has issues with idle connections to the Kubernetes API and that you have to tweak the TCP keep-alive in the applications, so perhaps that could be a solution here.

/kind bug

maganaluis commented 3 years ago

For contrast, we had to modify the Jupyter Web API on Kubeflow to avoid timeouts by adding the code below before instantiating the Kubernetes API client. This also solved issues with other applications relying on that API, such as Airflow and JupyterHub. We are wondering whether this will also be an issue with KFP.

import socket
from urllib3 import connection

# Workaround for the Azure load balancer dropping idle connections:
# enable TCP keep-alive on every urllib3 connection, send the first probe
# after 60s of idle time, probe every 60s, and give up after 3 failed probes.
connection.HTTPConnection.default_socket_options += [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60),
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
]
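
The same socket options could in principle be applied on the KFP SDK side too, since the kfp_server_api client shown in the traceback above is urllib3-based. A minimal sketch, not from the thread; host and cookies are the same placeholders as in the snippet at the top of this issue:

import socket
import kfp
from urllib3 import connection

# Apply the keep-alive options process-wide before the KFP client opens any
# connections through the Azure load balancer.
connection.HTTPConnection.default_socket_options += [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60),
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
]

client = kfp.Client(host=f'{host}/pipeline', cookies=cookies)
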
dtzar commented 3 years ago

@maganaluis What version / configuration of AKS are you using?

Do you experience the same problem on KFP 1.1?

maganaluis commented 3 years ago

@dtzar We are using AKS with Kubernetes version 1.16.10.

I don't think there is a KFP 1.1 (Kubeflow Pipelines) release; we are using their latest image tag. We are, however, on Kubeflow version 1.1.

I just reviewed the latest logs, and as I mentioned, the issue could be related to timeouts on the Argo API rather than KFP.

I0824 20:35:44.840283       6 error.go:218] Post https://10.0.0.1:443/apis/argoproj.io/v1alpha1/namespaces/e23e799e-de9b-4388-99e7-8efe8ab6c072/workflows: unexpected EOF

I'm attaching the logs for reference.

ml-pipeline.log

dtzar commented 3 years ago

I'm curious whether you'd hit the same problem if you tried our sample repo install process (for KFP and/or AKS, which uses K8s 1.18.x and some of the newer AKS features). I do know that the 1.1 RC manifests use a very old Argo workflow controller, version 2.3.0, which has a bunch of problems.

Ark-kun commented 3 years ago

I do know that 1.1 RC manifests use a very old argo workflow controller of version 2.3.0 which has a bunch of problems.

/cc @Bobgy @rmgogogo

maganaluis commented 3 years ago

@dtzar I can definitely try that. I will start by running a k8s API test on version 1.18.x; I'm hoping that version allows the job below to pass. Modifying the Argo installation will take some time, but we can try that as well.

https://github.com/maganaluis/k8s-api-job/blob/master/job.py

Update 08/25/2020:

That job still times out on 1.18.x; we will update Argo on our Dev cluster.
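
For readers without access to the linked repo, here is a hypothetical sketch of this kind of probe (not the actual job.py): call the Kubernetes API, sit idle long enough for an intermediate load balancer to drop the connection, then call it again.

import time
from kubernetes import client, config

# Illustrative only: the namespace and iteration count are arbitrary.
config.load_incluster_config()      # the probe runs as an in-cluster Job
v1 = client.CoreV1Api()

for i in range(5):
    pods = v1.list_namespaced_pod(namespace="default")
    print(f"iteration {i}: listed {len(pods.items)} pods")
    time.sleep(5 * 60)              # sit idle between API calls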

maganaluis commented 3 years ago

@dtzar I upgraded Argo to 2.10.0. Interestingly, the only thing I had to change in the cluster-wide installation was the workflow-controller-configmap, and Kubeflow Pipelines worked as expected. However, the timeouts are still there and they are getting more consistent; maybe I'm going crazy here. Could you share the configuration you're using? Maybe VM types? Istio version?

dtzar commented 3 years ago

https://github.com/kaizentm/manifests/blob/eedorenko/kfdef-azure/kfdef/kfctl_azure.v1.1.0.yaml

maganaluis commented 3 years ago

So it turns out this was an issue with AKS; I developed a simple Kubernetes API job to prove this:

https://github.com/maganaluis/k8s-api-golang/blob/time-out/analysis.md

The job uses just about the same library versions as the Kubeflow Pipelines API, and it uses the Argo client to submit a small workflow, mimicking what KFP does here. The job stays idle for 5 minutes between submissions, which is normal behavior; you don't expect the API to be used 100 percent of the time.

If you run this job without the Istio sidecar enabled, it completes, because Go's TCP keep-alive settings are fairly robust. However, the Istio sidecar alters these settings and probably falls back to the Linux defaults.
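
For context on what those Linux defaults look like, they can be read straight from the proc filesystem on the node or inside a container. A quick illustrative sketch; the values quoted in the comment are the usual stock defaults, not measured from this cluster:

# Print the kernel TCP keep-alive defaults that an application (or sidecar)
# falls back to when it does not set SO_KEEPALIVE options itself.
for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        print(name, "=", f.read().strip())
# Stock defaults are usually 7200 / 75 / 9: the first probe is only sent after
# two hours of idle time, far longer than the roughly 4-minute idle timeout of
# the Azure load balancer, so idle connections get dropped before any probe.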

This is not to say that the fault here is with the Istio sidecar, because we ran the same job on AWS and GCP and it passed without timeouts or dropped connections. Regardless, the sidecar is required for multi-user mode, so it must be enabled.

Istio provides a quite powerful tool that lets you set the TCP keep-alive for traffic going to any service in Kubernetes. You can read more about it here:

https://preliminary.istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings

So the solution here is to set up a DestinationRule for the Kubernetes API:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: kubernetes-api
spec:
  host: "kubernetes.default.svc.cluster.local"   # traffic to the Kubernetes API
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 10s
        tcpKeepalive:
          time: 75s        # idle time before the first keep-alive probe
          interval: 75s    # interval between subsequent probes
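
If the rule is managed from Python rather than kubectl, the same manifest could be created through the Kubernetes custom objects API. A sketch only; the target namespace below is an assumption, not something specified in this thread:

from kubernetes import client, config

# Create the DestinationRule above programmatically instead of `kubectl apply`.
config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="networking.istio.io",
    version="v1alpha3",
    namespace="istio-system",   # assumption: wherever mesh-wide rules live
    plural="destinationrules",
    body={
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "DestinationRule",
        "metadata": {"name": "kubernetes-api"},
        "spec": {
            "host": "kubernetes.default.svc.cluster.local",
            "trafficPolicy": {
                "connectionPool": {
                    "tcp": {
                        "connectTimeout": "10s",
                        "tcpKeepalive": {"time": "75s", "interval": "75s"},
                    }
                }
            },
        },
    },
)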

I would like to see this rule added at the Kubeflow installation level to ensure Kubeflow works on any cloud platform. It will not only solve timeouts with Kubeflow Pipelines but also with any other service that uses the Kubernetes API. Anyway, I'm leaving this here in case anyone else stumbles upon the same issue.

@Ark-kun @Bobgy @rmgogogo @dtzar

Bobgy commented 3 years ago

@maganaluis thanks for the investigation!

I'm okay with adding the destination rule. It doesn't seem harmful to other platforms. Would you mind opening a PR for it?

maganaluis commented 3 years ago

@Bobgy Sounds good, I'll make the PR. We can always tune the settings on the tcp keep alive to ensure it works across all platforms.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

berndverst commented 3 years ago

The PR is currently being reviewed so there should be an update here soon.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.