For context, we had to modify the Jupyter Web API on Kubeflow to avoid timeouts by adding the code below before instantiating the Kubernetes API client. This also solved issues with other applications relying on that API, such as Airflow and JupyterHub. We are wondering whether this will also be an issue with KFP.
import socket
from urllib3 import connection
# workaround for azure load balancer issue
connection.HTTPConnection.default_socket_options += [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60),
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
]
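For reference, the effect of those options can be checked on a plain socket using only the standard library; a minimal sketch (the numeric values mirror the workaround above, and `TCP_KEEPIDLE`/`TCP_KEEPINTVL`/`TCP_KEEPCNT` are Linux-specific):

```python
import socket

def tune_keepalive(sock):
    """Apply the same keep-alive tuning as the urllib3 workaround (Linux only)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # enable keep-alive probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)  # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60) # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # failed probes before dropping

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tune_keepalive(s)
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))  # read back the idle setting
s.close()
```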
@maganaluis What version / configuration of AKS are you using?
Do you experience the same problem on KFP 1.1?
@dtzar We are using AKS with Kubernetes version 1.16.10
I don't think there is a KFP 1.1 (Kubeflow Pipelines); we are using their latest image tag. However, we are on Kubeflow version 1.1.
I just reviewed the latest logs, and as I mentioned, it could be related to timeouts on the Argo API rather than KFP.
I0824 20:35:44.840283 6 error.go:218] Post https://10.0.0.1:443/apis/argoproj.io/v1alpha1/namespaces/e23e799e-de9b-4388-99e7-8efe8ab6c072/workflows: unexpected EOF
I'm attaching the logs for reference.
I'm curious whether, if you tried our sample repo install process, you'd have the same problem (for KFP and/or AKS, which uses K8s 1.18.x and some other recent AKS features). I do know that the 1.1 RC manifests use a very old Argo workflow controller, version 2.3.0, which has a bunch of problems.
/cc @Bobgy @rmgogogo
@dtzar I can definitely try that. I will start by running a K8s API test on version 1.18.x; I'm hoping that version allows the job below to pass. Modifying the Argo installation will take some time, but we can try that as well.
https://github.com/maganaluis/k8s-api-job/blob/master/job.py
Update 08/25/2020:
That job still times out on 1.18.X, we will update Argo on our Dev cluster.
@dtzar I upgraded Argo to 2.10.0. Interestingly, the only thing I had to change in the cluster-wide installation was the workflow-controller-configmap, and Kubeflow Pipelines worked as expected. However, the timeouts are still there, and they are getting more consistent; maybe I'm going crazy here. Could you share the configuration you're using? Maybe VM types? Istio version?
So it turns out this was an issue with AKS; I developed a simple Kubernetes API job to prove it:
https://github.com/maganaluis/k8s-api-golang/blob/time-out/analysis.md
The job uses just about the same library versions as the Kubeflow Pipelines API, and it uses the Argo client to submit a small workflow, mimicking what KFP does. The job stays idle for 5 minutes between submissions, which is normal behavior; you don't expect the API to be used 100 percent of the time.
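The submit-then-idle pattern the job follows can be sketched as below. This is illustrative only: the `submit_workflow` callable is a hypothetical stand-in for the Argo client call, and the sleep is injectable so the sketch runs without actually waiting 5 minutes.

```python
import time

def run_submission_loop(submit_workflow, idle_seconds=300, iterations=3, sleep=time.sleep):
    """Submit a workflow, then sit idle between submissions, mimicking the test job.

    During each idle period the underlying TCP connection carries no traffic,
    which is exactly when an aggressive load-balancer idle timeout bites unless
    keep-alive probes keep the connection open.
    """
    results = []
    for _ in range(iterations):
        results.append(submit_workflow())  # would call the Argo client in the real job
        sleep(idle_seconds)                # connection sits idle here
    return results

# Example with a stub submitter and no real sleeping:
print(run_submission_loop(lambda: "submitted", sleep=lambda s: None))
```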
If you run this job without the Istio sidecar, it completes, because Go's TCP keep-alive defaults are fairly robust. However, the Istio sidecar alters these settings and probably falls back to the Linux defaults.
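Those Linux defaults are far longer than a typical cloud load balancer's idle timeout (Azure's default idle timeout is around 4 minutes, while the kernel's `tcp_keepalive_time` defaults to 7200 seconds, so no probe is ever sent before the connection is silently dropped). On a Linux host the defaults can be inspected via procfs; a quick sketch, assuming those sysctl files are present:

```python
from pathlib import Path

def read_keepalive_defaults():
    """Read the kernel's TCP keep-alive defaults from procfs (Linux only)."""
    base = Path("/proc/sys/net/ipv4")
    return {name: int((base / f"tcp_keepalive_{name}").read_text())
            for name in ("time", "intvl", "probes")}

print(read_keepalive_defaults())  # e.g. time is 7200 seconds on a stock kernel
```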
This is not to say the fault lies with the Istio sidecar: we ran the same job on AWS and GCP, and it passed without timeouts or dropped connections. Regardless, the sidecar is required for multi-user mode, so it must be enabled.
Istio provides a quite powerful mechanism that lets you set TCP keep-alive for traffic going to any service in Kubernetes. You can read more about it here:
So the solution here is to set up a DestinationRule for the Kubernetes API:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: kubernetes-api
spec:
  host: "kubernetes.default.svc.cluster.local"
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 10s
        tcpKeepalive:
          time: 75s
          interval: 75s
I would like to see this rule applied at the Kubeflow installation level to ensure Kubeflow works on any cloud platform. This will not only solve timeouts with Kubeflow Pipelines but also with any other service that uses the Kubernetes API. Anyway, I'm leaving this here in case anyone else stumbles upon the same issue.
@Ark-kun @Bobgy @rmgogogo @dtzar
@maganaluis thanks for the investigation!
I'm okay with adding the destination rule. It doesn't seem harmful to other platforms. Would you mind opening a PR for it?
@Bobgy Sounds good, I'll make the PR. We can always tune the settings on the tcp keep alive to ensure it works across all platforms.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
The PR is currently being reviewed so there should be an update here soon.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
What steps did you take:
What happened:
When submitting a pipeline I get a timeout; this has been happening sporadically.
What did you expect to happen:
Consistency with the Kubeflow Pipelines API
Environment:
Azure AKS
How did you deploy Kubeflow Pipelines (KFP)?
KFP version: 1.0.0
KFP SDK version: 1.0.0
Anything else you would like to add:
I'm looking for a way to manipulate TCP keep-alive on Kubeflow Pipelines; it's hard to tell whether this error is in Kubeflow Pipelines or Argo. On the Kubeflow Pipelines API these calls hung for a while and never seemed to release:
We know the Azure platform has issues with the Kubernetes API and that you have to tweak TCP keep-alive in applications, so perhaps this could be a solution.
/kind bug