jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html

Turing unhealthy #2252

Closed by minrk 1 year ago

minrk commented 2 years ago

The turing cluster is unhealthy for a couple of reasons, causing health check failures and various errors.

There appear to be at least two issues:

  1. networking, with lots of internal DNS failures and connection timeouts.
    Error: getaddrinfo EAI_AGAIN proxy-patches
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:69:26) {
    errno: -3001,
    code: 'EAI_AGAIN',
    syscall: 'getaddrinfo',
    hostname: 'proxy-patches'
    }
  2. evictions, apparently tied to disk pressure:

    Warning  Evicted    26m   kubelet            The node was low on resource: ephemeral-storage. Container notebook was using 84Ki, which exceeds its request of 0.
    Warning  Evicted    20s   kubelet            The node had condition: [DiskPressure].

I'm not sure if this is just an unhealthy node or system pod, but I haven't been able to track it down. For now, turing is removed from the federation-redirect while we figure it out. @callummole
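
For anyone poking at the DNS side, the quickest check is to try resolving the failing name from a throwaway pod inside the cluster (a sketch; the namespace is a placeholder, and busybox:1.28 is just a small image with a working nslookup):

    kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -n <namespace> -- \
      sh -c 'nslookup proxy-patches; nslookup kubernetes.default'

If kubernetes.default resolves but proxy-patches fails intermittently, that points at flaky cluster DNS rather than a missing service.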

minrk commented 2 years ago

Since we've had evictions due to disk pressure on two nodes, I think the IOPS throttling issues discussed in https://github.com/Azure/AKS/issues/1320 may be relevant. It could be worth checking the cluster logs in Azure to see if there's anything spamming warnings or errors that could fill up the disk with logs.

@callummole it may be worth checking out the quota issues mentioned here: https://github.com/Azure/AKS/issues/1373 . It's possible that picking a new node configuration is the action to take, I'm not sure.
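
In the meantime, a quick way to see which nodes are reporting DiskPressure and what has been evicted (a sketch, assuming kubectl access to the cluster):

    # node name and DiskPressure status, one per line
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'

    # recent eviction events, most recent last
    kubectl get events -A --field-selector reason=Evicted --sort-by=.lastTimestamp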

minrk commented 2 years ago

I think this may be related to the updates to the image-cleaner. Working on a possible fix.

callummole commented 2 years ago

I've had a look on Azure at some of the virtual machine scale set metrics shown in #1373. There are some spikes in usage, but neither %CPU nor %IOPS reaches a level one would be concerned about (max CPU used is about 25%, max IOPS consumed is about 1.5%).

minrk commented 2 years ago

Strange. I can't think of what else could cause intermittent DNS issues. Recycling all nodes might help, but it hasn't seemed to yet.
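
For the record, recycling a node amounts to roughly this (a sketch; the node name is a placeholder, and on AKS the underlying VMSS instance still has to be removed separately, e.g. via the portal or the az CLI):

    kubectl cordon <node-name>
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data   # --delete-local-data on older kubectl
    kubectl delete node <node-name>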

minrk commented 2 years ago

I spent quite a bit of time poking around and found various issues, such as https://github.com/alan-turing-institute/hub23-deploy/pull/445 and https://medium.com/@amirilovic/how-to-fix-node-dns-issues-5d4ec2e12e95

I didn't deploy node-local-dns, but I doubt that would fix the problem since we have both DNS issues and connection timeout issues (connection timeout can explain DNS, but I don't think DNS can explain connection timeout).

I can't seem to track down what's causing the issues. The network just seems fundamentally unreliable under negligible load. There are no errors or warnings in any networking component that I can find, nothing restarting, etc.
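
For concreteness, "no errors or warnings, nothing restarting" means checks along these lines (a sketch; the label selectors are what I'd expect on AKS and may need adjusting):

    # restarts / crashes in the system networking components
    kubectl get pods -n kube-system -o wide | egrep 'coredns|kube-proxy|azure|calico'

    # skim recent coredns logs for errors
    kubectl logs -n kube-system -l k8s-app=kube-dns --since=1h | grep -i error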

The symptoms are exactly what's described at https://github.com/Azure/AKS/issues/1320#issuecomment-566166756, but if we aren't eating up IOPS, I don't know what could be the issue.

I'm out of ideas other than 'start a new cluster' since even starting entirely new nodes didn't help. Does Turing have a support contract with Azure they can rely on?

callummole commented 2 years ago

I am considering a redeploy to UK South, since we could then take advantage of some reserved instances (and reduce cost).

So perhaps now would be a good time to do that. Perhaps with new node configs.

As for the IOPS issue, there are certainly spikes, it's just that the load is always low. Perhaps it doesn't take much to cause the issue.

minrk commented 2 years ago

OK, if deploying a new cluster is desirable for other reasons, let's maybe do that and hope the issue gets fixed and we never need to understand it.

callummole commented 2 years ago

Post-deployment, we are still having intermittent issues, where turing.mybinder.org sometimes shows a 504 Gateway Timeout. My investigation so far:

This seems to coincide with log entries like:

[W 2022-08-15 16:11:30.253 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-08-15 16:11:40.385 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-08-15 16:12:00.601 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[E 2022-08-15 16:12:00.606 JupyterHub ioloop:761] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f05db27bdf0>>, <Task finished name='Task-12128' coro=<JupyterHub.update_last_activity() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/app.py:2666> exception=TimeoutError('Repeated api_request to proxy path "" failed.')>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 741, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 765, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/app.py", line 2668, in update_last_activity
        routes = await self.proxy.get_all_routes()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 898, in get_all_routes
        resp = await self.api_request('', client=client)
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 862, in api_request
        result = await exponential_backoff(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Repeated api_request to proxy path "" failed.
16:11:00.278 [ConfigProxy] error: Failed to get custom error page: Error: connect ETIMEDOUT 10.0.84.44:80
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1148:16) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'connect',
  address: '10.0.84.44',
  port: 80
}
16:11:00.278 [ConfigProxy] error: Failed to get custom error page: Error: connect ETIMEDOUT 10.0.84.44:80
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1148:16) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'connect',
  address: '10.0.84.44',
  port: 80
}
16:11:00.278 [ConfigProxy] error: 503 GET /user/ines-spacy-course-j0l8zx6m/api/kernels/de027600-a0a4-4b2a-aace-361f98171436/channels connect ETIMEDOUT 10.244.1.92:8888
16:11:55.574 [ConfigProxy] error: 503 GET /user/ines-spacy-course-grkob94g/api/kernels/e1a7749a-5700-491b-b9e1-0ad8c3ed290d/channels read EHOSTUNREACH
16:11:57.622 [ConfigProxy] error: 503 GET /user/ines-spacy-course-0go4ikfz/api/kernels/7e7c12fb-e0f3-4796-9bbc-747f82e32efe/channels read EHOSTUNREACH
16:12:05.041 [ConfigProxy] error: 503 GET /user/holoviz-panel-hp8okeac/lab/extensions/@voila-dashboards/jupyterlab-preview/static/remoteEntry.ec13ba4c2bfbf9fe7e8a.js socket hang up
16:13:05.043 [ConfigProxy] error: 503 GET /user/holoviz-panel-hp8okeac/lab/extensions/@voila-dashboards/jupyterlab-preview/static/remoteEntry.ec13ba4c2bfbf9fe7e8a.js socket hang up
16:13:21.190 [ConfigProxy] error: 503 GET /user/ipython-ipython-in-depth-y9gpcg8f/static/base/images/favicon-busy-1.ico socket hang up

We are also getting the same errors as in @minrk's original post:

16:32:02.803 [ConfigProxy] error: 503 GET /user/gromacs-online---intro-tutorial-jlyknr1m/api/metrics/v1 Error: getaddrinfo EAI_AGAIN proxy-patches
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:69:26) {
  errno: -3001,
  code: 'EAI_AGAIN',
  syscall: 'getaddrinfo',
  hostname: 'proxy-patches'
}

Errors like these also appear in the ingress controller logs:

2022/08/15 16:24:41 [error] 433#433: *8198296 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.104.63.72, server: hub.mybinder.turing.ac.uk, request: "GET /user/hanlpbot-hanlp-binder-4l9ipomc/api/kernels/916acfec-876b-4d01-a5f3-2fd2d3ec7a95/channels?session_id=1c240255-84db-4c0e-97b8-ec5b3d4fe519&token=6Sy10PBeQOeDSEbbZwSjUw HTTP/1.1", upstream: "http://10.244.0.24:8000/user/hanlpbot-hanlp-binder-4l9ipomc/api/kernels/916acfec-876b-4d01-a5f3-2fd2d3ec7a95/channels?session_id=1c240255-84db-4c0e-97b8-ec5b3d4fe519&token=6Sy10PBeQOeDSEbbZwSjUw", host: "hub.mybinder.turing.ac.uk"

The nodes are now higher spec than before, so we can rule that out as the fix.
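
To check whether this is plain pod-to-pod connectivity rather than anything JupyterHub-specific, a throwaway pod can poke the addresses from the errors above directly (a sketch; the namespace is a placeholder, busybox is just a small image with wget, and the IPs are taken from the proxy errors, so they may have been recycled by now):

    kubectl run -it --rm nettest --image=busybox:1.36 --restart=Never -n <namespace> -- sh -c '
      # a fast HTTP error still means the connection worked; what matters is hangs/timeouts
      wget -qO- -T 5 http://10.0.84.44:80/    >/dev/null; echo "svc exit=$?"
      wget -qO- -T 5 http://10.244.1.92:8888/ >/dev/null; echo "pod exit=$?"
    '

Intermittent timeouts here would confirm the network itself is flaky, independent of the hub and proxy.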

callummole commented 2 years ago

After some googling, I've learned that CPU throttling can be a property of the container rather than of the node it is running on, so it might not show up in the Azure metrics I looked at before. Similar issues are happening on the turing-staging cluster, so it could be some interaction between a container and Azure. I'm not entirely sure what the next investigation step is there.

We could also try node-local-dns. I know you were skeptical, @minrk, but it seems worth a try given that it is suggested in a few cases (e.g. 1, 2), and I'm currently not sure what else to try.

manics commented 2 years ago

What CNI are you using? https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69 mentions problems with DNS race conditions and NodeJS, similar to your other links, but it also says:

In January 2021 Microsoft released an update to the default settings for Azure CNI which changes the default to use Transparent Mode. This new default is applied to all new clusters, and is automatically applied to any existing clusters when they are upgraded. This change places Pod to Pod communication over layer 3 (L3) instead of the old layer 2 (L2) bridge, using IP routes which the CNI adds. The benefits look very promising, the main ones related to our discussion being:

  - It provides mitigation for the conntrack race conditions we’ve discussed without the need to set up NodeLocal DNSCache.
  - It eliminates some additional 5s DNS latency which the old Bridge mode introduced due to a ‘just in time’ setup.

callummole commented 2 years ago

The cluster network uses kubenet, not Azure CNI: https://docs.microsoft.com/en-us/azure/aks/concepts-network#azure-virtual-networks.

From reading around, it seems that specifying Azure CNI (where every pod gets allocated an IP) is preferred, though I don't see why using kubenet would cause our error.
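
The plugin in use can be confirmed straight from the az CLI (a sketch; the resource group and cluster name are placeholders):

    az aks show --resource-group <resource-group> --name <cluster-name> \
      --query networkProfile.networkPlugin --output tsv   # prints "kubenet" or "azure"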

callummole commented 2 years ago

Did some more unfruitful debugging today. It resulted in two changes, but neither of them seems to have helped. Logging it here in case it is useful.

1) ClusterRoles:

I was following the Kubernetes docs on Debugging DNS Resolution. I noticed that the coredns ClusterRole does not have the expected 'nodes: get' permission.

➜  ~ k describe clusterrole system:coredns -n kube-system
Name:         system:coredns
Labels:       addonmanager.kubernetes.io/mode=Reconcile
              kubernetes.io/bootstrapping=rbac-defaults
Annotations:  <none>
PolicyRule:
  Resources                        Non-Resource URLs  Resource Names  Verbs
  ---------                        -----------------  --------------  -----
  endpoints                        []                 []              [list watch]
  namespaces                       []                 []              [list watch]
  pods                             []                 []              [list watch]
  services                         []                 []              [list watch]
  endpointslices.discovery.k8s.io  []                 []              [list watch]

Example coredns clusterrole spec given here: https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/deployments/coredns-1.7.0.yaml

So I changed the ClusterRole to look as the docs suggest:

➜  ~ k describe clusterrole system:coredns -n kube-system
Name:         system:coredns
Labels:       addonmanager.kubernetes.io/mode=Reconcile
              kubernetes.io/bootstrapping=rbac-defaults
Annotations:  <none>
PolicyRule:
  Resources                        Non-Resource URLs  Resource Names  Verbs
  ---------                        -----------------  --------------  -----
  nodes                            []                 []              [get]
  endpoints                        []                 []              [list watch]
  namespaces                       []                 []              [list watch]
  pods                             []                 []              [list watch]
  services                         []                 []              [list watch]
  endpointslices.discovery.k8s.io  []                 []              [list watch]

Though this hasn't helped.
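
The added rule amounts to something like this patch (a sketch; note the addonmanager.kubernetes.io/mode=Reconcile label on the role, which may mean AKS reconciles manual edits away again):

    kubectl patch clusterrole system:coredns --type=json \
      -p='[{"op": "add", "path": "/rules/-", "value": {"apiGroups": [""], "resources": ["nodes"], "verbs": ["get"]}}]'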

2) Added the subnet to the AKS network security group.

This Microsoft ticket presents very similar symptoms. Our coredns logs show the following warnings:

[WARNING] No files matching import glob pattern: custom/*.override
[WARNING] No files matching import glob pattern: custom/*.override

Note: the warnings are likely due to missing import files and are unimportant; see coredns issue #3600.

But there are also errors in the logs, like:

[ERROR] plugin/errors: 2 www.hanlp.com. A: read udp 10.244.1.236:58932->168.63.129.16:53: i/o timeout
W0816 15:22:54.664468       1 reflector.go:442] pkg/mod/k8s.io/client-go@v0.23.1/tools/cache/reflector.go:167: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0816 15:22:54.664464       1 reflector.go:442] pkg/mod/k8s.io/client-go@v0.23.1/tools/cache/reflector.go:167: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0816 15:22:54.664464       1 reflector.go:442] pkg/mod/k8s.io/client-go@v0.23.1/tools/cache/reflector.go:167: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[ERROR] plugin/errors: 2 youtube.com. AAAA: read udp 10.244.1.236:45116->10.255.255.255:53: i/o timeout
[ERROR] plugin/errors: 2 youtube.com. A: read udp 10.244.1.236:59947->10.255.255.255:53: i/o timeout
[ERROR] plugin/errors: 2 youtube.com. AAAA: read udp 10.244.1.236:33466->10.255.255.255:53: i/o timeout
[ERROR] plugin/errors: 2 youtube.com. A: read udp 10.244.1.236:45049->10.255.255.255:53: i/o timeout
[ERROR] plugin/errors: 2 www.hanlp.com. AAAA: read udp 10.244.1.236:42698->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 www.hanlp.com. AAAA: read udp 10.244.1.236:53056->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 www.hanlp.com. AAAA: read udp 10.244.1.236:34372->168.63.129.16:53: i/o timeout

Following along with the ticket, I noticed that the network security group associated with the Azure subnet was not the relevant AKS security group. I have changed this to match the AKS NSG.
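
The association itself is roughly the following (a sketch; the resource group, VNet, subnet, and NSG names are placeholders):

    az network vnet subnet update \
      --resource-group <resource-group> \
      --vnet-name <vnet-name> \
      --name <aks-subnet-name> \
      --network-security-group <aks-nsg-name>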

callummole commented 2 years ago

FYI, I have raised a support ticket with Azure about this.

consideRatio commented 2 years ago

Related, for reference

manics commented 2 years ago

I regularly see intermittent issues with Turing, sometimes on launch, sometimes after the server is already running (JupyterLab shows the server connection error dialog). The GitHub workflows almost always return an error status for Turing staging and production: the deploy completes but the verification check fails.

Since this is likely to leave a poor impression on users who end up on Turing, should we remove it from federationRedirect?
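
Something along these lines, presumably (a sketch; I'm assuming the host entries under federationRedirect in config/prod.yaml carry a weight field, so setting Turing's weight to 0 would take it out of rotation):

    federationRedirect:
      hosts:
        turing:
          # keep the entry but take it out of the redirect rotation
          weight: 0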

consideRatio commented 2 years ago

I think that would be good!

rgaiacs commented 1 year ago

I believe we can close this given https://github.com/jupyterhub/mybinder.org-deploy/pull/2396 and https://github.com/jupyterhub/mybinder.org-deploy/pull/2528.