jupyterhub / zero-to-jupyterhub-k8s

Helm Chart & Documentation for deploying JupyterHub on Kubernetes
https://zero-to-jupyterhub.readthedocs.io

Readiness/Liveness setup challenge #1357

Closed consideRatio closed 4 years ago

consideRatio commented 4 years ago

I recently merged #1004, which had successful CI tests, but then reverted the merge in #1356 after concluding that my upgrade failed.

I'm not confident about what is going on. My hub and proxy pods weren't entering a Ready state after I merged this PR. I'll investigate things further, but for now I figure I'll revert the PR so it doesn't cause disruptions for others like me.

Events:
  Type     Reason     Age                From                                         Message
  ----     ------     ----               ----                                         -------
  Normal   Scheduled  78s                default-scheduler                            Successfully assigned jupyterhub/proxy-7d4dcc5c64-pqv9w to gke-ds-platform-core-4d566784-54l4
  Normal   Pulled     76s                kubelet, gke-ds-platform-core-4d566784-54l4  Container image "jupyterhub/configurable-http-proxy:4.1.0" already present on machine
  Normal   Created    76s                kubelet, gke-ds-platform-core-4d566784-54l4  Created container
  Normal   Started    76s                kubelet, gke-ds-platform-core-4d566784-54l4  Started container
  Warning  Unhealthy  10s (x7 over 70s)  kubelet, gke-ds-platform-core-4d566784-54l4  Readiness probe failed: Get http://10.72.0.84:8000/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Events:
  Type     Reason     Age               From                                         Message
  ----     ------     ----              ----                                         -------
  Normal   Scheduled  31s               default-scheduler                            Successfully assigned jupyterhub/hub-b9dccf7f7-htrxq to gke-ds-platform-core-4d566784-lspf
  Normal   Pulled     29s               kubelet, gke-ds-platform-core-4d566784-lspf  Container image "jupyterhub/k8s-hub:0.9-dcde99a" already present on machine
  Normal   Created    29s               kubelet, gke-ds-platform-core-4d566784-lspf  Created container
  Normal   Started    29s               kubelet, gke-ds-platform-core-4d566784-lspf  Started container
  Warning  Unhealthy  6s (x3 over 26s)  kubelet, gke-ds-platform-core-4d566784-lspf  Liveness probe failed: Get http://10.72.2.185:8081/health: dial tcp 10.72.2.185:8081: connect: connection refused
  Warning  Unhealthy  1s (x3 over 21s)  kubelet, gke-ds-platform-core-4d566784-lspf  Readiness probe failed: Get http://10.72.2.185:8081/health: dial tcp 10.72.2.185:8081: connect: connection refused

I wonder if this is related to:

  1. Require use of HTTPS, and a potential need to configure the liveness/readiness probe with scheme: HTTPS. UPDATE: No, I'm quite confident this isn't it after trying to access the endpoints from another pod. Using HTTP should be fine.
  2. My nginx-ingress-controller that routes traffic, but then... why? This request should be coming from the kubelet running on the node, and it asks for the pod IP directly, so why would the nginx-ingress-controller be involved in routing the traffic? Related to this: providing a Host: header, since this controller routes based on the web request's Host header.
  3. Something about a redirect in the response? UPDATE: No, I think k8s accepts any response code from 200 up to (but not including) 400, and besides, the error says timeout.
  4. My hub pod could perhaps be stuck in an unresponsive state, waiting for singleuser-server pods without that wait timing out fast enough?
consideRatio commented 4 years ago

I think the problem is twofold: the proxy's readiness probe, which redirects to the hub's health check, and the hub pod's own liveness/readiness checks.

consideRatio commented 4 years ago

Hmmmm, yes, okay, so I got it to work again later... Why? I think my hub pod failed to enter a responsive state because it got stuck waiting for responses from singleuser servers, was then restarted due to failed liveness checks, got stuck again, and so on. So why does it get stuck waiting for the singleuser servers like this?

I figure a good option for now could be to make these liveness probes optional, or at least have a flag to disable them.

manics commented 4 years ago

Do you know if your hub was completely unresponsive, so even if the health checks were disabled it'd still be unusable, or if the health check was giving an incorrect response?

consideRatio commented 4 years ago

I'm not sure yet @manics; it is a bit hard to reproduce, and trying to do so will probably cause some user disruption. I'll keep testing these probes live.

@manics I've now dug into the code base and can conclude that the tornado handlers are not initialized until various other methods have completed, and those can take 30+ seconds. I'm now quite confident my issue arose from this, so the key point would be to resolve that.

consideRatio commented 4 years ago

Suggested fixes

Fix 1 - /health should go to /hub/health

If we don't, we end up with lots of redirects. I think this is fine no matter what, but there is no point in bouncing around. I've also learned that if the redirect is to the same pod it will be followed, but if the redirect is to another pod it won't be followed after k8s 1.14.1, I think. This could cause issues in the future; I'm not 100% sure. Anyhow, let's avoid it by directly requesting /hub/health instead of /health.
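
For concreteness, here is a rough sketch of what I mean for the hub pod's probes. Port 8081 is taken from the probe failures in the events above; the exact placement of these fields in the chart's hub deployment template is an assumption on my part.

    # Hedged sketch: point the hub's probes straight at /hub/health to avoid redirects.
    livenessProbe:
      httpGet:
        path: /hub/health
        port: 8081
    readinessProbe:
      httpGet:
        path: /hub/health
        port: 8081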

Hmmm, but at the same time... is the /hub prefix something that will remain the same for all z2jh deployments? I'm not 100% sure about this and would really appreciate input here.

Fix 2 - adjust liveness check params

Details about liveness/readiness probes can be found in the Kubernetes docs. I think the failure I ended up with relates to quite tight constraints on these: the hub pod did not become ready within 30 seconds (3 failure periods * 10 seconds per period + 0 initial delay seconds).

I think increasing the liveness probe's failureThreshold to 5 may be enough to resolve my issue.

The default values for the liveness/readiness probes are...

    // Kubernetes' built-in defaults for probe fields
    expectedProbe := v1.Probe{
        InitialDelaySeconds: 0,
        TimeoutSeconds:      1,
        PeriodSeconds:       10,
        SuccessThreshold:    1,
        FailureThreshold:    3,
    }
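
In manifest terms, here is a hedged sketch of relaxed hub liveness settings. These are standard Kubernetes probe fields; the values mirror the defaults quoted above except for failureThreshold, and where exactly they would be wired into the chart is an assumption.

    livenessProbe:
      httpGet:
        path: /hub/health
        port: 8081
      initialDelaySeconds: 0
      periodSeconds: 10
      timeoutSeconds: 1
      successThreshold: 1
      failureThreshold: 5  # raised from the default of 3, tolerating ~50s instead of ~30s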

Fix 3 - adjust timeouts blocking hub's tornado endpoint startup

Before tornado starts serving anything, other functions like init_spawners() need to finish.

https://github.com/jupyterhub/jupyterhub/blob/915664ede2081838ae18f19806d7b8ecc422eae6/jupyterhub/app.py#L2159-L2162

But init_spawners() awaits multiple requests sent out to all pre-existing pods via multiple check_spawner() calls.

https://github.com/jupyterhub/jupyterhub/blob/915664ede2081838ae18f19806d7b8ecc422eae6/jupyterhub/app.py#L1924-L1937

and the check_spawner() calls await the _wait_up() function defined on the User object.

https://github.com/jupyterhub/jupyterhub/blob/915664ede2081838ae18f19806d7b8ecc422eae6/jupyterhub/app.py#L1893-L1901

and the _wait_up() function times out after c.KubeSpawner.http_timeout seconds.

https://github.com/jupyterhub/jupyterhub/blob/915664ede2081838ae18f19806d7b8ecc422eae6/jupyterhub/user.py#L663-L668

The default value of http_timeout is not modified by z2jh, and is therefore the KubeSpawner default of 30 seconds, as seen in KubeSpawner's docs:

https://jupyterhub-kubespawner.readthedocs.io/en/latest/spawner.html

I think this is why I ran into issues. My hub pod got stuck waiting for a response from a server it could not reach, and after a while the liveness probe restarted the hub pod. So changing http_timeout has a big impact on hub pod startup time, which in turn determines whether the hub's liveness probe settings can make sense.
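
As an example, something like the following in the chart's values could shorten how long init_spawners() can block startup. hub.extraConfig is an existing chart value, but the key name and the specific timeout value here are only illustrative.

    hub:
      extraConfig:
        # Hypothetical snippet name; the timeout value is arbitrary and only for illustration.
        shorterHttpTimeout: |
          c.KubeSpawner.http_timeout = 10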

Questions raised

Q1 - can users with started singleuser servers continue to work while the hub pod is down?

If we couple the proxy's readiness to the hub, the proxy-public service returns "service unavailable" whenever the hub pod isn't responding. Do we really want this? Is it possible for the proxy to let users keep working while the hub pod is down?

I can imagine that the hub pod needs to verify authentication and authorization for access, but I can also imagine that the proxy pod can know a request is OK, at least for a while, assuming a suitable cookie is set... Hmmm... I created Q3 to represent this question.

Q2 - do we avoid circular issues?

If the hub relies on the proxy in order to set itself up, and the proxy isn't reachable because its readiness probe references the hub, then we have a serious stability issue.

I investigated this to some degree, and I think we are good. When the hub starts, it runs initialize() and then start(), where it actually binds to the network interface so traffic can start to arrive and it can become responsive. During initialize(), it runs checks on all pre-existing routes it knows of from its stored state. After that, within start() but after it has started listening for requests, it updates the proxy to route properly via the check_routes() function and sets up a periodic call to that function.

From what I understand, the verification calls in check_spawner() are also made directly to the pods and don't go through the proxy, as can be guessed from the hub logs. Below are some hub log lines showing responses from different pod IPs, so it really does seem like this traffic does not go through the proxy.

[D 2019-08-16 09:30:27.080 JupyterHub utils:218] Server at http://10.72.7.38:8888/user/redacted-user-0/ responded with 302
[D 2019-08-16 09:30:27.083 JupyterHub utils:218] Server at http://10.72.7.37:8888/user/redacted-user-1/ responded with 302
[D 2019-08-16 09:30:27.084 JupyterHub utils:218] Server at http://10.72.7.39:8888/user/redacted-user-2/ responded with 302
[D 2019-08-16 09:30:27.086 JupyterHub utils:218] Server at http://10.72.7.41:8888/user/redacted-user-3/ responded with 302

Answer: I think we avoid circular issues where the hub pod relies on the proxy pod and vice versa, but I'm not confident about that. We should be fine if we configure the proxy's readiness probe to hit the hub's /hub/health endpoint.

Q3 - what is the authorization flow for a user asking the proxy for access to its server?

Will the proxy always ask the hub if the user has access to the user pod or will it sometimes use cached credentials?

Q4 - is the proxy decoupling itself from traffic?

It is my understanding that the proxy pod will redirect the user directly to the pod, but I'm a bit confused here...

Hmmm, okay, I tried accessing myhub.example.com/user/some-other-user. That worked, and I ended up at JupyterHub being asked for permission to provide that user server with information about my already-logged-in user. This means that the proxy will proxy no matter what, and if the user server isn't provided with a suitable cookie granting access, then that's where the issue lies.

So I therefore think the proxy channels all the traffic through itself, and I therefore see no reason for the user server to be inaccessible if the hub pod is down for a while. From this I conclude that the readiness of the proxy should not be tied to the hub pod being up.

I think the readiness of the proxy pod could be allowed to be /_chp_healthz as well; the proxy should be able to shuffle traffic to user pods without the hub being up. Hmmm... I think what matters is that the proxy has been configured by the hub at some point, and that it is alive according to /_chp_healthz. A loss of connectivity to the hub should not crash the proxy, but never having received information from the hub shouldn't count as ready either. I'm curious to learn more about the /_chp_healthz endpoint... Okay, so the _chp_healthz endpoint returns OK no matter what, as long as it is reachable at all; nothing fancy. That aligns quite well with "the proxy is able to parse requests at all", but it does not imply that it has had time to configure itself.
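
For concreteness, a proxy readiness probe tied only to CHP itself could look roughly like this. /_chp_healthz is CHP's own health endpoint, port 8000 is taken from the proxy events above, and the exact template wiring is a sketch.

    # Hedged sketch: readiness depends only on CHP answering, not on the hub.
    readinessProbe:
      httpGet:
        path: /_chp_healthz
        port: 8000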

Hmmm... So I think pods are accessible without the readiness probe being OK, but services won't send traffic to them until it is. Hmmm... I'm thinking about switchover scenarios, for example a rolling upgrade of proxy pods, where one starts up alongside another... Hmmm...

Inspecting the proxy pod, it starts up with a default route to the hub, but it also needs to be actively configured by the hub after its startup. But the hub really only configures the proxy through its k8s service, which will resolve to one of the already-ready proxy pods!

Conclusions so far: proxy pods need to be configured by the hub, but this is not trivial to do well considering there may be multiple proxy pods.

If we could instead configure the routes externally to the pods and let the pods read that information, we would avoid such challenges. We could also avoid them by having the proxy pods poll the hub for route configuration. Hmm...

Anyhow, at the moment we need new proxy pods to be able to get configured at all in the first place, and that at least requires them to be ready. To avoid circular issues, they would need to self-configure, or at least not depend on a web request reaching them before they are ready.

I've learned that if we don't define a readiness probe, pods are considered ready right away once all their containers have started.

manics commented 4 years ago

https://jupyterhub.readthedocs.io/en/stable/reference/separate-proxy.html implies the proxy will allow access to a singleuser pod if the hub subsequently goes down. I guess the problem is the proxy service is actually two services, one for the hub and one for the servers.

I think /hub needs to have the base_url prefix if set, since this is defined in JupyterHub (as opposed to some other apps where the service always runs under the same prefix and the proxy takes care of mapping URLs).

manics commented 4 years ago

The other way to look at it is the proxy is "just a proxy": each individual singleuser server and hub pod are all separate services, so the proxy health check should reflect only whether the proxy is available regardless of the backend services. This means it should present a nice error page to the user if the hub isn't available.

consideRatio commented 4 years ago

Current suggestion

Long term suggestions

We should ensure that the proxy pods are configured more reliably. They should get configured even when there is more than one of them, and they should be able to get configured while not yet ready and not yet receiving traffic from a service. They could, for example, become ready once they have read what routes are available from a ConfigMap or similar, as sketched below.
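
Something along these lines is what I have in mind, purely as an illustrative sketch: nothing like this exists in the chart today, and the resource name, route targets, and data layout are made up for the sake of discussion.

    # Hypothetical ConfigMap that proxy pods could read their routes from at startup.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: proxy-routes  # made-up name
    data:
      routes.json: |
        {
          "/": {"target": "http://hub:8081"},
          "/user/example-user/": {"target": "http://10.72.7.38:8888"}
        }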

consideRatio commented 4 years ago

I think /hub needs to have the base_url prefix if set, since this is defined in JupyterHub (as opposed to some other apps where the service always runs under the same prefix and the proxy takes care of mapping URLs).

I'm not confident about what goes on here, but the default Helm value we provide is hub.baseUrl: /, and that configures c.JupyterHub.base_url. In a service template for the hub, I notice an annotation like this: prometheus.io/path: {{ .Values.hub.baseUrl }}hub/metrics.

With this in mind, I guess we should configure the probes to visit not the hardcoded path /hub/health but {{ .Values.hub.baseUrl }}hub/health instead. Do you think that makes sense, @manics?
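
Concretely, in the hub deployment template I imagine something like this; the indentation and surrounding context are just a sketch.

    readinessProbe:
      httpGet:
        path: {{ .Values.hub.baseUrl }}hub/health
        port: 8081
    livenessProbe:
      httpGet:
        path: {{ .Values.hub.baseUrl }}hub/health
        port: 8081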

manics commented 4 years ago

@consideratio yes, that sounds right!

consideRatio commented 4 years ago

This works fine after bumping to the latest JupyterHub in #1422.