jupyterhub / binderhub

Run your code in the cloud, with technology so advanced, it feels like magic!
https://binderhub.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Change Health API reply when node is SchedulingDisabled #1705

Open rgaiacs opened 1 year ago

rgaiacs commented 1 year ago

Consider a small Kubernetes cluster with 2 nodes: Node 1 runs the BinderHub API, and Node 2 runs repo2docker and JupyterHub. When container image cleaning starts on Node 2, the node is marked as SchedulingDisabled. With no other node able to run repo2docker and JupyterHub, the health API should report unhealthy.

cc @arnim

Steps to reproduce

$ curl https://notebooks.gesis.org/binder/health | python3 -m json.tool
{
    "ok": true,
    "checks": [
        {
            "service": "Docker registry",
            "ok": true
        },
        {
            "service": "JupyterHub API",
            "ok": true
        },
        {
            "service": "Pod quota",
            "total_pods": 32,
            "build_pods": 0,
            "user_pods": 32,
            "quota": 40,
            "ok": true,
            "_ignore_failure": true
        }
    ]
}
$ kubectl get nodes
NAME             STATUS   ROLES           AGE   VERSION
spko-css-app03   Ready    <none>          34d   v1.26.3
svko-ilcm03      Ready    control-plane   48d   v1.26.3
$ kubectl cordon spko-css-app03
node/spko-css-app03 cordoned
$ kubectl get nodes
NAME             STATUS                     ROLES           AGE   VERSION
spko-css-app03   Ready,SchedulingDisabled   <none>          34d   v1.26.3
svko-ilcm03      Ready                      control-plane   48d   v1.26.3
$ curl https://notebooks.gesis.org/binder/health | python3 -m json.tool

Observed Output

{
    "ok": true,
    "checks": [
        {
            "service": "Docker registry",
            "ok": true
        },
        {
            "service": "JupyterHub API",
            "ok": true
        },
        {
            "service": "Pod quota",
            "total_pods": 33,
            "build_pods": 0,
            "user_pods": 33,
            "quota": 40,
            "ok": true,
            "_ignore_failure": true
        }
    ]
}

Expected Output

{
    "ok": false,
    "checks": [
        {
            "service": "Docker registry",
            "ok": true
        },
        {
            "service": "JupyterHub API",
            "ok": true
        },
        {
            "service": "Pod quota",
            "total_pods": 33,
            "build_pods": 0,
            "user_pods": 33,
            "quota": 40,
            "ok": false,
            "_ignore_failure": true
        }
    ]
}
minrk commented 1 year ago

This might be a little tricky to implement, but I suppose the builder class could have a "builders available" method? The abstractions make it quite tricky, because how to check whether builders are available depends on how BinderHub is deployed (i.e. in the helm config, outside the BinderHub config). You'll need to know which nodes to check for their scheduling status, if any.
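
To make the idea concrete, here is a rough sketch of what such a check might look like. It is not BinderHub code: `node_is_schedulable` and `builders_available` are hypothetical names, and the "Build nodes" check entry is invented for illustration. It assumes the node objects have already been fetched (for example the `items` list from `kubectl get nodes -o json`) and that the deployment can say which node names are allowed to run builds, which is exactly the part that lives outside the BinderHub config. A cordoned node has `spec.unschedulable` set to true.

```python
from typing import Any, Iterable, Mapping, Optional, Set


def node_is_schedulable(node: Mapping[str, Any]) -> bool:
    """Return True if the node is Ready and has not been cordoned."""
    # `kubectl cordon` sets spec.unschedulable to true on the node.
    if node.get("spec", {}).get("unschedulable", False):
        return False
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def builders_available(
    nodes: Iterable[Mapping[str, Any]],
    build_node_names: Optional[Set[str]] = None,
) -> dict:
    """Build a health-check entry (hypothetical shape) for build nodes.

    `build_node_names` is an assumption: the set of nodes that may run
    builds and user pods. If it is None, every node is considered.
    """
    schedulable = 0
    for node in nodes:
        name = node.get("metadata", {}).get("name")
        if build_node_names is not None and name not in build_node_names:
            continue
        if node_is_schedulable(node):
            schedulable += 1
    return {
        "service": "Build nodes",
        "schedulable_nodes": schedulable,
        "ok": schedulable > 0,
    }
```

With the two nodes from the report, passing only `spko-css-app03` as a build node and marking it cordoned would yield `"ok": false`, since no schedulable build node is left. How the list of build nodes is obtained (node names, labels, a selector from the helm chart) remains deployment-specific.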