jupyterhub / kubespawner

Kubernetes spawner for JupyterHub
https://jupyterhub-kubespawner.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Standalone Pods found when Hub pod restart #829

Closed hh-cn closed 3 months ago

hh-cn commented 3 months ago

Bug description

We have observed a strange phenomenon in the cluster: after the hub restarts, the number of running servers shown on the admin page does not match the number of user pods returned by `kubectl get pods`. Many user pods are in fact still present, but their servers appear as not running on the admin side.

How to reproduce

  1. Deploy a JupyterHub cluster on k8s using kubespawner with the default config
  2. Log in as more than one user
  3. Restart the hub pod, e.g. with `kubectl rollout restart` on the hub deployment
  4. Look at the admin page: only one user still has a running server, but more user pods are found in the k8s namespace

Speculation about the problem

I reviewed the code, and I believe the issue lies here.

  1. when the hub starts, it tries to resume pod/user state from the db

            self.log.debug("Loading state for %s from db", spawner._log_name)
            # signal that check is pending to avoid race conditions
            spawner._check_pending = True
            f = asyncio.ensure_future(check_spawner(user, spawner.name, spawner))
            check_futures.append(f)
    
        # it's important that we get here before the first await
        # so that we know all spawners are instantiated and in the check-pending state
    
        # await checks after submitting them all
        if check_futures:
            self.log.debug(
                "Awaiting checks for %i possibly-running spawners", len(check_futures)
            )
            await asyncio.gather(*check_futures)
  2. the check function check_spawner actually calls the spawner's poll method

        await self._start_watching_pods()
    
        ref_key = f"{self.namespace}/{self.pod_name}"
        pod = self.pod_reflector.pods.get(ref_key, None)
  3. kubespawner's poll uses a shared Pod ResourceReflector to check whether the user pod exists, but the reflector is set to a freshly created object before it has finished starting up

        self.__class__.reflectors[key] = current_reflector = reflector_class(
            parent=self,
            namespace=self.namespace,
            on_failure=on_reflector_failure,
            **kwargs,
        )
        await catch_reflector_start(current_reflector.start())

        if previous_reflector:
            # we replaced the reflector, stop the old one
            asyncio.ensure_future(previous_reflector.stop())

        # return the current reflector
        return current_reflector
  4. so pod_reflector holds an empty cache when the hub tries to fetch the user's pod; the lookup returns None, which makes the hub say "XXX USER appears to have stopped while the Hub was down" and delete that server from the db (a minimal illustration of this race is sketched below)
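
To make the race concrete, here is a minimal, self-contained illustration (not kubespawner code; the names, keys, and timings are invented for the example): a background task plays the role of the reflector filling its cache, while a concurrent read plays the role of poll consulting it too early.

```python
# Minimal illustration of the suspected race (not kubespawner code): a
# background "reflector" task populates a shared cache, while a concurrent
# "poll" reads that cache before the first population has completed.
import asyncio

pods: dict[str, str] = {}  # stands in for pod_reflector.pods

async def reflector_start():
    await asyncio.sleep(0.1)  # simulated initial LIST against the API server
    pods["jupyterhub/jupyter-alice"] = "Running"

async def poll(ref_key: str):
    # Reads the cache immediately, like poll() doing pod_reflector.pods.get(...)
    # right after _start_watching_pods() returns.
    return pods.get(ref_key)

async def main():
    asyncio.ensure_future(reflector_start())
    print(await poll("jupyterhub/jupyter-alice"))  # None -> "appears to have stopped"
    await asyncio.sleep(0.2)
    print(await poll("jupyterhub/jupyter-alice"))  # "Running" once the cache is warm

asyncio.run(main())
```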

But the user pod does exist, and once pod_reflector's start() has finished, the reflector would return the correct state.
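
A possible mitigation, sketched here purely as an assumption and not as the actual kubespawner fix (the attribute name `first_load_future` and the simplified return values are invented for the sketch), would be to wait for the reflector's first successful list before treating an empty cache as authoritative:

```python
# Hypothetical sketch, not the actual kubespawner implementation or fix.
async def poll(self):
    await self._start_watching_pods()

    # ASSUMPTION: the reflector exposes an awaitable that resolves once its
    # initial LIST has populated .pods; the real attribute/behaviour may differ.
    await self.pod_reflector.first_load_future

    ref_key = f"{self.namespace}/{self.pod_name}"
    pod = self.pod_reflector.pods.get(ref_key, None)
    if pod is None:
        # Only now is "not in the cache" trustworthy evidence the pod is gone.
        return 1
    return None  # pod exists, so treat the server as still running
```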

Your personal set up

# paste output of `pip freeze` or `conda list` here

Configuration

```python
# jupyterhub_config.py
```

Logs

```
# paste relevant logs here, if any
```
welcome[bot] commented 3 months ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

consideRatio commented 3 months ago

I think this is https://discourse.jupyter.org/t/how-to-cleanup-orphaned-user-pods-after-bug-in-z2jh-3-0-and-kubespawner-6-0/21677; it's written about in the changelog for the minor releases 3.1, 3.2, and 3.3.

consideRatio commented 3 months ago

Amazing issue writeup @hh-cn!