informatics-lab / aml-jupyterhub

Code work and experiments to integrate Azure Machine Learning with JupyterHub.
6 stars 4 forks source link

Fix JupyterHub shutdowns following stopping of user servers #46

Closed consideRatio closed 3 years ago

consideRatio commented 3 years ago

Hi @pierocor, @nbarlowATI, and @tam203!

You had an issue with JupyterHub shutting down when a user server was stopped. This was because JupyterHub received a SIGTERM signal even though why that happened still isn't clear to me. The SIGTERM signal followed calling .terminate() on a process created with the Python library multiprocessing.

I think I've found a one-line fix to this issue: to use .kill() instead of .terminate().

I don't yet understand why JupyterHub receives a SIGTERM in the first place by calling .terminate(), and I think that is the root cause of the issue. With this fix, I believe we avoid because we stop sending SIGTERM to our separate processes. I also believe that we don't run into a similar problem with SIGKILL because I don't think it is passed to the process for interpretation as it is associated with forceful termination of the process rather than a "please terminate" request.

Using .kill(), the redirection webserver for the user server is killed while JupyterHub keeps running, so this issue seem resolved.

Reference

Here are some of the notes from our last chat.

test_redirector_jupyterhub_config.py

During debugging I used this jupyterhub config file to reproduce the issue with less overhead.

"""
This jupyterhub config file have been used to debug the starting and stopping of
separate processes for each user server. These processes were running a
primitive webserver responding with 302 redirects to the actual server.

There have been an issue where calling terminate() on the started process as
part of stopping the spawned user server led to JupyterHub receiving a SIGTERM
and shutting down. Why JupyterHub stops wasn't clear, but the goal was to just
stop the process running a primitive webserver responding with 302 redirects.

This config contains a dummy setup with logic stripped from redirector.py to
test a user server start/stop cycle, where stop has led to the unwanted
termination of JupyterHub itself.

The test cycle I've used was like this.

    # start jupyterhub 
    jupyterhub --config test_redirector_jupyterhub_config.py

    # visit jupyterhub and start a server
    # (ignore redirection loop)
    # visit jupyterhub again and stop the server
    # observe jupyterhub receiving a SIGTERM and no longer responding

    # cleanup after crash
    pkill -9 jupyterhub

While a lot of unknown remains, I concluded that using .kill() instead of
.terminate() on the created process avoided JupyterHub observing a SIGTERM
signal and shutting down, which was the unwanted behavior.
"""

import asyncio
import http.server
import multiprocessing
import socketserver

from jupyterhub.spawner import Spawner

class DummyRedirectSpawner(Spawner):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redirect_server = None

    async def start(self):
        url = "https://google.se"
        route = RedirectServer.get_existing_redirect(url)
        if route:
            self.log.info("Existing route to compute instance found.")
        else:
            self.log.info("Creating route to compute instance.")
            self.redirect_server = RedirectServer(url)
            self.redirect_server.start()
            await asyncio.sleep(1)
            route = self.redirect_server.route
            self.log.info("Route to compute instance created.")
        return route

    async def stop(self):
        if self.redirect_server:
            self.log.info(f"Stopping the redirect server route: {self.redirect_server.route}.")
            self.redirect_server.stop()
            self.redirect_server = None

    async def poll(self):
        if self.redirect_server:
            return None
        else:
            return 0

def redirect_handler_factory(url):
    class RedirectHandler(http.server.SimpleHTTPRequestHandler):
        def do_GET(self):
            self.send_response(302)
            self.send_header('Location', url)
            self.end_headers()

    return RedirectHandler

def _create_server(url, port):
    port = port
    with socketserver.TCPServer(("", port), redirect_handler_factory(url)) as httpd:
        print("serving at port", port)
        httpd.serve_forever()

class RedirectServer:
    _start_port = 9001
    _redirects = {}

    @classmethod
    def _get_free_port(cls):
        port = cls._start_port
        taken = list(cls._redirects.values())
        while port in taken:
            port += 1
        return port

    @classmethod
    def get_existing_redirect(cls, url):
        port = cls._redirects.get(url)
        return ("0.0.0.0", port) if port else None

    def __init__(self, redirect_to_url):
        self.url = redirect_to_url
        super().__init__()

    def start(self):
        self.port = RedirectServer._get_free_port()
        self.server_process = multiprocessing.Process(
            target=_create_server,
            name=f"redirector-port-{self.port}",
            args=[self.url, self.port],
            daemon=True,
        )
        self.server_process.start()

        RedirectServer._redirects[self.url] = self.port

    def stop(self):
        del RedirectServer._redirects[self.url]
        # Using .kill() instead of .terminate(), as the latter apparently leads
        # to JupyterHub observing a SIGTERM signal which causes it to shut down.
        self.server_process.kill()

    @property
    def route(self):
        return ("0.0.0.0", self.port)

c.JupyterHub.authenticator_class = "dummy"
c.JupyterHub.spawner_class = DummyRedirectSpawner
c.JupyterHub.redirect_to_server = False