learningequality / ka-lite

KA Lite: lightweight web server for serving core Khan Academy content (videos and exercises) without needing internet connectivity
https://learningequality.org/ka-lite/

Check if PID file is valid before refusing to start #5424

Closed benjaoming closed 7 years ago

benjaoming commented 7 years ago

Summary

Feedback from @case485:

One other thing I have noticed in locations like Haiti, where we have frequent power outages, is that the kalite.pid file is sometimes not removed when power is lost. When the machine reboots, kalite sees the kalite.pid file in the config directory and will not start until I manually remove it. I was going to add a forced removal to one of the startup services just to be sure.

The solution would be to check whether the PID in the file exists and, if it does, whether that process is actually a kalite process.
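A minimal sketch of such a check, not KA Lite's actual implementation. It assumes a POSIX system, a Linux-style `/proc` filesystem, and that the first line of the PID file holds the PID. The names `process_is_alive`, `read_cmdline`, `pid_file_is_valid` and the `marker` parameter are hypothetical and only serve the illustration.

```python
import os


def process_is_alive(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking; nothing is delivered
    except (ProcessLookupError, OverflowError):
        return False  # no such process, or the number is not a representable PID
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_cmdline(pid):
    """Return the command line of a PID as a list of strings (assumes Linux /proc)."""
    try:
        with open("/proc/%d/cmdline" % pid, "rb") as f:
            raw = f.read()
    except OSError:
        return []
    return [part.decode(errors="replace") for part in raw.split(b"\0") if part]


def pid_file_is_valid(pid_file, marker="kalite"):
    """True only if the PID file names a live process whose command line mentions `marker`."""
    try:
        with open(pid_file) as f:
            pid = int(f.readline().strip())  # assumption: PID is on the first line
    except (OSError, ValueError):
        return False  # missing, unreadable or garbled PID file
    if not process_is_alive(pid):
        return False  # stale file left over from an unclean shutdown
    return any(marker in arg for arg in read_cmdline(pid))
```

With a helper like this, a startup script could treat the PID file as stale whenever `pid_file_is_valid(...)` returns `False`, instead of removing the file unconditionally.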

System information

This affects all previous releases of KA Lite since the kalite command was introduced.

benjaoming commented 7 years ago

Just to emphasize: this is not a general bug. It only happens when another process happens to occupy the PID stored in the kalite.pid file.

    # PID file exists, but process is dead
    if not pid_exists(pid):
        if os.path.isfile(STARTUP_LOCK):
            raise NotRunning(STATUS_FAILED_TO_START)  # Failed to start
        raise NotRunning(STATUS_UNCLEAN_SHUTDOWN)  # Unclean shutdown

However, as PIDs are assigned sequentially, it is very likely that another process is occupying a PID from a previous boot.

benjaoming commented 7 years ago

Furthermore, we do already check if an active KA Lite server is running. If not, we also assume that KA Lite isn't running!

    # Timeout is 3 seconds, we don't want the status command to be slow
    conn = httplib.HTTPConnection("127.0.0.1", listen_port, timeout=3)
    try:
        conn.request("GET", PING_URL)
        response = conn.getresponse()
    except (timeout, socket.error):
        raise NotRunning(STATUS_NOT_RESPONDING)
    except (httplib.HTTPException, URLError):
        if os.path.isfile(STARTUP_LOCK):
            raise NotRunning(STATUS_STARTING_UP)  # Starting up
        raise NotRunning(STATUS_UNCLEAN_SHUTDOWN)
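For readers on Python 3, the same idea can be sketched with `http.client`. This is a simplified illustration rather than the KA Lite code: the function name `server_responds`, the default `path` and the decision to treat only HTTP 200 as "running" are assumptions, and the real ping URL is not shown in this thread.

```python
import http.client


def server_responds(port, path="/", host="127.0.0.1", timeout=3):
    """Return True if an HTTP server answers GET `path` with 200 within `timeout` seconds."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts and malformed HTTP responses
        return False
    finally:
        conn.close()
    return response.status == 200
```

A PID that belongs to an unrelated process will not answer on the KA Lite port, so a check along these lines reports "not running" even when the PID in kalite.pid is alive.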

benjaoming commented 7 years ago

The essence is that if start() does not conclude that an active PID exists from kalite.pid, it will continue to start:

    try:
        if get_pid():
            sys.stderr.write("Refusing to start: Already running\n")
            sys.stderr.write("Use 'kalite stop' to stop the instance.\n")
            sys.exit(1)
    except NotRunning:
        pass

benjaoming commented 7 years ago

I ran the following two tests to verify that the described scenario does not occur:

1) Put an invalid PID number in ~/.kalite/kalite.pid to check that it still starts:

✗ echo "123123123" > ~/.kalite/kalite.pid
✗ kalite start                           
Running 'kalite start' as daemon (system service)
Going to daemon mode, logging to /home/benjamin/.kalite/server.log

To access KA Lite from this machine, try the following address:
    http://127.0.0.1:8008/

2) Put a valid PID number in ~/.kalite/kalite.pid that does NOT refer to a valid kalite instance to check that it still starts:

✗ echo "16288" > ~/.kalite/kalite.pid
✗ kalite start
Running 'kalite start' as daemon (system service)
Going to daemon mode, logging to /home/benjamin/.kalite/server.log

To access KA Lite from this machine, try the following address:
    http://127.0.0.1:8008/

I'm closing this issue for now because I can't make any further progress. It sincerely looks like the handling of kalite.pid is as good as it gets.

Advice

Firstly, you should of course check that KA Lite isn't somehow started twice :) But okay, I guess you probably have that covered.

In any other case, assuming that you're running ka-lite-raspberry-pi, you would have to look at the log output from systemd like so:

sudo journalctl -u ka-lite.service

If it just shows that KA Lite is starting normally (and you know it hasn't), refer to /home/pi/.kalite/server.log, which contains the log output from the daemon.