mozilla-services / Dockerflow

Cloud Services Dockerflow specification

how to do endpoints for non-webapp services #71

Open willkg opened 9 months ago

willkg commented 9 months ago

For services that are not webapps, what does Dockerflow recommend we do for healthcheck endpoints?

For example, the Socorro processor is not a webapp and doesn't respond to HTTP at all, so there's nothing to implement healthcheck endpoints with.

Is it the case that all services must implement a webapp to handle Dockerflow healthcheck endpoints? Should we have something else for non-webapp services?

jwhitlock commented 5 months ago

On https://github.com/mozilla/fx-private-relay, we implemented a pair of Django management commands that act as a liveness check to detect stalled processes.

process_emails_from_sqs.py is a long-running process that loops, polling an AWS SQS queue and processing any emails. It periodically writes a healthcheck file to disk with a timestamp and some data. Email is unpredictable, and the standard library's email parsing expects spec-compliant messages, so uncaught exceptions can cause the process to crash. The AWS client library has some built-in retry logic, so connection issues can appear as a stuck process.
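
For illustration, here's a minimal sketch of that pattern, not Relay's actual code (the file path, field names, and process_email helper are hypothetical; the queue is assumed to be a boto3 SQS Queue resource):

import json
from datetime import datetime, timezone
from pathlib import Path

HEALTH_PATH = Path("/tmp/healthcheck.json")  # hypothetical location


def poll_forever(queue):
    while True:
        # Record that the loop is still making progress.
        HEALTH_PATH.write_text(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "queue": queue.url,
        }))
        # Long-poll SQS; the boto3 client retries some connection errors internally.
        for message in queue.receive_messages(WaitTimeSeconds=10):
            process_email(message.body)  # hypothetical helper; may raise on malformed email
            message.delete()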

check_health.py is a second management command that attempts to read the healthcheck file. If the file doesn't exist, or there is a problem such as the timestamp in the data being too old, it exits with a non-zero error code. If everything is copacetic, it exits with code 0 for success.
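
And a sketch of the corresponding check, assuming the same hypothetical file location and an arbitrary staleness threshold (the real command's logic and options differ):

import json
import sys
from datetime import datetime, timezone
from pathlib import Path

HEALTH_PATH = Path("/tmp/healthcheck.json")  # must match the writer
MAX_AGE_SECONDS = 120                        # hypothetical threshold


def main():
    try:
        data = json.loads(HEALTH_PATH.read_text())
        written = datetime.fromisoformat(data["timestamp"])
    except (OSError, ValueError, KeyError) as exc:
        print(f"healthcheck file unreadable: {exc}", file=sys.stderr)
        sys.exit(1)
    age = (datetime.now(timezone.utc) - written).total_seconds()
    if age > MAX_AGE_SECONDS:
        print(f"healthcheck file is stale ({age:.0f}s old)", file=sys.stderr)
        sys.exit(1)
    sys.exit(0)  # healthy


if __name__ == "__main__":
    main()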

The process_emails_from_sqs.py command is run as a Kubernetes deployment with several replicas. The check_health.py command runs as a liveness probe. The spec looks something like this:

spec:
  containers:
    - command:
        - python
        - manage.py
        - process_emails_from_sqs
      livenessProbe:
        exec:
          command:
            - python
            - /app/manage.py
            - check_health
        failureThreshold: 5
        initialDelaySeconds: 5
        periodSeconds: 6
        successThreshold: 1
        timeoutSeconds: 5

We see hundreds of liveness probe failures a day according to Sentry, but it takes several failures in a row to terminate a process. It is more common for a process to terminate due to an uncaught exception, but the liveness check does prevent zombie replicas from sticking around until the next deployment.

I'm negative on a webservice for each background process, but we could re-implement this as a webservice that runs process_emails_from_sqs.py in a fork, sends health data over a pipe, and serves the health data at /__heartbeat__, with a proper status code for a stalled process. I don't think it would make much sense to expose this webservice to the world, it would just be for making a background service look like a web service.