mozilla-services / Dockerflow

Cloud Services Dockerflow specification

how to do endpoints for non-webapp services #71

Open willkg opened 9 months ago

willkg commented 9 months ago

For services that are not webapps, what does Dockerflow recommend we do for healthcheck endpoints?

For example, the Socorro processor is not a webapp and doesn't respond to HTTP at all, so there's nothing to implement healthcheck endpoints with.

Is it the case that all services must implement a webapp to handle Dockerflow healthcheck endpoints? Should we have something else for non-webapp services?

jwhitlock commented 5 months ago

On https://github.com/mozilla/fx-private-relay, we implemented a pair of Django management commands that act as a liveness check to detect stalled processes.

process_emails_from_sqs.py is a long-running process that loops, polling an AWS SQS queue and processing any emails. It periodically writes a healthcheck file to disk with a timestamp and some data. Email is unpredictable, and the standard library's email parsing expects spec-compliant messages, so uncaught exceptions can cause the process to crash. The AWS client library has some built-in retry logic, so connection issues can appear as a stuck process.
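
For illustration, here's a minimal sketch of that pattern, not Relay's actual code (the file path, field names, and process_email helper are hypothetical; the queue is assumed to be a boto3 SQS Queue resource):

import json
from datetime import datetime, timezone
from pathlib import Path

HEALTH_PATH = Path("/tmp/healthcheck.json")  # hypothetical location


def poll_forever(queue):
    while True:
        # Record that the loop is still making progress.
        HEALTH_PATH.write_text(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "queue": queue.url,
        }))
        # Long-poll SQS; the boto3 client retries some connection errors internally.
        for message in queue.receive_messages(WaitTimeSeconds=10):
            process_email(message.body)  # hypothetical helper; may raise on malformed email
            message.delete()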

check_health.py is a second management command that attempts to read the healthcheck file. If the file doesn't exist, or there is a problem such as the timestamp in the data being too old, it exits with a non-zero error code. If everything is copacetic, it exits with code 0 for success.
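
And a sketch of the corresponding check, assuming the same hypothetical file location and an arbitrary staleness threshold (the real command's logic and options differ):

import json
import sys
from datetime import datetime, timezone
from pathlib import Path

HEALTH_PATH = Path("/tmp/healthcheck.json")  # must match the writer
MAX_AGE_SECONDS = 120                        # hypothetical threshold


def main():
    try:
        data = json.loads(HEALTH_PATH.read_text())
        written = datetime.fromisoformat(data["timestamp"])
    except (OSError, ValueError, KeyError) as exc:
        print(f"healthcheck file unreadable: {exc}", file=sys.stderr)
        sys.exit(1)
    age = (datetime.now(timezone.utc) - written).total_seconds()
    if age > MAX_AGE_SECONDS:
        print(f"healthcheck file is stale ({age:.0f}s old)", file=sys.stderr)
        sys.exit(1)
    sys.exit(0)  # healthy


if __name__ == "__main__":
    main()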

The process_emails_from_sqs.py command is run as a Kubernetes deployment with several replicas. The check_health.py command runs as a liveness probe. The spec looks something like this:

spec:
  containers:
    - command:
        - python
        - manage.py
        - process_emails_from_sqs
      livenessProbe:
        exec:
          command:
            - python
            - /app/manage.py
            - check_health
        failureThreshold: 5
        initialDelaySeconds: 5
        periodSeconds: 6
        successThreshold: 1
        timeoutSeconds: 5

We see hundreds of liveness probe failures a day according to Sentry, but it takes several failures in a row to terminate a process. It is more common for a process to terminate due to an uncaught exception, but the liveness check does prevent zombie replicas from sticking around until the next deployment.

I'm negative on a webservice for each background process, but we could re-implement this as a webservice that runs process_emails_from_sqs.py in a fork, sends health data over a pipe, and serves the health data at /__heartbeat__, with a proper status code for a stalled process. I don't think it would make much sense to expose this webservice to the world, it would just be for making a background service look like a web service.