tpitale / mail_room

Forward mail from gmail IMAP to a callback URL or job worker, simply.
MIT License

Add liveness health check support #116

Open stanhu opened 3 years ago

stanhu commented 3 years ago

When MailRoom is run in Kubernetes, we have found occasions where MailRoom appears to have attempted to stop running, but Net::IMAP is stuck waiting for threads (https://github.com/ruby/net-imap/issues/14).

This commit adds an HTTP liveness checker to enable detection of a terminated MailRoom pod.
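For illustration, here is a minimal sketch of what an HTTP liveness endpoint could look like. It is not the code from this pull request: the `LivenessServer` class, the `/liveness` path, and the use of a plain `TCPServer` (rather than the WEBrick server mentioned later in the thread) are all assumptions, chosen to keep the example standard-library only.

```ruby
require "socket"

# Hypothetical liveness endpoint; not MailRoom's actual class or API.
class LivenessServer
  attr_reader :port

  def initialize(host: "127.0.0.1", port: 0)
    # Port 0 asks the OS for any free port; a real deployment would fix it.
    @server = TCPServer.new(host, port)
    @port = @server.addr[1]
  end

  # Accept connections on a background thread until #stop is called.
  def start
    @thread = Thread.new do
      loop { handle(@server.accept) }
    rescue IOError, Errno::EBADF
      # The listening socket was closed by #stop; let the thread end.
    end
    self
  end

  def stop
    @server.close
    @thread.join if @thread
  end

  private

  def handle(client)
    request_line = client.gets.to_s
    # Read and discard the request headers up to the blank line.
    while (line = client.gets) && line != "\r\n"; end

    path = request_line.split[1]
    status, body =
      path == "/liveness" ? ["200 OK", "OK\n"] : ["404 Not Found", "Not Found\n"]

    client.write("HTTP/1.1 #{status}\r\n" \
                 "Content-Type: text/plain\r\n" \
                 "Content-Length: #{body.bytesize}\r\n" \
                 "Connection: close\r\n\r\n#{body}")
  rescue SystemCallError
    # The client went away mid-request; nothing useful to do in a sketch.
  ensure
    client.close
  end
end
```

A Kubernetes `httpGet` liveness probe could then be pointed at the chosen port and path. Once `stop` has run, connections are refused and the probe fails.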

tpitale commented 3 years ago

Could this be accomplished if we added something that responded to SIGINFO or something like that?

stanhu commented 3 years ago

I think SIGINFO is a BSD construct, so this would only be supported in macOS or FreeBSD. Using signals in general isn't the most cross-platform friendly way of doing monitoring.

tpitale commented 3 years ago

It doesn't need to be SIGINFO specifically. Linux supports signals generally. And I think there are a lot of other issues with mail_room that would prevent running on Windows anyway.
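To make the signal idea concrete, here is a rough sketch using `SIGUSR1`, which is available on Linux as well as the BSDs. The `StatusSignal` module and the message it prints are hypothetical and exist only for this example. The handler sets a flag and nothing more, because Ruby restricts what can safely run inside a trap handler.

```ruby
# Hypothetical helper; mail_room does not ship anything with this name.
module StatusSignal
  @requested = false

  class << self
    # Install a handler that only records that a report was requested.
    def install(signal = "USR1")
      Signal.trap(signal) { @requested = true }
    end

    # Called from the main loop: print a status line if a signal arrived.
    # Returns true when a report was written, false otherwise.
    def report_if_requested(io = $stdout)
      return false unless @requested

      @requested = false
      io.puts("mail_room alive, pid #{Process.pid}")
      true
    end
  end
end
```

An operator would run `kill -USR1 <pid>` and read the process output. Note that something still has to send the signal and interpret the reply, which is awkward to wire into a Kubernetes probe.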

tpitale commented 3 years ago

In general, I'm apprehensive about adding webrick and a web service into the mix.

I'd rather have another repo/project that could provide a web interface of some sort, that was able to query mailroom through some other means 🤔

tpitale commented 3 years ago

Would it be preferable instead to have mail_room report out, like a heartbeat? I feel like that would require less code and generally be safer.
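One possible shape for such a heartbeat, sketched under assumptions: the process touches a file at a fixed interval, and a separate checker treats the process as dead when the file is too old. The `Heartbeat` class, the file-based transport, and the parameter names are all hypothetical; the comment above does not say how the heartbeat would be delivered.

```ruby
require "fileutils"

# Hypothetical push-style heartbeat; not part of mail_room.
class Heartbeat
  def initialize(path, interval: 1.0)
    @path = path
    @interval = interval
  end

  # Touch the heartbeat file on a background thread, once per interval.
  def start
    @thread = Thread.new do
      loop do
        FileUtils.touch(@path)
        sleep @interval
      end
    end
    self
  end

  def stop
    return unless @thread

    @thread.kill
    @thread.join
  end

  # Run by a separate checker: true when the file is missing or too old.
  def self.stale?(path, max_age:)
    !File.exist?(path) || (Time.now - File.mtime(path)) > max_age
  end
end
```

The second half, `Heartbeat.stale?`, has to run somewhere outside the monitored process for the check to mean anything.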

stanhu commented 3 years ago

In this case, I think that is more complex because now you need to have a separate process that determines whether the process is alive. HTTP liveness probes are a common practice in Kubernetes: https://www.magalix.com/blog/kubernetes-and-containers-best-practices-health-probes

As for push vs. pull for metrics, Prometheus has written extensively about why they prefer a pull model for monitoring, particularly for detecting a downed service:

  1. https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push
  2. https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/?utm_source=thenewstack&utm_medium=website&utm_campaign=platform.

tpitale commented 2 years ago

Okay. Re-reviewing this.

If we're going to do it, I'd like to change a few things.

  1. I'd like to be more explicit that this is an HTTP health check, so that if we add other kinds of health checks down the line, this won't have to move in the configuration or naming
  2. I'd like the default not to be nil, but a NoopHealthCheck. I've been trying to avoid nil configuration and the `&.` pattern
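The two requests above could be sketched roughly as follows. Only the name `NoopHealthCheck` comes from this comment; the `HttpHealthCheck` and `Configuration` classes, the `:http_health_check` option key, and the `run`/`quit` interface are assumptions made for illustration and may not match the final pull request.

```ruby
# Null object: same interface as a real health check, but does nothing.
class NoopHealthCheck
  def run; end
  def quit; end
end

# Hypothetical HTTP variant; the name makes the transport explicit so other
# kinds of health checks could be added later without renaming this one.
class HttpHealthCheck
  attr_reader :address, :port

  def initialize(address:, port:)
    @address = address
    @port = port
    @running = false
  end

  def run
    @running = true # a real implementation would start an HTTP listener here
  end

  def quit
    @running = false
  end

  def running?
    @running
  end
end

# Hypothetical configuration holder: never hands out nil.
class Configuration
  attr_reader :health_check

  def initialize(options = {})
    http = options[:http_health_check]
    @health_check = http ? HttpHealthCheck.new(**http) : NoopHealthCheck.new
  end
end

# Callers can invoke the check unconditionally, with no nil guard or `&.`:
Configuration.new.health_check.run
```

Because the default object responds to the same methods, calling code stays identical whether or not a health check is configured.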

I can leave more specific comments in the code, if that is helpful. Sorry I didn't get back to this for months and months (new baby).

stanhu commented 2 years ago

@tpitale Congrats on your new arrival! I've updated this pull request; let me know what you think.