mojombo / god

Ruby process monitor
http://godrb.com
MIT License

We need a "down" state #156

Open · giddie opened this issue 10 years ago

giddie commented 10 years ago

I find it hard to believe this hasn't come up before: I need god to keep some processes alive and not others. For those that it doesn't need to keep alive, I'd like it to tell me whether they're up or down. However, god currently doesn't support a "down" state, or anything analogous. I'm genuinely confused as to why this wasn't in god from the beginning.

sfgeorge commented 10 years ago

I believe that the unmonitored state is somewhat analogous, but a bit more ambiguous.

Executing god stop my-watch ends the watched process and puts it in the "unmonitored" state.

Similarly, god unmonitor my-watch puts it in the "unmonitored" state, but without ending the process.

In both cases, god's unmonitored state can be interpreted as "The process could be down, it could be up. god is not currently watching it at all."

One can reference the pid file for their watch and manually run ps -p $(cat /var/run/my-service.pid) as a weak guarantee that a process with that PID is currently running. Just beware that god will not delete a pid file when you ask it to stop a service.
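
If you want to script that check without shelling out to ps, a minimal Ruby sketch (assuming the same hypothetical pid file path) looks like this:

pid = File.read("/var/run/my-service.pid").to_i
begin
  Process.kill(0, pid)   # signal 0 probes the process without affecting it
  puts "pid #{pid} is running (or at least some process has claimed it)"
rescue Errno::ESRCH
  puts "pid #{pid} is not running (possibly a stale pid file)"
rescue Errno::EPERM
  puts "pid #{pid} exists but belongs to another user"
end

It carries the same weak-guarantee caveat as ps: PIDs get recycled.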

I agree that a down state would be useful. unmonitored is too ambiguous, as it covers two use cases.

sfgeorge commented 10 years ago

...I'll also say that I vote for a state name of "stopped". When I read "down" I think "Was this a planned or unplanned down?". I feel that "stopped" tells me "This is because I told it to stop."

donovanbray commented 10 years ago

If a service is flapping it can be put in an unmonitored state, which to me is different from an administratively stopped state. So +1 on a stop state from me, and I also agree with 'stopped' instead of 'down' as the name.

giddie commented 10 years ago

I agree that "stopped" is clear, but there's a need for consistency: we have "start", which leads to "up", and "stop", which ought to lead to "down". Otherwise, "up" should become "started" to match "stopped".

I admit I find myself confused by god: it's a state machine, and state transitions are largely user-defined, and yet the state names are not. An arbitrary state machine would be great, and a process state tracker (à la systemd) would also be good, but god seems to be stuck somewhere in the middle at the moment. I'm sure the kinks will be ironed out :)
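
For readers unfamiliar with god's DSL, this is roughly what the user-defined transitions look like (a sketch based on the documented memory_usage example; the watch name and start command are illustrative):

God.watch do |w|
  w.name  = "my-service"                  # illustrative
  w.start = "/usr/local/bin/my-service"   # illustrative

  # The *edges* of the state machine are user-defined...
  w.transition(:up, :restart) do |on|
    on.condition(:memory_usage) do |c|
      c.above = 150.megabytes
      c.times = [3, 5]   # 3 of the last 5 checks
    end
  end
end

# ...but the *vertices* (states such as :init, :up, :start, :restart,
# :unmonitored) are fixed by god itself, which is the asymmetry above.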

skull-squadron commented 10 years ago

Just stumbled on this while hacking foreman_god to support tmp/stop.txt.

:+1: on a temporary, maintenance-mode state that can transition back to up. It's unclear whether the existing stop is a permanent state; it seems like it is, but ICBW.

I [2014-02-25 14:41:09]  INFO: ...-web-1 sent SIGTERM
I [2014-02-25 14:41:10]  INFO: ...-web-1 process stopped
I [2014-02-25 14:41:10]  INFO: ...-web-1 moved 'up' to 'stop'
(refuses to transition from 'stop' to 'up')

What might help are specific, clear, stipulative definitions of what the user wants versus the observed behavior. Maybe an ASCII transition table: state, what the user wants, what's observed. Either way, it looks like I'm going to be hacking god too. :)

skull-squadron commented 10 years ago

Update: got something ghetto working. Running rake test to see what it breaks.

skull-squadron commented 10 years ago

Btw, here's a hack to donkey punch god right now:

# Monkey-patch: add :stop to the list of states god will accept.
::God::Watch::VALID_STATES << :stop

giddie commented 10 years ago

How does :stop as a state interact with "stop" as an action? This is the stuff that confuses me: when you define a transition to :stop, it doesn't currently transition to a :stop state; it performs the "stop" action and transitions to a different state (:unmonitored, usually). The "stop" action really ought to be an action that happens when transitioning from :up to :down (or :running to :stopped). With a :stop state, does the process stay in :stop after it's stopped? What happens if it fails to stop?

donovanbray commented 10 years ago

If you issue an unmonitor command it transitions to unmonitored. So it would seem to me to be more natural that issuing a stop transitions to stopped.

In the case of flapping, it would seem we then have two choices: configure the check to go unmonitored with the possibility of becoming monitored again after a timeout, or configure it to go to stopped, where there wouldn't be a possibility of it becoming monitored again. I'm not sure operationally that's how I would choose to deal with a flap, but I would rather have the option than not.

There are some situations where you only want to retry starting a service once; subsequent flapping restarts may carry more risk than you are willing to tolerate.
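
For reference, god's documented flapping condition already expresses the first choice (go unmonitored, with an optional retry after a timeout); roughly:

w.lifecycle do |on|
  on.condition(:flapping) do |c|
    c.to_state     = [:start, :restart]
    c.times        = 5             # five start/restart transitions...
    c.within       = 5.minutes     # ...within five minutes means flapping
    c.transition   = :unmonitored  # give up and stop watching
    c.retry_in     = 10.minutes    # optionally try monitoring again later
    c.retry_times  = 5
    c.retry_within = 2.hours
  end
end

Omitting the retry_* settings makes the give-up permanent until someone intervenes, which approximates the second choice, albeit via :unmonitored rather than an explicit :stopped.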

skull-squadron commented 10 years ago

The :stop hack keeps an app down, which is what was desired for a particular use case (adding a passenger-like tmp/stop.txt to god foreman).

donovanbray commented 10 years ago

I would additionally like to see the stopped state persist through god restarts. If I have something administratively down, I want it to stay down. I watch god with Upstart (with respawn) because at one time I had a problem with god occasionally segfaulting. If I stop something and god segfaults or otherwise disappears (perhaps to reload the config), I don't want the administratively down apps restarting themselves. When they were unmonitored this was acceptable, because unmonitored is a looser contract than administratively down.

The current behavior does cause me issues. My app servers monitor nginx, unicorn, and resque. When a problem happens with the code, I have to fix both unicorn and resque; the easiest way for me to pull this particular app server out of the front-end load balancer is to shut down nginx: god stop nginx.

Next I need to stop god itself so I can fix the processes without god interfering: service god stop (it runs under Upstart with respawn). Now I'm free to kill everything, clean up, and undo the damage. The easiest way to bring back the bevy of monitored processes is to start god back up (also because I want to guarantee god has reloaded its config, in case that was part of the problem). Here's where unmonitored bites me: nginx is restarted automatically when I issue service god start, and it starts within a second or so; however, the unicorn servers take a good 120 seconds to spin up, so during that time some of my HTTP requests are being eaten and throwing errors. (I've mitigated this by having my front-end LB retry 502 responses on another upstream.) Still, I don't like the behavior, because it could be better mitigated by nginx staying down even after a god restart. That would allow me to verify the unicorns are good to go, and I could choose when to open this app server's front door.

sfgeorge commented 10 years ago

I suggest that these be two completely separate feature requests:

  1. Add a stopped/down state (and potentially a "stopping" state)
  2. Support state persistence when god is restarted

donovanbray commented 10 years ago

From a development standpoint I would agree; however, you really will take admins by surprise if they THINK stopped/down means it will stay stopped, and that's not true if you need to restart god to pick up a new configuration.

sfgeorge commented 10 years ago

If you need to reload configuration, why not just invoke god load <config-file>?

donovanbray commented 10 years ago

Because god load doesn't remove watches that have been removed from the config, our scripts always force a restart of the god process.

sfgeorge commented 10 years ago

That actually highlights one of the tricky issues with persisting state: on startup, god would have to deal with a persisted state table that includes watches no longer defined in the watch config, and vice versa. That doesn't sound like a small task.

For the time being, is it feasible to run a series of god remove <task or group name> commands following a god load ...?

I still maintain that we are discussing two distinct features here. Let's not muddle them together.

skull-squadron commented 10 years ago

Persistence can be defined by a user in a god config using poll checker classes. See foreman_god's god_config.rb for how to implement this. Note: there's an unfiled bug where it goes :unmonitored > :up > :stop if tmp/stop.txt exists (it should be :unmonitored > :stop).
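
For anyone wanting the shape of that approach, here is a minimal sketch of a custom poll condition keyed off tmp/stop.txt (illustrative only, not the actual foreman_god code; it also assumes the :stop state hack from earlier in the thread):

module God
  module Conditions
    # Hypothetical condition: fires while tmp/stop.txt exists.
    class StopFilePresent < PollCondition
      def valid?
        true
      end

      def test
        File.exist?("tmp/stop.txt")
      end
    end
  end
end

# In the watch definition:
w.transition(:up, :stop) do |on|
  on.condition(:stop_file_present) do |c|
    c.interval = 5.seconds
  end
end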

eric commented 10 years ago

Because god load doesn't remove watches that have been removed from the config, our scripts always force a restart of the god process.

That is no longer the case (I'm not sure whether it ever was). There is a new-ish optional parameter to specify what should happen to watches that have been removed. I chose to have the watches stopped when they are removed.

donovanbray commented 10 years ago

It most definitely was the case in the past; I've been using god for a very long time, and my scripts and god configs have been stable since I made those adjustments. I like the idea of maybe using a poll class to keep something down (with or without the addition of this feature). However, that's really beside the point.

I've lived with god as-is for a long time; I don't NEED a down state (which is probably why god wasn't created with one). My point, and my example, was to show that if you are creating a new stopped/down state, users, particularly those unfamiliar with the history of god, will have a different expectation. Like I said before, something going to unmonitored now and being started after a god restart is forgivable; unmonitored doesn't carry as strong an implicit expectation as stopped does.

Something going from stopped to started just because god restarted is going to catch people off guard. It's a semantics thing, and it's a lesson I've learned many times doing development: put things where people expect them, and make them behave the way a naive user would expect; otherwise you will have problems when you surprise people.

What I believe is that a naive user will likely expect that something in a stopped state will stay stopped, god restart or no. Naive users won't yet know about setting up pollers, or even consider the ramifications of a god load or god restart. What they will do is file a bug report when the behavior doesn't meet their expectations.

Ultimately, whoever is writing the code wins, and this is the last time I will mention it: if you are going to the effort of adding a down/stopped state, then add the extra functionality that actually makes the feature more valuable than just keeping things as they are and living with unmonitored.

giddie commented 10 years ago

As far as I can tell, god is capable of figuring out the state of any given Watch from its pid file. There's no reason for god to auto-start anything unless it's configured to restart it on failure. If god starts and notices that a process is :stopped, it has no reason to :start it unless there is a :stopped => :start transition defined. The watch will remain in the :stopped state until it is manually started.

Similarly, if god starts and discovers that a process is :running, it has no reason to stop it, but should simply report its state, and start to act according to the state transitions that have been defined (e.g. monitoring RAM usage).
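
In pseudocode, the proposed startup behavior would be something like the following (a sketch of the proposal, not current god behavior; initial_state is a hypothetical helper):

# Derive a watch's initial state from its pid file; only act on it
# if a transition out of that state has actually been defined.
def initial_state(watch)
  pid = watch.pid               # as recorded in the watch's pid file
  return :stopped if pid.nil?
  Process.kill(0, pid)          # probe; raises if no such process exists
  :running
rescue Errno::ESRCH
  :stopped
end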

eric commented 10 years ago

With the improved behavior around god load, it may be that restarting god on a regular basis isn't necessary anymore. Maybe it would be worth documenting best practices related to this sort of stuff...

Does monit or upstart or any of those record persistent state for whether you stopped or started a service (separate from the state of being disabled or enabled)?

skull-squadron commented 10 years ago

This :horse: has left the :house:

The couple of decades of sysadmin in me advises against change for change's sake, or against taking on every possible edge case that can be solved with a simple hack.

In a VPS scenario (say, a Rack app PaaS provider), customers would not have access, so passenger-like behavior (restart.txt / stop.txt) or a successful git push is desirable. In that case, something like custom poll conditions in the god config would work.

If it's a bare-metal shop running its own gear, that's an entirely different set of use cases. I'm a big fan of simple, well-defined behavior, so we use runit.

A few ways to set services on/off permanently:

Daemontools/Runit: touch a down file in the service directory (e.g. /etc/service/my-service/down) so the supervisor won't auto-start it; sv down my-service (or svc -d for daemontools) keeps it down until told otherwise.

Upstart: add manual to an override file (echo manual >> /etc/init/my-service.override) so the job starts only on explicit request.

Can we close this?

giddie commented 10 years ago

Huh? I think I missed something: why would we close this? It sounds like you're saying it's basically not worth the effort because it's easier to hack around. That may be practical with my sysadmin hat on (it's what I've had to do), but it's not something we should be content with if we want the software to improve.

Some random thoughts and further ramblings on my use case:

I don't know upstart, but I know Monit and systemd. Monit will persist the "monitored" state of a process. Systemd persists all of its state by design when it is itself restarted, although bear in mind that it considers a system reboot to be a change of "target", which is a totally different thing.

Monit has several available actions when a service goes down, including e-mail alerts and restarting the service. For some processes, I just want to be notified if RAM consumption is excessive, or if filesystem usage reaches a certain threshold. However, it doesn't deal well with a dynamic pool of workers, such as for Resque.

Systemd can handle dynamic pools better, but has almost no support for alerts or for monitoring RAM usage, etc. It does have keep-alive functionality, though.

God makes dynamic pools crazy easy, but in my opinion it could do with some work on its process state and transition model to make it more predictable and logical. It's also not clear to me whether it's aimed at monitoring and alerts (like Monit) or at process management (like systemd).
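
(For context, the pool pattern in god is just a loop over watch definitions; a sketch with illustrative names and paths:)

3.times do |num|
  God.watch do |w|
    w.name  = "resque-#{num}"
    w.group = "resque"                # so the pool can be managed as one unit
    w.dir   = "/var/www/app"          # illustrative app root
    w.start = "bundle exec rake resque:work QUEUE=*"
    w.keepalive                       # restart the worker if it dies
  end
end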

At the moment, all the processes required for my app (resque workers and redis) are managed by god running as my app's user; god itself is started and kept alive by systemd and monitored by monit (so I can be alerted quickly if something breaks and god can't start). Sounds complex, but it actually makes things easier for me.

celesteking commented 9 years ago

What daemontools? We're Enterprise with RHEL. Serious business, not some Ubuntu toy.