kelseyhightower / confd

Manage local application configuration files using templates and data from etcd or consul
MIT License

Add panic button #401

Open therc opened 8 years ago

therc commented 8 years ago

Allow users to define a key name that, in an emergency, prevents all updates cluster-wide.

(From chapter 4 of Automation Gone Wild.)

bacongobbler commented 8 years ago

@therc can you please clarify this use case? What's wrong with killing the running process and restarting that would require such a scenario?

therc commented 8 years ago

If you have ten or a hundred machines running confd, killing and restarting processes manually is going to be a headache.

The use case is that you need a red button to stop all automation when you suspect it of being the cause of, or a trigger for, an ongoing outage. This is standard procedure for services and configuration management at scale. Maybe your tools are dynamically configuring load balancers, maybe they're pushing quotas for your customers or throttling them, maybe something else. But before you investigate whether the bug is at the source, downstream or somewhere in the middle, which could take a long time, you first want to stop further damage from occurring.

You could even get fancy and allow the key to be at different levels in the hierarchy, to be selective in what changes you want to inhibit, but that's a bonus.
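To picture the hierarchical variant: before applying a change under some key prefix, a check could look for a panic flag at every ancestor level of that prefix. The sketch below only computes that list of levels. The `/panic` key name and the `panic_key_candidates` helper are assumptions made for illustration; confd provides neither.

```sh
#!/bin/sh
# Hypothetical helper: print every level at which a panic key could live
# for a given prefix, from the cluster-wide root down to the prefix itself.
# The "/panic" key name is an assumption, not a confd convention.
panic_key_candidates() {
    rest=${1#/}          # drop the leading slash
    rest=${rest%/}       # drop a trailing slash, if any
    path=""
    echo "/panic"        # the cluster-wide button
    while [ -n "$rest" ]; do
        part=${rest%%/*}             # next path component
        path="$path/$part"
        echo "$path/panic"
        case $rest in
            */*) rest=${rest#*/} ;;  # more components remain
            *)   rest="" ;;          # that was the last one
        esac
    done
}

panic_key_candidates /services/web/frontend
```

A flag set at any printed level would then inhibit changes for everything beneath it, while the rest of the tree keeps updating.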

bacongobbler commented 8 years ago

I have a feeling that this is out of scope for confd. I'm assuming that if (hypothetically) you have hundreds of machines running confd, you likely have some form of scheduler that can stop these processes for you, which gives you a way to stop said job. For example, Kubernetes has the concept of a replication controller that works across a set number of replicas for a given job. Your big red button in that case is to scale down to zero, then back up as you see fit.

In other words, I don't see how this solution would be useful in the real world. Companies that deploy software use schedulers and/or monitors to deploy and manage jobs across machines. A scheduler would be your "big red button" in this case, which would kill the running process and restart it as needed. Something like a "big red button" to halt a number of confd processes from templating sounds like a clustering problem that should be solved outside the system, not within it.

therc commented 8 years ago

A scheduler might work if you're deploying confd on its own. But even then, it's a very blunt instrument. Let's ignore the fact that mass terminations will add a lot of noise to logs (the last thing you need when troubleshooting live) or that the central scheduler and the local node manager are additional sources of latency, unpredictability and points of failure on top of your distributed lock service.

At the moment, I'm more interested in the standalone setup, but in your Kubernetes example, I might have confd running as a sidecar container in the same pod as a backend. I don't know of a way to terminate just confd without taking down the whole service. I want to stop the bleeding, not kill the patient altogether. :-)

I have worked at companies in the real world that use very large scheduling and monitoring systems. We still used big red buttons. I can probably come up with at least 4-5 services you interact with daily whose operations rely on multiple panic buttons. I implemented a few and used even more of those. When you are serving thousands or millions of requests per second, you have to have them, document them and test them regularly, for every layer of automation. Taking down jobs was always the very last and desperate resort.

It would be nice to have a simple and easy way to always practice safe automation with confd. Without it, users will roll their own kludgy solution or won't bother. I don't know which is worse.

kelseyhightower commented 8 years ago

@therc Thanks for opening this issue. Would an authenticated HTTP endpoint to pause confd work?

therc commented 8 years ago

@kelseyhightower that requires a small HTTP server, plus now you're dealing with authentication. And you need to hit all confd instances, which is not so easy or reliable if you're in the middle of a crashloop fest. I think you can find examples of the pattern I mention (a file in Chubby) if you search the corporate wiki for "make it stop" pages.

crandles commented 8 years ago

What about utilizing check_cmd to make an external API call (maybe to the same backend your templates are sourced from)? Return a non-zero exit code and your config will not be updated.
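A minimal sketch of that idea, under stated assumptions: `PANIC_FETCH_CMD` is a hypothetical hook holding whatever command prints the flag from your backend, and the default file path is likewise made up so the sketch runs without any backend. Nothing here is a confd feature beyond check_cmd itself.

```sh
#!/bin/sh
# Sketch of a check_cmd gate that asks an external source whether the
# panic button is pressed. PANIC_FETCH_CMD is a hypothetical hook: set it
# to whatever command prints the flag from your backend. The default reads
# a local file (also hypothetical) so the sketch runs without a backend.
PANIC_FETCH_CMD=${PANIC_FETCH_CMD:-"cat /etc/confd/panic"}

updates_allowed() {
    # Assumption: a failed lookup counts as "not pressed" (fail open).
    value=$(sh -c "$PANIC_FETCH_CMD" 2>/dev/null) || value=""
    [ "$value" != "1" ]
}

if updates_allowed; then
    echo "updates allowed"
else
    echo "updates blocked"
fi
```

To use it as a check_cmd, the final `if` block would be replaced by a bare `updates_allowed` call so that its status becomes the script's exit status. Whether a failed lookup should allow or block updates is a design choice; failing open is only one option.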

hubo1016 commented 6 years ago

@therc I think there is a way to implement this with confd. You may insert a remark like

# my-panic-button: {{ getv "/globalwide/panicbutton" "0" }}

and use an external script as the check_cmd:

check_cmd = "/usr/bin/my_panic_button.sh {{.src}}"

In the shell script:

#!/bin/sh
# Fail the check (exit 1) when the rendered file says the button is pressed.
grep -q "# my-panic-button: 1" "$1" && exit 1
exit 0

So if the value is changed from 0 to 1, the check_cmd fails and prevents further reloads. After you change it back to 0, everything starts working again.
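Since panic buttons are meant to be tested regularly, the check can be exercised without confd. The sketch below wraps the same grep in a function and runs it against two simulated rendered files; the `panic_check` name, the file names and the file contents are made up for the drill.

```sh
#!/bin/sh
# Stand-in for the check script above, written as a function so it can be
# exercised without confd. Returns 1 (reject) when the marker reads 1.
panic_check() {
    if grep -q "# my-panic-button: 1" "$1"; then
        return 1
    fi
    return 0
}

# Simulate two rendered files: button released (0) and button pressed (1).
tmpdir=$(mktemp -d)
printf '# my-panic-button: 0\nworker_processes 4;\n' > "$tmpdir/released.conf"
printf '# my-panic-button: 1\nworker_processes 4;\n' > "$tmpdir/pressed.conf"

for f in released pressed; do
    if panic_check "$tmpdir/$f.conf"; then
        echo "$f: update allowed"
    else
        echo "$f: update blocked"
    fi
done
rm -rf "$tmpdir"
```

Note that the pattern is a plain substring match, so a value such as 10 would also trip the button; stick to 0 and 1, or tighten the pattern.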