Rock solid test deployments and back-outs

msmark commented 8 years ago

This issue is based on a new feature request I raised in email on the quattor-discuss mailing list last week. It touches many different components, e.g. Aquilon, cdp-listend, ncm-cdispd, ncm-ncd, CAF and lots of NCM components. However, I am creating this umbrella issue to describe the high-level request from which sub-issues may be readily created.

The ultimate goal is: we need rock solid test deployments and rock solid back-outs. More specifically, to provide a way to allow an operator who is managing hundreds or thousands of hosts in a domain to:

- Know with confidence what effects deploying a change will have, in terms that he or she can understand, before making any live changes.
- Quickly and 100% reliably remove a change if it is deployed and found to cause a breakage.
- Identify the list of changes that have recently been made on any single host, or identify which change caused a particular artefact, in a way that can be easily tracked back to the deployment of a sandbox.
- List all the steps required to back out a change on an individual host, for use in an emergency when no automatic recovery is available, for example when only console access is available.

This can be broken down into the following requirements:

1. aq deploy --source <sandbox> --target <domain> --compile should atomically deploy git changes from a sandbox to a domain and compile the domain. If the compile fails, the git changes are removed. If it succeeds, then when host profiles are sent to each host, the fact that this is a deploy, and the change ID involved, must be communicated to ncm-cdispd. The command returns a unique change ID.
2. aq undeploy --change <id> --compile should atomically remove the git changes identified by the change ID and compile the domain. It is not allowed without the --compile option. When host profiles are sent to each host, the fact that this is an undeploy, and the change ID involved, must be communicated to ncm-cdispd. If this is not the last deploy made to the domain, the command fails and lists the other change IDs that have subsequently been applied. An additional flag --redeploy may be provided, which indicates that all change IDs subsequently applied must be undeployed and then re-deployed once the selected change ID has been removed.
3. Both aq deploy --compile and aq undeploy --compile must support a --test option. This option requests a test deployment, which deploys to (or undeploys from) a copy of the domain, not the live domain. The profiles are compiled and shipped out to every host in the domain with a new flag indicating to ncm-cdispd that this is a test profile only. What the host does when it receives a test profile is discussed below.
4. When ncm-cdispd receives a test profile, as described above, it puts it in a different cache from the one it normally uses for live profiles. Then it runs all NCM components that support NoAction with --noaction, logging output to a test log rather than the normal log. Then it deletes the test profile.
5. All NCM components should wrap their commands inside a CAF::Executor object. The object encapsulates each individual task involved, along with the change ID that links them all together, the actions required to undo the change and a human friendly description. This object becomes a key part of the information recorded by CAF::History but is also used by the component to execute a change. We should also have a CAF::Evaluate object in which we can enclose arbitrary code that makes a change but that cannot be expressed by another CAF method. See the example CAF::Executor object below.
6. CAF::Executor also takes a human friendly description of the change being made, as well as the steps required to undo the change. CAF::Process will additionally need to capture the command needed to undo the change in order to inform CAF::Executor of the same. CAF::FileWriter and CAF::FileEditor can automatically ensure that CAF::Executor has an undo capability by backing up the file before and after it makes any changes (see also point 10 below). If there is no way to undo a change, there should be a way to flag this up to the framework. See below for an example visual representation of a CAF::Executor object.
7. We need an aq show testlog --change <id> or similar command that collects the output of all of the test profile runs from every host that ran --noaction as a result of item 4 above, with a succinct but user readable list of the tasks performed by each component (the CAF::Executor objects). With the --undo flag, it shows the undo commands from the CAF::Executor objects instead. Succinct here means one line per task across all CAF::Executor objects in scope (see the example CAF::Executor object below), but with the ability to drill down into more detail if needed (e.g. a --verbose option).
8. When ncm-cdispd receives a new profile, it records whether this is a live profile or a test profile, and the associated change ID. It passes this information on to ncm-ncd.
9. When ncm-ncd executes components, it records the exact order in which each component is run, together with the change ID. Each CAF::Executor or CAF::History object created during a deploy or undeploy is logged. If doing an undeploy, it computes the appropriate order in which to undo changes based on the CAF::Executor objects that were created during the deploy.
10. The Perl NCM components need to use CAF::Executor objects to wrap every change they make. If a file is being changed, the original file and a copy of the new file are stashed in a different directory. These copies are used by CAF::FileWriter or CAF::FileEditor to check and compute an appropriate rollback during an undeploy.
11. When undeploying a change, ncm-cdispd or ncm-ncd will play back the rollback commands logged in the CAF::Executor objects. It will not expect the NCM component to understand how to revert the state. This is because an NCM component is only good at handling what it thinks is currently in scope. Its view of the world changes as conditional logic within the Perl code routes down different code paths, and as new versions of NCM components are delivered. By recording an exact list of undo commands at the time that the change is made, it can be guaranteed that changes can be successfully reversed even if the NCM component code has subsequently been modified or removed. See the CAF::Executor example below; note that in many cases recording which CAF method was used and its arguments is enough for the history. After a successful undeploy, re-running ncm-ncd with the rolled back (now current) profile should be a no-op.
12. We need a command that can be run on a host to list all changes that were made under a single change ID, and also to list how to undo them manually if required in an emergency.
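The deploy and undeploy rules in the first two requirements can be sketched as a small model. This is only an illustration in Python, not Aquilon code: the `Domain` class, the `UndeployBlocked` exception and the `compile_ok` parameter are hypothetical names, and it assumes that re-deployed changes keep their original change IDs, which the requirements do not specify.

```python
from dataclasses import dataclass, field
from itertools import count


class UndeployBlocked(Exception):
    """Raised when later changes sit on top of the one being removed."""

    def __init__(self, change_id, later):
        super().__init__(
            f"change {change_id} is not the last deploy; later changes: {later}"
        )
        self.later = later


@dataclass
class Domain:
    applied: list = field(default_factory=list)   # change IDs, oldest first
    _ids: count = field(default_factory=lambda: count(1))

    def deploy(self, compile_ok=True):
        """Apply a change and compile; remove the change if the compile fails."""
        change_id = next(self._ids)
        self.applied.append(change_id)
        if not compile_ok:
            self.applied.pop()          # atomic: a failed compile leaves no trace
            return None
        return change_id

    def undeploy(self, change_id, redeploy=False):
        """Remove a change; with redeploy=True, put the later changes back."""
        index = self.applied.index(change_id)
        later = self.applied[index + 1:]
        if later and not redeploy:
            raise UndeployBlocked(change_id, later)
        del self.applied[index:]        # undo the later changes and the target
        self.applied.extend(later)      # re-deploy the later changes (assumed same IDs)
        return later
```

Without `redeploy=True`, removing anything but the most recent change raises `UndeployBlocked`, whose `later` attribute carries the change IDs the operator would need to deal with first.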

Here is an example visual representation of a CAF::Executor object for a component that wants to change a file and then send a HUP signal to a process. You'll see it essentially groups together a bunch of other CAF methods:

    CAF::Executor
        Change ID #123
        Task #1 => {
            execute => {
                object => CAF::FileWriter(<config_file>, <backup_file>)
                desc   => "Add <new_item> to <config_file>"
            }
            undo => {
                object => same CAF::FileWriter (restore <backup_file>)
                desc   => "Restore <backup_file> to <config_file>"
            }
        }
        Task #2 => {
            execute => {
                object => CAF::Process(["pkill", "-HUP", "<process>"])
                desc   => "Send HUP signal to <process>"
            }
            undo => {
                # same as execute
            }
        }
        Task #3 => {
            execute => {
                object => CAF::Evaluate(sub {<arbitrary code>})
                desc   => "Some custom description"
            }
            undo => {
                object => CAF::Evaluate(sub {<arbitrary code>})
                desc   => "Some custom reverse description"
                reverse => 1   # When undoing, tasks marked with this flag are executed
                               # in reverse order
            }
        }
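To make the shape of that object more concrete, here is a minimal Python sketch of the same idea: tasks that carry both an action and a recorded undo, a one-line-per-task report for test runs, and a rollback that replays the recorded undo steps. It is not CAF code (CAF is Perl, and CAF::Executor does not exist yet); `Task`, `Executor`, `report`, `run` and `rollback` are hypothetical names. The `reverse` flag from the example is left out, so undo steps replay in task order, which is the order the file-then-HUP case needs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    desc: str                                   # human friendly description
    execute: Callable[[], None]
    undo_desc: str = ""
    undo: Optional[Callable[[], None]] = None   # None: no known way to undo


class Executor:
    """Groups the tasks of one change ID and remembers how to undo them."""

    def __init__(self, change_id):
        self.change_id = change_id
        self.tasks = []
        self.executed = []                      # tasks actually run, in order

    def add(self, task):
        self.tasks.append(task)

    def report(self, undo=False):
        """One line per task, suitable for a test log."""
        lines = []
        for number, task in enumerate(self.tasks, start=1):
            text = task.undo_desc if undo else task.desc
            if undo and task.undo is None:
                text = f"(no undo available) {task.desc}"
            lines.append(f"change {self.change_id} task {number}: {text}")
        return lines

    def run(self, noaction=False):
        """Execute every task, or only report them when noaction is set."""
        if not noaction:
            for task in self.tasks:
                task.execute()
                self.executed.append(task)
        return self.report()

    def rollback(self):
        """Replay the recorded undo steps without consulting the component."""
        missing = [t.desc for t in self.executed if t.undo is None]
        if missing:
            raise RuntimeError(f"no undo recorded for: {missing}")
        for task in self.executed:
            task.undo()
        self.executed.clear()
```

Because the undo callables are captured when the task is created, `rollback` does not depend on the component still knowing how to revert the state, which is the property the proposal is after.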

msmark commented 8 years ago

Further to this, and in response to the inevitable question "can't we just reapply the old host profile in order to roll back a change?", let me simplify the problem definition by focusing on a single NCM component, bearing in mind that the same reasoning applies easily across multiple components.

A component has CIs that are in scope, and there will be many more that are out of scope. By in scope, I mean specifically that the Perl code in /usr/lib/perl/NCM/Component/<mycomponent>.pm explicitly configures them. As long as all of the changes from one host profile to another touch only CIs that are in scope, then we can (assuming no bugs) move from any host profile to any other host profile just by re-running the component. This is CM utopia, but it is an impractical and unachievable one, and this is why:

1. A CI that was once in scope can subsequently go out of scope. This may manifest itself in a couple of ways: a) conditional logic in the NCM component Perl code puts it out of scope or b) an upgrade or downgrade of the NCM component Perl code puts it out of scope. Once it is out of scope, the Perl code typically ignores it. This is why "rock solid rollback" cannot be implemented within the Perl code itself. We can have helpers within the Perl code, but the larger Quattor framework must be responsible for handling rollback to be able to address items that go out of scope due to a) and b).
2. NCM component complexity: one might argue that number 1 above can be solved within the NCM components themselves given stricter development practices, but the harsh reality is that some components are complex. Putting the onus on every NCM developer to foresee all possible code paths within their module, and to provide a reliable and tested means of getting from any configuration state to any other configuration state, is unrealistic. It is better to give the developer the tools to inform the framework, and have the framework handle the complexity in one place.

If we have independently recorded every change made by every NCM component, in order, indexed by a change number that can be traced back to a single aq deploy, we transcend bugs in NCM components and we transcend the problems of upgrading or downgrading the NCM components as well. If we additionally record both before and after states for CIs, we can determine quite safely whether we have all of the information at our disposal to automatically roll back the deployment of a sandbox. I don't think it would be too hard to get to a point where we could have the framework roll back X number of changes, and then optionally reapply a subset of those changes, in order to unpick a bad change that we need to eject from our environment in an emergency.
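The check described above (do we hold enough information to roll back automatically?) might look something like the following Python sketch. It is an assumption-laden illustration, not a design: `Entry` and `rollback_plan` are hypothetical names, CI state is reduced to a string, and an item is treated as safe to restore only if its current state still matches the recorded "after" state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    change_id: int
    component: str
    item: str                 # the CI that was touched, e.g. a file path
    before: Optional[str]     # state before the change (None: did not exist)
    after: Optional[str]      # state after the change


def rollback_plan(journal, change_id, current):
    """Return (steps, conflicts) for undoing one change ID.

    steps     -- (item, state_to_restore) pairs, most recent change first
    conflicts -- items whose state no longer matches what the journal recorded
    """
    state = dict(current)     # working copy, walked backwards through the change
    steps, conflicts = [], []
    for entry in reversed([e for e in journal if e.change_id == change_id]):
        if state.get(entry.item) != entry.after:
            if entry.item not in conflicts:
                conflicts.append(entry.item)
            continue
        state[entry.item] = entry.before
        steps.append((entry.item, entry.before))
    return steps, conflicts
```

An empty `conflicts` list is the "safe to roll back automatically" case; anything else points at items that were modified outside the recorded change and need a human decision.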

stdweird commented 8 years ago

@msmark thanks for the detailed explanation. i feel a bit reluctant to start typing comments in here :smile:

can you open an issue in ncm-cdispd/ncm-ncd or CCM wrt the test profile part? (if possible clarify the following: does the test profile have some (meta)data in the profile itself to mark it as test-only (in the sense that you would need to compile a different one to use it in production)? if not, how do you propose it can be seen as a test profile?) it looks like an issue we can more easily resolve and factor out from the larger discussion.

my main concern is that this is not the next step for quattor, but already a few steps in the future. i would really like to see the components cleaned up and use more CAF first. from your proposal, it looks like you want to skip all that and wrap it in something magical. it would be nice if we could isolate some piece of work and try to make a proof of concept (ideally something around CAF::FileWriter, or a trivial component like filecopy).

msmark commented 8 years ago

The test profile part is tied in with the grander design. Once test profiles have been dispatched, the end user expects to be able to easily browse the full list of files that would have been changed (with the changed content if verbose output is requested) and the full list of commands that would have been executed on any number of hosts as a result of deploying a sandbox to a domain. This requires more than the current ad-hoc messages that an NCM component outputs in no action mode. It requires a consistent way of defining across all NCM components what actions are to be performed, how and in what order. But once you've solved that problem, you might as well also provide the undo commands, then you get the rest. Which is what I've done.

So it's not so easy to do just test profiles, and not provide a rollback mechanism. In fact, if we did test profiles correctly, providing a rollback mechanism on top of that would be not much extra work.

What I don't currently understand is what CAF::History is supposed to be doing for us. Even reading the perldoc in the source code I'm clueless as to what it is really meant for. If it wasn't for CAF::History I could knock-up a cleanroom implementation of the above idea reasonably easily. But now it seems that a little piece of the functionality I need is already implemented in a way I don't understand. So do I ignore it and code around it (and eventually I guess CAF::History will become redundant), or do I try to understand it and integrate it with this idea?

stdweird commented 8 years ago

@msmark you could already have test profiles that can only trigger --noaction without anything else implemented. once the rest of the proposed framework is in place we can have it run with --noaction and record the tasks etc etc. i consider those 2 separate. in particular, the way the tasks are created and how the recording and play functionality work could also be seen as separate from test profiles: it should work for any profile (and could thus already be added on top of current code).

wrt CAF::History, it is simply an internal list, an API to add entries to this internal list (via an ->event method call via the reporter) and a query interface to determine which of the events in the list one wants to keep or process or whatever. to make it useful, you have to add these event calls in all interesting places (but the hope is/was that this would not be needed outside CAF, and the scope was to be able to rollback FileWriter/Editor changes and improve overall reporting; certainly not what you propose)

so you can certainly ignore CAF::History if you think it's not sufficient/too complicated/not well suited.
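As a rough picture of the "internal list plus add and query API" shape described above, here is a minimal Python sketch. It is not the CAF::History API (that is a Perl module with its own method signatures); the `History` class, its `query` method and the event fields are assumptions made only for illustration.

```python
class History:
    """An append-only event list with a small query interface."""

    def __init__(self):
        self._events = []

    def event(self, source, **metadata):
        """Record one event and return its position in the list."""
        index = len(self._events)
        self._events.append({**metadata, "index": index, "source": source})
        return index

    def query(self, keep):
        """Return the events for which keep(event) is true, oldest first."""
        return [event for event in self._events if keep(event)]
```

For example, `history.query(lambda e: e["source"] == "FileWriter" and e.get("changed"))` would pick out only the file changes, which is roughly the rollback and reporting use case mentioned above.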

msmark commented 8 years ago

I see what you mean. Yes, the test profile idea will run components with --noaction, and a representation of the tasks that would have been performed needs to be reported back. Yes, I agree that it could be separated because, although the current output format is inconsistent, it would be something, and once we'd got around to changing the components to use a more consistent format, that could be sent instead. It would then be useful, as you say, on top of current code as well as compatible with a future design. This seems to me a reasonable way of breaking up the work into a number of discrete steps.

msmark commented 8 years ago

However, that said, even the test profile has Aquilon components to it. For example, being able to atomically deploy and compile in a single step, and get back the results. Of points 3, 4 and 7 in the issue description, only point 4 relates to CCM.

So I could create one Aquilon issue covering points 3, 4 and 7 so that we could discuss the larger implications of supporting test profiles first? Before separating these issues into subtasks?

stdweird commented 8 years ago

@msmark yes, try to factor out the aquilon specific bits. similarly, you seem to want to communicate with the host via ncm-cdispd. you should also open an issue for that in ncm-cdispd to keep track of what kind of communication we should support (currently, it is almost nothing :smile: )

to further simplify the whole picture, i would even consider separating the remote part from the local part: assume we have a test profile in the correct location on the host (somehow), what should happen when we run ncm-ncd --testprofile (or whatever), how do we rollback on the host by hand, etc etc. all the CAF::Executor magic seems to be only relevant to the host or should at the very least work locally on the host, without any interaction from a remote site.

msmark commented 8 years ago

@stdweird I don't want to communicate with the host in any particular way, it's probably my lack of understanding as to what communication goes on today and exactly where in the stack it happens. So if there is a more logical place for it, please let me know where it is. At the moment, I am guessing :smile:

Separating the remote and local parts also seems like a reasonable approach. You're quite right that there must be a way of doing all of these tasks individually by hand on a host, as well as have Aquilon orchestrate the tasks over many thousands of hosts.

msmark commented 8 years ago

@stdweird looking into this a bit further today, I don't think we need any changes to ccm-fetch or to the data transferred on-the-wire. Aquilon will maintain test profiles in a different location and they will therefore be accessible via a different URL. Today we run ccm-fetch via cron every 60 seconds to fetch from a URL to a local cache directory. We could have a second cron job that ran ccm-fetch --cfgfile=/etc/ccm-test.conf and inside /etc/ccm-test.conf we specify the alternative URL from where to obtain test profiles and where to write them (e.g. /var/lib/ccm-test).

So then the question becomes, do we modify ncm-cdispd to monitor the two different cache directories for incoming profiles (treating the test area separately), or do we spin up a second, independent ncm-cdispd process with the test options to get it to handle incoming test profiles? I suppose the latter approach might get tripped up by locks, but we could fix that.

msmark commented 8 years ago

@stdweird Ah, my mistake, ccm-fetch is run once an hour. It is actually cdp-listend that receives the CDP notification that launches ccm-fetch when a new profile has been generated for the host.

In which case, I think cdp-listend will need to differentiate between live vs. test profiles via the notification type and launch ccm-fetch with a new option if it is a test profile to tell it to fetch from an alternate URL and deposit in an alternate directory, then ncm-cdispd can monitor both locations and act accordingly.

That would be a good start, I'll raise some separate issues for that piece of the puzzle. I haven't decided yet how to get the results back to a central location, probably need to post them back to another URL, the reverse of ccm-fetch I suppose (ccm-post?) but what triggers that?
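The routing described above could be sketched as follows, in Python purely for illustration. The "live" and "test" notification type names are hypothetical, the test configuration is passed with the --cfgfile option and /etc/ccm-test.conf path suggested earlier in this thread rather than a new ccm-fetch option, and `fetch_command` and `profile_kind` are made-up helper names, not part of cdp-listend or ncm-cdispd.

```python
# Hypothetical notification types, not existing cdp-listend names.
FETCH_CONFIG = {
    "live": None,                     # use ccm-fetch's default configuration
    "test": "/etc/ccm-test.conf",     # alternate URL and cache directory
}


def fetch_command(notification_type):
    """Build the ccm-fetch command line for one incoming notification."""
    if notification_type not in FETCH_CONFIG:
        raise ValueError(f"unknown notification type: {notification_type!r}")
    command = ["ccm-fetch"]
    cfgfile = FETCH_CONFIG[notification_type]
    if cfgfile is not None:
        command.append(f"--cfgfile={cfgfile}")
    return command


def profile_kind(cache_dir, test_cache="/var/lib/ccm-test"):
    """Tell a watcher whether a new profile in cache_dir is live or test."""
    return "test" if cache_dir.rstrip("/") == test_cache else "live"
```

Keeping the live path free of extra options means existing hosts would behave exactly as before unless a test notification arrives.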

stdweird commented 8 years ago

sending the results can be handled by ncm-ncd, esp if you implement https://github.com/quattor/ncm-ncd/issues/49

msmark commented 8 years ago

@stdweird Thanks, nice suggestion.

Btw, as this touches various areas I need a name to refer to the whole piece. Unless there is an objection, I'd like to adopt the name Project Igneous to refer to this whole lofty goal of "Rock solid test deployments and back-outs". It means when I raise issues in various places and refer them each back to this issue, I can do so in a concise manner.

stdweird commented 8 years ago

@msmark or paste the url of this issue in any comment (or the description), and this github issue will show all issues and/or PRs that reference it.

jouvin commented 8 years ago

I agree with @stdweird that this is probably a better way to reference the discussion in a useful way...

quattor / aquilon

Rock solid test deployments and back-outs #34