Open msmark opened 8 years ago
Further to this, and in response to the inevitable question "can't we just reapply the old host profile in order to rollback a change?", let me simplify the problem definition by drawing focus on a single NCM component, but bear in mind that we will be able to apply this easily across multiple components.
A component has CIs that are in scope, and there will be lots more out of scope. By in scope, I mean specifically that the Perl code in `/usr/lib/perl/NCM/Component/<mycomponent>.pm` explicitly configures it. As long as all of the changes from one host profile to another host profile touch CIs that are in scope, then we can (assuming no bugs) move from any host profile to any other host profile just by re-running the component. This is CM utopia, but it is an impractical and unachievable one, and this is why:
If we have independently recorded every change made by every NCM component, in order, indexed by a change number that can be traced back to a single `aq deploy`, we transcend bugs in NCM components and we transcend the problems of upgrading or downgrading the NCM components as well. If we additionally record both before and after states for CIs, we can determine quite safely whether we have all of the information at our disposal to automatically roll back the deployment of a sandbox. I don't think it would be too hard to get to a point where we could have the framework roll back X number of changes, and then optionally reapply a subset of those changes, in order to unpick a bad change that we need to eject from our environment in an emergency.
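To make the journal idea concrete, here is a minimal sketch, written in Python purely for brevity (the real thing would live in Perl alongside CAF). The `Change` record, the `plan_rollback` function and the example paths are all hypothetical names, not existing Quattor code. It simulates the rollback against the recorded before and after states and refuses to produce a plan when those states do not line up with what is currently on the host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One recorded change to a CI (hypothetical record layout)."""
    change_id: int  # traced back to a single `aq deploy`
    ci: str         # the configuration item, e.g. a file path
    before: str     # state of the CI before the change
    after: str      # state of the CI after the change


def plan_rollback(journal, current, bad_id, reapply=True):
    """Plan the ejection of change `bad_id` from an ordered journal.

    Undoes every change from `bad_id` onwards, newest first, then optionally
    reapplies the later ones. Works on a copy of `current`, so nothing is
    touched. Raises ValueError if the recorded states are not sufficient.
    """
    state = dict(current)
    tail = [c for c in journal if c.change_id >= bad_id]
    steps = []
    for c in reversed(tail):
        if state.get(c.ci) != c.after:
            raise ValueError(f"{c.ci} has drifted since change {c.change_id}")
        state[c.ci] = c.before
        steps.append(("undo", c.change_id, c.ci))
    if reapply:
        for c in tail:
            if c.change_id == bad_id:
                continue  # this is the change being ejected
            if state.get(c.ci) != c.before:
                raise ValueError(
                    f"change {c.change_id} on {c.ci} builds on the ejected change")
            state[c.ci] = c.after
            steps.append(("redo", c.change_id, c.ci))
    return steps, state


# Example journal: three deploys, the second one turns out to be bad.
journal = [
    Change(1, "/etc/a.conf", "a0", "a1"),
    Change(2, "/etc/b.conf", "b0", "b1"),
    Change(3, "/etc/a.conf", "a1", "a2"),
]
current = {"/etc/a.conf": "a2", "/etc/b.conf": "b1"}
steps, final = plan_rollback(journal, current, bad_id=2)
```

In this example change 3 is undone and then reapplied around the ejected change 2. Asking to eject change 1 instead raises an error, because change 3 was recorded on top of the state that change 1 produced and cannot be blindly replayed.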
@msmark thanks for detailed explanation. i feel a bit reluctant to start typing comments in here :smile:
can you open an issue in ncm-cdispd/ncm-ncd or CCM wrt the test profile part? (if possible clarify the following: does the test profile have some (meta)data in the profile itself to mark it as testonly (in the sense that you would need to compile a different one to use it in production?) if not, how do you propose it can be seen as a test profile?) it looks like an issue we can more easily resolve and factor out from the larger discussion.
my main concern is that this is not the next step for quattor, but already a few steps in the future. i would really like to see the components cleaned up and use more CAF first. from your proposal, it looks like you want to skip all that and wrap it in something magical. it would be nice if we could isolate some piece of work and try to make a proof of concept (ideally something around CAF::FileWriter, or a trivial component like filecopy).
The test profile part is tied in with the grander design. Once test profiles have been dispatched, the end user expects to be able to easily browse the full list of files that would have been changed (with the changed content if verbose output is requested) and the full list of commands that would have been executed on any number of hosts as a result of deploying a sandbox to a domain. This requires more than the current ad-hoc messages that an NCM component outputs in no action mode. It requires a consistent way of defining across all NCM components what actions are to be performed, how and in what order. But once you've solved that problem, you might as well also provide the undo commands, then you get the rest. Which is what I've done.
So it's not so easy to do just test profiles, and not provide a rollback mechanism. In fact, if we did test profiles correctly, providing a rollback mechanism on top of that would be not much extra work.
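As a rough illustration of why the two come together, here is a minimal sketch, in Python for brevity rather than Perl. `Task`, `Plan`, the file paths and the `exampled` daemon are hypothetical names and not part of CAF. Once every action is declared as data with a description, a command and an undo command, the no-action report and the rollback listing are just two views of the same list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    description: str            # human friendly, fits on one line
    do: List[str]               # command (argv) that makes the change
    undo: Optional[List[str]]   # command that reverses it, None if irreversible


@dataclass
class Plan:
    change_id: int
    tasks: List[Task] = field(default_factory=list)

    def add(self, description, do, undo=None):
        self.tasks.append(Task(description, do, undo))

    def reversible(self):
        """True only if every task carries an undo command."""
        return all(t.undo is not None for t in self.tasks)

    def report(self, undo=False, verbose=False):
        """One line per task; with undo=True, the rollback in reverse order."""
        tasks = list(reversed(self.tasks)) if undo else self.tasks
        lines = []
        for t in tasks:
            cmd = t.undo if undo else t.do
            line = f"[{self.change_id}] " + ("undo " if undo else "") + t.description
            if cmd is None:
                line += " (no undo available)"
            elif verbose:
                line += ": " + " ".join(cmd)
            lines.append(line)
        return lines


# Nothing is executed here; this is what a no-action run could send back.
plan = Plan(change_id=42)
plan.add("replace /etc/example.conf",
         do=["cp", "/tmp/example.conf.new", "/etc/example.conf"],
         undo=["cp", "/var/backup/example.conf", "/etc/example.conf"])
plan.add("send HUP to exampled", do=["pkill", "-HUP", "exampled"])
```

`plan.report()` gives the succinct test output, `plan.report(verbose=True)` adds the commands, and `plan.report(undo=True)` lists the rollback, flagging the task that has no undo.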
What I don't currently understand is what `CAF::History` is supposed to be doing for us. Even reading the perldoc in the source code I'm clueless as to what it is really meant for. If it wasn't for `CAF::History` I could knock up a cleanroom implementation of the above idea reasonably easily. But now it seems that a little piece of the functionality I need is already implemented in a way I don't understand. So do I ignore it and code around it (and eventually I guess `CAF::History` will become redundant), or do I try to understand it and integrate it with this idea?
@msmark you could already have test profiles that can only trigger `--noaction` without anything else implemented. once the rest of the proposed framework is in place we can have it run with `--noaction` and record the tasks etc etc. i consider those 2 separated. in particular, the way the tasks are created and how the recording and play functionality work could also be seen separate from test profiles, it should work for any profile (and could thus already be added on top of current code).
wrt CAF::History, it is simply an internal list and an API to add entries (via a `->event` method call via the reporter) to this internal list and a query interface to determine which of the events in the list one wants to keep or process or whatever. to make it useful, you have to add these events calls to all interesting places (but the hope is/was that this would not be needed outside CAF, and the scope was to be able to rollback FileWriter/Editor changes and improve overall reporting; certainly not what you propose)
so you can certainly ignore CAF::History if you think it's not sufficient/too complicated/not well suited.
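to give a rough feel for the shape of it, a minimal sketch of "an internal list, an API to add entries and a query interface". this is python and purely illustrative: the class, method signatures and field names are made up for the example and are not the actual CAF::History API.

```python
import time


class History:
    """Illustrative event list; not the real CAF::History interface."""

    def __init__(self):
        self._events = []  # the internal list

    def event(self, source, **metadata):
        """Append one entry and return its index in the list."""
        entry = dict(metadata)
        entry.update(id=len(self._events), ts=time.time(), source=source)
        self._events.append(entry)
        return entry["id"]

    def query(self, match):
        """Return the entries for which the callable `match` is true."""
        return [e for e in self._events if match(e)]


history = History()
history.event("FileWriter", filename="/etc/example.conf", changed=True)
history.event("FileWriter", filename="/etc/other.conf", changed=False)
history.event("Process", command="service exampled reload")

# e.g. only the file changes that actually modified something
modified = history.query(lambda e: e["source"] == "FileWriter" and e.get("changed"))
```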
I see what you mean. Yes, the test profile idea will run components with `--noaction` and a representation of the tasks that would have been performed needs to be reported back. Yes, I agree that it could be separated because, although the current output format is inconsistent, it would be something, and once we'd got around to changing the components to use a more consistent format, that could be sent instead. It would then be useful, as you say, on top of current code as well as compatible with a future design. This seems to me a reasonable way of breaking up the work into a number of discrete steps.
However, that said, even the test profile has Aquilon components to it. For example, being able to atomically deploy and compile in a single step, and get back the results. Of points 3, 4 and 7 in the issue description, only point 4 relates to CCM.
So I could create one Aquilon issue covering points 3, 4 and 7 so that we could discuss the larger implications of supporting test profiles first? Before separating these issues into subtasks?
@msmark yes, try to factor out the aquilon specific bits. similarly, you seem to want to communicate with the host via `ncm-cdispd`. you should also open an issue for that in `ncm-cdispd` to keep track of what kind of communication we should support (currently, it is almost nothing :smile: )
to further simplify the whole picture, i would even consider separating the remote part from the local part: assume we have a test profile in the correct location on the host (somehow), what should happen when we run `ncm-ncd --testprofile` (or whatever), how do we rollback on the host by hand, etc etc. all the `CAF::Executor` magic seems to be only relevant to the host or should at the very least work locally on the host, without any interaction from a remote site.
@stdweird I don't want to communicate with the host in any particular way, it's probably my lack of understanding as to what communication goes on today and exactly where in the stack it happens. So if there is a more logical place for it, please let me know where it is. At the moment, I am guessing :smile:
Separating the remote and local parts also seems like a reasonable approach. You're quite right that there must be a way of doing all of these tasks individually by hand on a host, as well as have Aquilon orchestrate the tasks over many thousands of hosts.
@stdweird looking into this a bit further today, I don't think we need any changes to `ccm-fetch` or to the data transferred on-the-wire. Aquilon will maintain test profiles in a different location and they will therefore be accessible via a different URL. Today we run `ccm-fetch` via cron every 60 seconds to fetch from a URL to a local cache directory. We could have a second cron job that ran `ccm-fetch --cfgfile=/etc/ccm-test.conf` and inside `/etc/ccm-test.conf` we specify the alternative URL from where to obtain test profiles and where to write them (e.g. `/var/lib/ccm-test`).
So then the question becomes, do we modify `ncm-cdispd` to monitor the two different cache directories for incoming profiles (treating the test area separately), or do we spin up a second, independent `ncm-cdispd` process with the test options to get it to handle incoming test profiles? I suppose the latter approach might get tripped up by locks, but we could fix that.
@stdweird Ah, my mistake, `ccm-fetch` is run once an hour. It is actually `cdp-listend` that receives the CDP notification that launches `ccm-fetch` when a new profile has been generated for the host.
In which case, I think `cdp-listend` will need to differentiate between live vs. test profiles via the notification type and launch `ccm-fetch` with a new option if it is a test profile to tell it to fetch from an alternate URL and deposit in an alternate directory, then `ncm-cdispd` can monitor both locations and act accordingly.
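Roughly what I have in mind for the dispatch, sketched in Python just to show the shape. The notification type names `live` and `test` are made up, and instead of inventing a new `ccm-fetch` option I have reused the `--cfgfile=/etc/ccm-test.conf` idea from my earlier comment as a stand-in.

```python
# Hypothetical mapping from notification type to the fetch command to launch.
FETCH_COMMANDS = {
    "live": ["ccm-fetch"],
    "test": ["ccm-fetch", "--cfgfile=/etc/ccm-test.conf"],
}


def fetch_command(notification_type):
    """Return the argv to launch for a notification, rejecting unknown types."""
    try:
        return list(FETCH_COMMANDS[notification_type])
    except KeyError:
        raise ValueError(f"unknown notification type: {notification_type!r}") from None
```

Unknown types are rejected rather than treated as live, so a malformed notification can never trigger a real configuration run by accident.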
That would be a good start, I'll raise some separate issues for that piece of the puzzle. I haven't decided yet how to get the results back to a central location, probably need to post them back to another URL, the reverse of `ccm-fetch` I suppose (`ccm-post`?) but what triggers that?
sending the results can be handled by ncm-ncd, esp if you implement https://github.com/quattor/ncm-ncd/issues/49
@stdweird Thanks, nice suggestion.
Btw, as this touches various areas I need a name to refer to the whole piece. Unless there is an objection, I'd like to adopt the name Project Igneous to refer to this whole lofty goal of "Rock solid test deployments and back-outs". It means when I raise issues in various places and refer them each back to this issue, I can do so in a concise manner.
@msmark or paste the url of this issue in any comment (or the description), and this github issue will show all issues and/or PRs that reference it.
I agree with @stdweird that this is probably a better way to reference the discussion in a useful way...
This issue is based on a new feature request I raised in email on the quattor-discuss mailing list last week. It touches many different components, e.g. Aquilon, cdp-listend, ncm-cdispd, ncm-ncd, CAF and lots of NCM components. However, I am creating this umbrella issue to describe the high-level request from which sub-issues may be readily created.
The ultimate goal is: we need rock solid test deployments and rock solid back-outs. More specifically, to provide a way to allow an operator who is managing hundreds or thousands of hosts in a domain to:
This can be broken down into the following requirements:
1. `aq deploy --source <sandbox> --target <domain> --compile` should atomically deploy git changes from a sandbox to a domain and compile the domain. If the compile fails, the git changes are removed. If it succeeds, when host profiles are sent across to each host, the fact that this is a deploy and the change ID involved must be communicated to `ncm-cdispd`. The command returns a unique change ID.
2. `aq undeploy --change <id> --compile` should atomically remove the git changes identified by the change ID and compile the domain. Not allowed without the `--compile` option. When host profiles are sent across to each host, the fact that this is an undeploy and the change ID involved must be communicated to `ncm-cdispd`. If this is not the last deploy made to a domain, the command will fail, listing the other change IDs that have been subsequently applied. An additional flag `--redeploy` may be provided, which indicates that all change IDs subsequently applied must be undeployed and then re-deployed again once the selected change ID has been removed.
3. `aq deploy --compile` and `aq undeploy --compile` must support a `--test` option. This option requests a test deployment. This deploys to (or undeploys from) a copy of the domain, not the live domain. The profiles are compiled and shipped out to every host in the domain with a new flag indicating to `ncm-cdispd` that this is a test profile only. What the host does when it receives a test profile is discussed below.
4. When `ncm-cdispd` receives a test profile, as described above, it puts it in a different cache than it would normally use for live profiles. Then it runs all NCM components with `--noaction` (only NCM components that support `NoAction`), logging output to a test log -- not the normal log. Then it deletes the test profile.
5. `CAF::Executor` object. The object encapsulates each individual task involved, with the change ID that links them all together, as well as the actions required to undo the change and a human friendly description. This object becomes a key part of information recorded by `CAF::History` but is also used by the component to execute a change. We should also have a `CAF::Evaluate` object in which we can enclose arbitrary code that makes a change but that cannot be expressed by another CAF method. See example `CAF::Executor` object below.
6. `CAF::Executor` also takes a human friendly description of the change being made, as well as the steps required to undo the change. `CAF::Process` will additionally need to capture the command needed to undo the change in order to inform `CAF::Executor` of the same. `CAF::FileWriter` and `CAF::FileEditor` can automatically ensure that `CAF::Executor` has an undo capability by backing up the file before and after it makes any changes (see also point 10 below). If there is no way to undo a change, there should be a way to flag this up the framework. See below for an example visual representation of a `CAF::Executor` object.
7. `aq show testlog --change <id>` or similar command that collects the output of all of the test profile runs from every host that ran `--noaction` as a result of item 4 above, with a succinct but user readable list of tasks performed by each component (the `CAF::Executor` objects). With the `--undo` flag it will show the undo commands from the `CAF::Executor` objects instead. By succinct, this means one line per task across all `CAF::Executor` objects in scope (see example `CAF::Executor` object below), but with the ability to drill down into more detail if needed (e.g. a `--verbose` option).
8. When `ncm-cdispd` receives a new profile, it records whether this is a live profile or a test profile, and the associated change ID. It passes this information on to `ncm-ncd`.
9. When `ncm-ncd` executes components, it records the exact order in which each component is run with the change ID. Each `CAF::Executor` or `CAF::History` object created during a deploy or undeploy is logged. If doing an undeploy, it computes the appropriate order to undo changes based on the `CAF::Executor` objects that were created during the deploy.
10. NCM components use `CAF::Executor` objects to wrap every change they make. If a file is being changed, the original file and a copy of the new file are stashed in a different directory. This is used by `CAF::FileWriter` or `CAF::FileEditor` to check and compute an appropriate rollback during an undeploy.
11. `ncm-cdispd` or `ncm-ncd` will play rollback commands logged in `CAF::Executor` objects. It will not expect the NCM component to understand how to revert the state. This is because an NCM component is only good at handling what it thinks is currently in scope. Its view of the world changes as conditional logic within the Perl code routes down different code paths, and as new versions of NCM components are delivered. By recording an exact list of undo commands at the time that the change is made, it can be guaranteed that changes can be successfully reversed even if the NCM component code has been subsequently modified or removed. See the `CAF::Executor` example below; note that in many cases recording which `CAF` method was used and the arguments is enough for the history. After a successful undeploy, re-running `ncm-ncd` with the rolled back (now current) profile should be the same as a no-op.

Here is an example visual representation of a `CAF::Executor` object for a component that wants to change a file and then send a HUP signal to a process. You'll see it essentially groups together a bunch of other `CAF` methods: