rax-maas / dreadnot

deploy without dread
Apache License 2.0

Add parallelLimit to knife.queryAndRun #54

Closed dlobue closed 10 years ago

dlobue commented 10 years ago

Update

See my comment below for more details.


Add the ability to limit the number of concurrent tasks run against the servers knife returns. Also support a function as the value for parallelLimitFunc, so users can limit the number of tasks run in parallel based on the number of servers found.

The primary motivation for this is to make it possible to do a rolling run of chef on servers.

This is fully backwards compatible; parallelLimitFunc is optional.
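
For illustration, a stack might use the option roughly like this. The queryAndRun signature and module path below are assumptions for the sake of the sketch; only the parallelLimitFunc option comes from this PR.

```javascript
// Hypothetical dreadnot stack task; the argument list of queryAndRun is
// assumed for illustration.
var knife = require('util/knife');

exports.task_rolling_chef = function(stack, baton, args, callback) {
  knife.queryAndRun(baton, args, 'roles:webserver', ['sudo', 'chef-client'], {
    // Either a fixed limit, e.g. parallelLimitFunc: 2, or a function of
    // how many servers the knife query found; here, roll through at
    // most a quarter of them at a time.
    parallelLimitFunc: function(numServers) {
      return Math.max(1, Math.ceil(numServers / 4));
    }
  }, callback);
};
```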

Note: if the task being run by queryAndRun fails, then that task will NOT be run on the remaining servers. If the limit given by parallelLimitFunc is more than one, however, the jobs already running against the other servers will finish their runs.

One thing to be aware of: there is no guarantee that jobs will run on servers in the same order from one run to the next. If one run fails, the next run will most likely NOT start on the server that failed last time.
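
To make those semantics concrete, here is a self-contained simulation with hypothetical servers and timings. It uses a modern async that provides eachLimit; per the comment below, the version pinned in dreadnot at the time had no such helper.

```javascript
// With a limit of 2: 'a' and 'b' succeed, 'c' is sabotaged and fails,
// 'd' is already in flight and finishes anyway, 'e' is never started.
var async = require('async');

var servers = ['a', 'b', 'c', 'd', 'e'];

async.eachLimit(servers, 2, function(server, done) {
  console.log('starting chef run on', server);
  setTimeout(function() {
    done(server === 'c' ? new Error('chef run failed on ' + server) : null);
  }, 100);
}, function(err) {
  console.log('finished:', err ? err.message : 'all runs succeeded');
});
```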

dlobue commented 10 years ago

Updated the callback handling. queryAndRun will now abort (i.e. it will not run the given command against any more servers) on failure, but ONLY if the parallelLimitFunc option has been set.

I did this because the version of async we're using does not have a mapLimit function, and I'm told by teammates that newer versions of async may cause breakage in dreadnot. The latest commit is a compromise: it maintains backwards compatibility and adds the necessary ability to abort, without changing a library that is a major underpinning of dreadnot.
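
In the spirit of that compromise, a limited, abortable loop might look roughly like the sketch below. This is one plausible shape of the idea under those constraints, not the actual patch.

```javascript
// Minimal sketch: mapLimit-like concurrency without upgrading async.
// On the first error, stop handing out new work; runs already in
// flight are left to finish on their own.
function forEachLimitAbort(items, limit, iterator, callback) {
  var started = 0, finished = 0, failed = false;

  if (items.length === 0) {
    return callback(null);
  }

  function next() {
    if (failed || started >= items.length) {
      return;
    }
    iterator(items[started++], function(err) {
      finished++;
      if (err && !failed) {
        failed = true;
        return callback(err);
      }
      if (!failed && finished === items.length) {
        return callback(null);
      }
      next();
    });
  }

  // Prime up to `limit` concurrent runs.
  for (var i = 0; i < Math.min(limit, items.length); i++) {
    next();
  }
}
```

The key property matches the behavior described earlier: in-flight runs complete, but nothing new is dispatched once a run has failed.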

Thoughts? If being able to use parallelLimitFunc without aborting turns out to be necessary, let me know and I'll come up with a different solution.

jirwin commented 10 years ago

LGTM +1

Do you have a testing plan in place? There aren't really tests in the DN repo. What can go wrong if this doesn't work right?

dlobue commented 10 years ago

If by testing plan you mean unit tests, no. If you mean a plan for verifying functionality before deploying this code anywhere critical, then yes.

The plan is to deploy it to staging and run a deploy. If that succeeds, run another deploy and sabotage the command being run (e.g. chef-client) mid-run so that it fails, to verify that failure handling remains backwards compatible. I was then planning to use the BF stack (whenever I get the +1s on it) to verify parallelLimitFunc works correctly, and to do a second run, sabotaged like before, to ensure the abort functionality works correctly.

As for what can go wrong if it doesn't work right:

The ramifications of the second two depend, of course, on exactly what command/function is being run.

jirwin commented 10 years ago

Cool. Let's try this out somewhere in staging and then call it good.

dlobue commented 10 years ago

Works in staging.