saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

Question: How to tell when a job has finished #18201

Closed mclarkson closed 9 years ago

mclarkson commented 9 years ago

I am writing a GUI (see https://github.com/mclarkson/obdi) but I'm having trouble finding out when a job has finished, specifically when I do a state.highstate in the background using cmd_async.

The only way I can see is to check that some output was produced, but what about when no output is produced? (such as with issue #17957 - salt always fails the first time after a restart and I get no output, but there is one other case I have encountered)

Someone else suggested having a finish time, #18093 - Show duration in state summary, which would work.

Or is there a reliable way to be able to tell when a job has finished?

cachedout commented 9 years ago

Thanks for this question, Mark.

I'll provide a somewhat high-level overview of job handling in a typical scenario from the perspective of the CLI, since it sounds like that's what you're looking for. If my explanation isn't sufficient, you might also want to refer to the architecture document that describes the flow of a job from sending to receipt: http://docs.saltstack.com/en/latest/topics/development/architecture.html

As it sounds like you are already aware, the jobs are sent to the master and returns from those jobs are received from the master by a class called LocalClient. Roughly speaking, it has a couple of steps that it follows in its typical usage:

1) It sends the intended command to the master for publication. This data includes the minions to be targeted, the function to be run (such as state.highstate), and any arguments to be passed to the function.

2) The master calculates the list of minions from which it expects to receive a response, based on its knowledge of how the target will resolve. It returns this list back to the LocalClient along with a JID [Job ID] that will be used to track the running job.

3) The LocalClient begins to listen to the master's event bus, looking for returns that match the JID associated with the given job.

4) The job is published to all minions and minions hear the publication, examine the targeting data and run the function if they believe themselves to be a match based on the targeting data.

5) The minion returns results back to the master, including the job ID.

6) The LocalClient continues to listen for returns until either a timeout is reached or all expected minions have returned. If the timeout is reached and all minions have not returned, the LocalClient sends out a 'saltutil.find_job' command which tells the minions to report back if they have received the given job but have not yet finished.

7) Results are displayed on the CLI.
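Steps 3 through 6 amount to a wait loop. The sketch below illustrates that loop only and is not Salt's actual implementation: `get_returns` and `still_running` are hypothetical callables standing in for "read returns matching the JID" and "ask minions via saltutil.find_job", and the timeout values are arbitrary.

```python
import time


def wait_for_returns(expected, get_returns, still_running, timeout=5.0,
                     interval=0.5, clock=time.monotonic, sleep=time.sleep):
    """Collect returns until all expected minions answer or none is still busy.

    expected      -- minion IDs the master said it expects (step 2)
    get_returns   -- hypothetical callable: () -> {minion_id: result} seen so far
    still_running -- hypothetical callable: (minion_ids) -> set of IDs that
                     still report the job as running (the find_job role)
    Returns (returns_by_minion, minions_that_never_returned).
    """
    expected = set(expected)
    returned = {}
    deadline = clock() + timeout
    while True:
        returned.update(get_returns())
        missing = expected - set(returned)
        if not missing:
            return returned, set()
        if clock() >= deadline:
            # Timeout reached: ask only the silent minions if they are busy.
            if not still_running(missing):
                # Nobody is working on the job; pick up late arrivals and stop.
                returned.update(get_returns())
                return returned, expected - set(returned)
            # At least one minion is still running the job: keep waiting.
            deadline = clock() + timeout
        sleep(interval)
```

The second element of the result lists minions that neither returned nor reported the job as running, which is the "no output" case raised in the question.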

Now, to your question more specifically -- when running an async command, one common approach is to examine the master job cache for returns. To know what returns to expect, examine the 'minions' key in the dictionary that is returned to you. When calling cmd_async directly, if you wish to implement re-try or 'find_job' logic, it is up to you to do so, because cmd_async is "fire-and-forget". In short, I believe the answer to your question is to implement some features of the full CLI client in your implementation, firing off 'saltutil.find_job' commands to minions that have not yet returned and exiting after a given period without a response.
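As a rough illustration of the job-cache approach, the sketch below publishes a job and later compares the cache against the expected minions. It assumes a client whose `run_job` behaves like `salt.client.LocalClient.run_job` and returns a dictionary with 'jid' and 'minions' keys (worth verifying against your Salt version). `lookup_jid` is a hypothetical callable that reads the master job cache, for example backed by the jobs.lookup_jid runner.

```python
def publish(client, tgt, fun, arg=()):
    """Publish a job without waiting and return (jid, expected_minions).

    client -- object with run_job(tgt, fun, arg); assumed to return a dict
              containing 'jid' and 'minions', as LocalClient.run_job is
              believed to do
    """
    pub = client.run_job(tgt, fun, arg)
    if not pub or "jid" not in pub:
        raise RuntimeError("publish failed: no JID returned")
    return pub["jid"], set(pub.get("minions", []))


def pending_minions(jid, expected, lookup_jid):
    """Return the expected minions that have no entry in the job cache yet.

    lookup_jid -- hypothetical callable: (jid) -> {minion_id: result}
    """
    cached = lookup_jid(jid) or {}
    return set(expected) - set(cached)
```

With a real master this might be called as `publish(salt.client.LocalClient(), '*', 'state.highstate')`, then `pending_minions` polled on a timer until it returns an empty set or a deadline passes.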

Hope that helps.

mclarkson commented 9 years ago

Many thanks for the detailed reply Mike.

Am I right in saying 'saltutil.find_job' will only find a job if it is currently running?

So does the following logic sound sane to you?

  1. Do a state.highstate on the master
  2. Poll for result using 'salt-run jobs.list_job' periodically on the master
  3. After a number of polls check the minion using 'salt minion saltutil.find_job'
     a. If the job is not listed then it's finished.
     b. Check the master again using list_job.
     c. If 3a and 3b are empty then the job finished - probably failed (no output)

cachedout commented 9 years ago

Yes, that sounds viable to me. :]
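The three-step check proposed above could be sketched per minion as follows. `list_job` and `find_job` are hypothetical callables wrapping 'salt-run jobs.list_job' and 'salt minion saltutil.find_job'; their signatures and the status strings are illustrative only.

```python
def check_job(jid, minion, list_job, find_job):
    """Classify a job on one minion as finished, running, or finished without output.

    list_job -- hypothetical callable: (jid, minion) -> cached result or None
    find_job -- hypothetical callable: (minion, jid) -> True if the minion
                reports the job as currently running
    """
    # Step 2: a result in the master job cache means the job is done.
    if list_job(jid, minion) is not None:
        return "finished"
    # Step 3a: no result yet, so ask the minion whether it is still working.
    if find_job(minion, jid):
        return "running"
    # Step 3b: the return may have arrived between the two checks.
    if list_job(jid, minion) is not None:
        return "finished"
    # Step 3c: not running and nothing cached, so it probably failed silently.
    return "finished_no_output"
```

The second cache lookup matters because a minion can deliver its return between the first lookup and the find_job call.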

-mp

mclarkson commented 9 years ago

Thanks again. I was completely stuck!

cachedout commented 9 years ago

Salt is so asynchronous that it's quite easy to get turned around! Glad you've got it figured out. :]

dr4Ke commented 9 years ago

From the master's perspective, could we assume that a job has finished across all minions when the result of salt-run jobs.active does not contain the job ID?

cachedout commented 9 years ago

Yes, but it isn't the most efficient way to do it. In that approach, you're asking all minions if they have finished the job, not just the minions who were running the job to begin with.

Ideally you would want to stick to the scope of the list of minions returned by the master on job publication because, often, jobs will finish very quickly and it won't be necessary to poll all connected minions to ask if they have finished the job. The workflow should look like:

1) Publish the job
2) Master replies with list of minions
3) Collect returns matching JID
4) If not all minions from step 2 have returned, poll only those which have not finished.

Alternatively, you could use 'salt-run jobs.active', with the caveat that it will ask all minions if they are running the job and not just those you targeted in step 1.
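Step 4 can be kept narrow by using list targeting so that only the outstanding minions are asked. The sketch below assumes a client whose `cmd` method accepts a list target with `tgt_type="list"`, as `salt.client.LocalClient.cmd` does in recent releases (older releases named this parameter `expr_form`), and that saltutil.find_job returns an empty dictionary when the job is not running. The function name is illustrative.

```python
def still_running(client, jid, expected, returned):
    """Ask only the minions that have not returned whether the job is running.

    client   -- object with cmd(tgt, fun, arg, tgt_type=...); assumed to
                behave like salt.client.LocalClient.cmd
    expected -- minion IDs reported by the master at publish time
    returned -- mapping (or iterable) of minion IDs that already returned
    """
    outstanding = sorted(set(expected) - set(returned))
    if not outstanding:
        return set()  # everyone has returned, nothing to poll
    replies = client.cmd(outstanding, "saltutil.find_job", [jid], tgt_type="list")
    # An empty reply means that minion is no longer running the job.
    return {minion for minion, info in (replies or {}).items() if info}
```

An empty result with minions still outstanding suggests those minions finished (or never started) without delivering a return.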

-mp

dr4Ke commented 9 years ago

Oh. I didn't realize that. Thank you.

Is the workflow you describe achievable with the standard salt* commands? Or is it available through the API? In my mind, the salt command does the first 3 steps and then waits for completion before printing the result, doesn't it?

Maybe I'm trying to solve a problem I should not have in the first place. What happens is that, sometimes, the salt command returns without waiting for all minions to complete (just some, or even none). Then I use salt-run jobs.list_jobs, or salt-run jobs.active and salt-run jobs.lookup_jid JID, to check the results. It happens quite often, so I wrote a simple script to make that easier.

cachedout commented 9 years ago

The workflow described is what's used by the 'salt' command behind the scenes.

If you're seeing premature exits, that's obviously a problem. What version of Salt are you running that you see this behaviour on? We have fixed a large number of bugs in this area in recent months and upgrading could solve your issue.

-mp

dr4Ke commented 9 years ago

Yep. I'm using 2014.1.10 these days. I realized today that 2014.7.0 was out. I have some tests to do before I can update to this version, as I maintain some local states and modules.

I think I'll open another issue if I continue to have this odd behavior. Thanks for your answers.

cachedout commented 9 years ago

Sounds good. If you do see those premature exits after upgrading, please don't hesitate to file a GitHub issue with us so that we can figure out what's going on. Thanks!

-mp
