Closed Deshke closed 5 years ago
Thanks for bringing this up; I believe the behavior you are seeing is the expected behavior. The command salt <minion> file.access is executed successfully in both cases, so the return value is 0 in both cases, regardless of the result of the file.access function itself.
Hey, sorry. I would expect different behavior: if my test fails, the return code should be 1 (http://tldp.org/LDP/abs/html/exitcodes.html), regardless of whether my salt command itself runs successfully.
salt <minion> file.file_exists /some/file/that/doesn't/exists ; echo $?
False
0
salt <minion> file.file_exists /some/file/that/exists ; echo $?
True
0
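Since the salt invocation itself exits 0 in both cases, the only way to get a usable exit status today is to inspect the command's output. A minimal sketch, assuming the command is run with --out=json so the output is machine-readable (the {"minion_id": result} shape mirrors the examples above):

```python
import json

def exit_code_from_output(raw_json: str) -> int:
    """Map salt --out=json output ({"minion_id": result}) to a shell-style
    exit code: 0 only if every minion reported a truthy result."""
    results = json.loads(raw_json)
    return 0 if results and all(results.values()) else 1

# Mirrors the file.file_exists examples above
print(exit_code_from_output('{"minion1": false}'))  # missing file -> 1
print(exit_code_from_output('{"minion1": true}'))   # existing file -> 0
```

An empty result set (no minions matched) is also treated as failure here; that is a design choice of this sketch, not Salt behavior.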
My reason for thinking a return of 0 was appropriate in this circumstance is that I am thinking of a return code as local to a system. Since salt runs successfully as intended on the master, its return code would not be affected by the state of the remote minion. I admit that my reasoning on this is not very solid, and I am inclined to support your arguments instead.
A further consultation of the applicable standards, as you have suggested, is needed. I believe there is an issue somewhere about generally refactoring return statuses/return codes. I can't find it at the moment, but it ought to be referenced here.
Here it is: #7013.
@jfindlay , thanks for the info on #7013. This bug should be classified above "Low Severity", IMO.
@tjyang, you're right. It seems that not all the return code issues were resolved with #11337.
This is the plan for unifying and standardizing salt exit codes.
@thatch45, @cachedout, @basepi, or @UtahDave may have further commentary on this issue.
+1
+1
+1
I've just raised/tested a few other similar issues that are particularly important to me. Searching for "zero exit" gives a list of issues related to the lack of error indication in the CLI.
This effectively makes automation around Salt commands very inconvenient (and automation is much of the reason Salt is used in the first place). The problem seems pervasive across Salt: many CLI commands, many functions called through the CLI, etc.
For example, the following are some use cases which require developing custom scripts to analyze Salt output just to get a Failed/Succeeded result:
relates to #2417.
Hi, +1 @uvsmtid for any Jenkins jobs, and anything using state.orchestrate.
I've just revisited this issue due to related problems and remembered that I wrote an ad-hoc script to sniff failures from Salt output as part of a Salt-based automation project. The script, however, is a single standalone Python file. Some additional details are in this SO answer.
@uvsmtid, do you know if there are other GitHub issues that document the problems you mention in that Stack Overflow answer? I have some ideas to automate plugin conformance checking with unit and possibly integration testing. This is a big project that I've been wanting to do for a while and need to set aside time for, but first I need to collect the scattered wisdom and user experience on the details for which no clear convention has been either implemented or documented. I think most users would agree that consistent, standard plugin behavior is much more important than anyone's favorite plugin behavior.
@uvsmtid, also, my Stack Overflow credit is too low to comment there, but if you want to parse the output that is returned by salt, you may have more success writing a custom returner that ensures a valid JSON data structure is output. I know there is at least one related open issue for the default returner, which I am having trouble finding; it prioritizes real-time info over producing valid JSON, since you don't know when or if minions will return before the timeout.
@jfindlay, I didn't look up the output issues (1: more than one JSON object, 2: mixed stdout of JSON and non-JSON); those came directly from my observations on the spot. The custom returner may be better, but I haven't written any so far, so it's something to look into.
We can split concerns at this point: (A) exit codes and (B) output format/convention/protocol. If you create a feature issue regarding "automate plugin conformance", I'll help find or create non-duplicate bug issues and link them under your umbrella issue to track.
@uvsmtid, thanks for the feedback. The stdout issues are going to have to be addressed in the returner/outputter context, and you are welcome to open an issue about that, because I've failed to find the other issue(s). I've dumped all my ideas about exec and state modules in #25142 and #25143, respectively.
+1
A big reason for us selecting a tool like salt is to orchestrate deployments.
@uvsmtid - thank you for your contributions to this issue
+1 from me too
@jfindlay so 2015.8 is which code name?
2015.8.0 is codenamed Beryllium.
Boron is tentatively targeted for 2015.11.0
I'm amazed this was originally labelled as low priority!!!
+1
@rgardam, welcome to the party :) ...nothing should surprise you here, I'd say.
Where is the proper place to cheer for this issue? I just integrated a runner into my Buildbot workflow, but I would need to do a jobs.lookup_jid on an ID parsed from the output of the original runner, and in turn parse the YAML output of the lookup and examine the success key.
I'd much rather have $? ...
My actual issue is that I see a return code of 0 even when my runner explodes (e.g., a Python exception occurs), which is not very useful.
@bmcorser, you've found the right place, cheer away.
+1
Being able to immediately stop builds when test.ping fails, just by checking $? -ne 0, would be nice.
+1
We're using Salt as part of a continuous delivery workflow, and it's really problematic that when running salt '*' state.highstate on the master and states fail on some minions (with Result: False in the logs), the salt command still exits with 0. We'll have to resort to parsing log files to find out whether the one thing that Salt does in that workflow (configuration management) actually succeeded or not.
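Until the salt command itself reflects state failures, a CI step has to dig the Result: False entries out of the output. A rough sketch, assuming state.highstate is run with --out=json and that the output has the observed shape {"minion": {"state_id": {"result": bool, ...}}} (not a documented stable format):

```python
import json

def failed_states(raw_json: str):
    """Collect (minion, state_id) pairs whose 'result' key is False."""
    data = json.loads(raw_json)
    failures = []
    for minion, states in data.items():
        for state_id, info in states.items():
            if info.get("result") is False:
                failures.append((minion, state_id))
    return failures

# A hand-written sample in the observed shape
sample = '{"web1": {"nginx_pkg": {"result": false}, "nginx_conf": {"result": true}}}'
print(failed_states(sample))  # [('web1', 'nginx_pkg')]
```

A wrapper script can then exit nonzero whenever this list is non-empty, which is the behavior the workflow above needs.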
+1, an exit code standard is required. Running state.* vs. file.file_exists (or test.ping, etc.) should be treated differently. The target should also have an impact, e.g. '*' vs. a real hostname or set of hostnames: salt '*' state.highstate (soft error) vs. salt 'hostname' state.highstate (hard error). There is something of a standard already: http://tldp.org/LDP/abs/html/exitcodes.html. If something does not go to plan, the exit code should indicate that the logs need to be reviewed.
Maybe add an option to indicate that the caller is running salt as part of a script, or a set of options indicating what the caller cares about, e.g. missing hosts, failure to run on a host, or not caring whether it runs on a host.
Maybe a bit-based exit code for state.apply/highstate:
0 - all as expected, no need to review the output
1 - not all targets reachable
2 - issues applying states
4 -
8 -
16 -
32 - (last one, since exit code 127 is reserved for "command not found")
so 1+2 = exit code 3
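The bit-flag scheme above can be sketched in a few lines; the flag names here are invented for illustration and do not exist in Salt:

```python
# Hypothetical flags matching the proposal above
ALL_OK              = 0
TARGETS_UNREACHABLE = 1  # not all targets reachable
STATE_APPLY_FAILED  = 2  # issues applying states

def combined_exit_code(*flags: int) -> int:
    """OR the individual failure flags into a single exit code."""
    code = 0
    for flag in flags:
        code |= flag
    return code

print(combined_exit_code(TARGETS_UNREACHABLE, STATE_APPLY_FAILED))  # 3
```

The caller can then test individual bits (code & STATE_APPLY_FAILED) to distinguish failure classes without parsing any output.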
+1
The whole CI+Salt setup is insanely powerful, but somewhat limited by this.
It is also important to expose this to the salt-api.
Here is the output from a 2015.8.8 salt command invocation of cmd.run 'false 1':
# salt 'web1' cmd.run 'false 1'
web1:
ERROR: Minions returned with non-zero exit code
While the API does not include the necessary info when sending to the / endpoint:
{
"return": [
{
"web1": ""
}
]
}
When you lookup the job via /jobs/JID you get slightly more info, but still nothing useful for determining exit status.
{
"info": [
{
"Arguments": [
"false 1"
],
"Function": "cmd.run",
"Minions": [
"web1"
],
"Result": {
"web1": {
"return": ""
}
},
"StartTime": "2016, Apr 11 15:59:42.643646",
"Target": "web1",
"Target-type": "glob",
"User": "jenkins",
"jid": "20160411155942643646"
}
],
"return": [
{
"web1": ""
}
]
}
and this same issue occurs with the service execution module. Here is an example of a service failing to restart:
{
"info": [
{
"Arguments": [
"klsdklsd"
],
"Function": "service.restart",
"Minions": [
"web1"
],
"Result": {
"web1": {
"return": false
}
},
"StartTime": "2016, May 27 14:08:28.970138",
"Target": "web1",
"Target-type": "glob",
"User": "jenkins",
"jid": "20160527140828970138"
}
],
"return": [
{
"web1": false
}
]
}
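Because the API payloads above carry no retcode at all, a caller today can only apply a heuristic over the per-minion 'return' values. A sketch of that workaround (the payload shape is copied from the responses above; treating a falsy return as failure is a guess, not a documented contract, and a successful cmd.run with empty output is indistinguishable, which is exactly the problem):

```python
def suspected_failures(job_info: dict):
    """Flag minions whose 'return' value is falsy ('' or false in the
    payloads above). Heuristic only: no real retcode is available."""
    result = job_info["info"][0]["Result"]
    return [minion for minion, data in result.items() if not data.get("return")]

# Shape taken from the service.restart response above
job = {"info": [{"Result": {"web1": {"return": False}}}]}
print(suspected_failures(job))  # ['web1']
```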
+1 How come there is still no way of knowing the exit status of a remote command (cmd.run)? salt-run jobs.print_job [jid] doesn't return anything status-related... :(
This not being addressed as a top priority is pure negligence. How do you think salt* usage can be scripted without this bug being resolved? How do you test your executables if your exit codes don't have a meaning?
+1 This issue has caused great pain in our normal deployment process; it has already caused a few production live-site incidents. We have to apply other ad-hoc detection of the deployment result, which is really unnecessary. For state.highstate we at least have the succeeded and failed summary, but for others such as state.apply or cmd.run we have to review all the output and check whether the command really succeeded or not.
I can't believe this is classified as low priority. It was opened 2 years ago and is still open.
This is a really serious compliance issue. How can I make sure, without parsing output, that all my states applied correctly or will apply correctly?
+1
:+1:
@blacked We are using serverspec on every salt run to make sure that the states are actually applied correctly.
@meggiebot Please include me when this is added to a sprint for Carbon. I would like to coordinate these changes with salt-api and Salt Enterprise so that all three are on the same page going forward.
+1
Isn't this what --retcode-passthrough is supposed to do?
I could be mistaken, but I believe this is what --retcode-passthrough is supposed to expose.
@whiteinge, somehow I'm just not understanding the nuances of your term "expose". I get the feeling I'm missing something.
My understanding is that when --retcode-passthrough is specified, the aggregate exit status of the command is returned rather than the status of the salt executable itself.
As far as compliance goes (@blacked), it would be reasonable to query the job cache to see if any of the retcodes were non-zero. (Yes, I get that it's somewhat awkward.)
retcode-passthrough is not generic for all salt tools:
[root@salt-master ~]# salt-cloud --retcode-passthrough
Usage: salt-cloud
salt-cloud: error: no such option: --retcode-passthrough
Right, --retcode-passthrough does this, and it does not really make a lot of sense for salt-cloud, because salt-cloud should just give us good retcodes.
The real issue here is that we need to create a standard retcode definition and location, then go through the execution modules and normalize the retcodes to the standard.
I would expect that there would be a range of retcodes for the specific tool and then another range for failures in anything that the tool manages. For example, 0-127 might represent status for the specific tool (viz. salt), where subprocesses/managed devices would get an automatic passthrough of their retcode+128. That way, if all you care about is pass/fail, you get it without the --retcode-passthrough option. If you want to distinguish between a failure in the invoked tool vs. the managed devices, you could test which range the retcode falls in. One thing to keep in mind is that 128+SIGNUM is conventionally used for termination by signal. Often on POSIX, exit status is only 8 bits rather than the full int, although newer calls that use waitid() have access to the full int (which most software uses by now).
I have no idea what Windows does.
We could certainly compress to a narrow subrange just for minion retcode pass-through.
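The range split described in that comment might look like the following; the numbers are only the suggestion from this thread, not anything Salt implements, and as noted the upper band collides with the shell's own 128+SIGNUM convention:

```python
def passthrough_code(minion_rc: int) -> int:
    """Shift a minion's retcode into the 128-255 band so it cannot collide
    with the 0-127 band reserved for the salt tool itself. POSIX shells
    usually see only 8 bits of exit status, hence the masking."""
    return 128 + (minion_rc & 0x7F)

def is_minion_failure(code: int) -> bool:
    """True if an exit code falls in the hypothetical pass-through band."""
    return code >= 128

print(passthrough_code(1))     # 129
print(is_minion_failure(129))  # True
print(is_minion_failure(2))    # False
```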
There is also an additional problem: several of the CLI tools have broken exit status handling where some of them always exit with status 0
even when there is an error. I have been working through a set of fixes as well as integration tests for many of the CLI tools. Some of them are submitted and merged and others are waiting acceptance.
Examples: * #33762 * 1001d2b9e7e630c005cb24d5c43211f5d47d1b11 * 48fc51cb7b39b2d02c5b4b7adcc904962180a54b
Similar problem with salt-call: it exits with 0 even when there's a serious error, e.g. "Unable to render top file". --retcode-passthrough does not remediate this.
The best available workaround is to tee the output to a file and then grep it for error messages.
Give me a working example and I will see what I can do
ZD-935
I fixed one of these bugs, #35964.
Is there a reason why test.* returns False but exits with a bash exit code of 0 instead of 1?