Closed Deshke closed 5 years ago
Thanks for bringing this up; I believe the behavior you are seeing is the expected behavior. The command salt <minion> file.access is executed successfully in both cases, so the return value is 0 in both cases, regardless of the result of the file.access function itself.
Hey, sorry. I would expect different behavior: if my test fails, the return code should be 1 (http://tldp.org/LDP/abs/html/exitcodes.html), regardless of whether my salt command itself runs successfully.
salt <minion> file.file_exists /some/file/that/doesn't/exists ; echo $?
False
0
salt <minion> file.file_exists /some/file/that/exists ; echo $?
True
0
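Since the salt invocation itself exits 0 in both cases, the only way to get a usable exit status today is to inspect the command's output. A minimal sketch, assuming the command is run with --out=json so the output is machine-readable (the {"minion_id": result} shape mirrors the examples above):

```python
import json

def exit_code_from_output(raw_json: str) -> int:
    """Map salt --out=json output ({"minion_id": result}) to a shell-style
    exit code: 0 only if every minion reported a truthy result."""
    results = json.loads(raw_json)
    return 0 if results and all(results.values()) else 1

# Mirrors the file.file_exists examples above
print(exit_code_from_output('{"minion1": false}'))  # missing file -> 1
print(exit_code_from_output('{"minion1": true}'))   # existing file -> 0
```

An empty result set (no minions matched) is also treated as failure here; that is a design choice of this sketch, not Salt behavior.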
My reason for thinking a return of 0 was appropriate in this circumstance is that I am thinking of a return code as local to a system. Since salt runs successfully as intended on the master, its return code would not be affected by the state of the remote minion. I admit that my reasoning on this is not very solid, and I am inclined to support your arguments instead.
A further consultation of the applicable standards, as you have suggested, is needed. I believe there is an issue somewhere about generally refactoring return statuses/return codes. I can't find it at the moment, but it ought to be referenced here.
Here it is: #7013.
@jfindlay , thanks for the info on #7013. This bug should be classified above "Low Severity", IMO.
@tjyang, you're right. It seems that not all the return code issues were resolved with #11337.
This is the plan for unifying and standardizing salt exit codes.
@thatch45, @cachedout, @basepi, or @UtahDave may have further commentary on this issue.
+1
+1
+1
I've just raised/tested a few other similar issues that are particularly important to me. Searching for "zero exit" gives a list of issues related to the lack of error indication in the CLI.
This effectively makes automation around Salt commands very inconvenient (and automation is much of the reason Salt is used in the first place). The problem seems pervasive across Salt: many CLI commands, many functions called through the CLI, etc.
For example, the following are some use cases which require developing custom scripts to analyze Salt output just to get a Failed/Succeeded result:
relates to #2417.
Hi, +1 @uvsmtid for any Jenkins jobs, and anything using state.orchestrate.
I've just revisited this issue due to related problems and remembered that I wrote an ad-hoc script to sniff failures from Salt output as part of a Salt-based automation project. The script, however, is a single standalone Python file. Some additional details are in this SO answer.
@uvsmtid, do you know if there are other GitHub issues that document the problems you mention in that Stack Overflow answer? I have some ideas to automate plugin conformance checking with unit and possibly integration testing. This is a big project that I've been wanting to do for a while and need to set aside time for, but first I need to collect the scattered wisdom and user experience on the details for which no clear convention has been either implemented or documented. I think most users would agree that consistent, standard plugin behavior is much more important than anyone's favorite plugin behavior.
@uvsmtid, also, my Stack Overflow credit is too low to comment there, but if you want to parse the output that is returned by salt, you may have more success writing a custom returner that ensures a valid JSON data structure is output. I know there is at least one related open issue for the default returner, which I am having trouble finding; it prioritizes real-time info over producing valid JSON, since you don't know when or if minions will return before the timeout.
@jfindlay, I didn't look up the output issues (1: more than one JSON object, 2: mixed stdout of JSON and non-JSON); those came directly from my observations on the spot. The custom returner may be better, but I haven't written any so far, so it's something to look into.
We can split concerns at this point: (A) exit codes and (B) output format/convention/protocol. If you create a feature issue regarding "automate plugin conformance", I'll help find or create non-duplicate bug issues and link them under your umbrella issue to track.
@uvsmtid, thanks for the feedback. The stdout issues are going to have to be addressed in the returner/outputter context, and you are welcome to open an issue about that, because I've failed to find the other issue(s). I've dumped all my ideas about exec and state modules in #25142 and #25143, respectively.
+1
A big reason for us selecting a tool like salt is to orchestrate deployments.
@uvsmtid - thank you for your contributions to this issue
+1 from me too
@jfindlay so 2015.8 is which code name?
2015.8.0 is codenamed Beryllium.
Boron is tentatively targeted for 2015.11.0
I'm amazed this was originally labelled as low priority!!!
+1
@rgardam, welcome to the party :) ...nothing should surprise you here, I'd say.
Where is the proper place to cheer for this issue? I just integrated a runner into my Buildbot workflow, but I would need to do a jobs.lookup_jid on an ID parsed from the output of the original runner, and in turn parse the YAML output of the lookup and examine the success key.
I'd much rather have $? ...
My actual issue is that I see a return code of 0 even when my runner explodes (e.g., a Python exception occurs), which is not very useful.
@bmcorser, you've found the right place, cheer away.
+1
Being able to immediately stop builds when test.ping fails, just by checking $? -ne 0, would be nice.
+1
We're using Salt as part of a continuous delivery workflow, and it's really problematic that when running salt '*' state.highstate on the master and states fail on some minions (with Result: False in the logs), the salt command still exits with 0. We'll have to resort to parsing log files to find out whether the one thing that Salt does in that workflow (configuration management) actually succeeded or not.
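Until the salt command itself reflects state failures, a CI step has to dig the Result: False entries out of the output. A rough sketch, assuming state.highstate is run with --out=json and that the output has the observed shape {"minion": {"state_id": {"result": bool, ...}}} (not a documented stable format):

```python
import json

def failed_states(raw_json: str):
    """Collect (minion, state_id) pairs whose 'result' key is False."""
    data = json.loads(raw_json)
    failures = []
    for minion, states in data.items():
        for state_id, info in states.items():
            if info.get("result") is False:
                failures.append((minion, state_id))
    return failures

# A hand-written sample in the observed shape
sample = '{"web1": {"nginx_pkg": {"result": false}, "nginx_conf": {"result": true}}}'
print(failed_states(sample))  # [('web1', 'nginx_pkg')]
```

A wrapper script can then exit nonzero whenever this list is non-empty, which is the behavior the workflow above needs.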
+1, an exit code standard is required. Running state.* vs. file.file_exists (or test.ping, etc.) should be treated differently. The target should also have an impact, e.g. '*' vs. a real hostname or set of hostnames: salt '*' state.highstate (soft error) vs. salt 'hostname' state.highstate (hard error). There is something of a standard already: http://tldp.org/LDP/abs/html/exitcodes.html. If something does not go to plan, the exit code should indicate that the logs need to be reviewed.
Maybe add an option to indicate that the caller is running salt as part of a script, or a set of options indicating what the caller cares about, e.g. missing hosts, failure to run on a host, or not caring whether it runs on a host.
Maybe a bit-based exit code for state.apply/highstate:
0 - all as expected, no need to review the output
1 - not all targets reachable
2 - issues applying states
4 -
8 -
16 -
32 - (last one, since exit code 127 is reserved for "command not found")
so 1+2 = exit code 3
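The bit-flag scheme above can be sketched in a few lines; the flag names here are invented for illustration and do not exist in Salt:

```python
# Hypothetical flags matching the proposal above
ALL_OK              = 0
TARGETS_UNREACHABLE = 1  # not all targets reachable
STATE_APPLY_FAILED  = 2  # issues applying states

def combined_exit_code(*flags: int) -> int:
    """OR the individual failure flags into a single exit code."""
    code = 0
    for flag in flags:
        code |= flag
    return code

print(combined_exit_code(TARGETS_UNREACHABLE, STATE_APPLY_FAILED))  # 3
```

The caller can then test individual bits (code & STATE_APPLY_FAILED) to distinguish failure classes without parsing any output.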
+1
The whole CI+Salt setup is insanely powerful, but somewhat limited by this.
It is also important to expose this to the salt-api.
Here is the output from a 2015.8.8 salt command invocation of cmd.run 'false 1':
# salt 'web1' cmd.run 'false 1'
web1:
ERROR: Minions returned with non-zero exit code
While the API does not include the necessary info when sending to the / endpoint:
{
"return": [
{
"web1": ""
}
]
}
When you lookup the job via /jobs/JID you get slightly more info, but still nothing useful for determining exit status.
{
"info": [
{
"Arguments": [
"false 1"
],
"Function": "cmd.run",
"Minions": [
"web1"
],
"Result": {
"web1": {
"return": ""
}
},
"StartTime": "2016, Apr 11 15:59:42.643646",
"Target": "web1",
"Target-type": "glob",
"User": "jenkins",
"jid": "20160411155942643646"
}
],
"return": [
{
"web1": ""
}
]
}
and this same issue occurs with the service execution module. Here is an example of a service failing to restart:
{
"info": [
{
"Arguments": [
"klsdklsd"
],
"Function": "service.restart",
"Minions": [
"web1"
],
"Result": {
"web1": {
"return": false
}
},
"StartTime": "2016, May 27 14:08:28.970138",
"Target": "web1",
"Target-type": "glob",
"User": "jenkins",
"jid": "20160527140828970138"
}
],
"return": [
{
"web1": false
}
]
}
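Because the API payloads above carry no retcode at all, a caller today can only apply a heuristic over the per-minion 'return' values. A sketch of that workaround (the payload shape is copied from the responses above; treating a falsy return as failure is a guess, not a documented contract, and a successful cmd.run with empty output is indistinguishable, which is exactly the problem):

```python
def suspected_failures(job_info: dict):
    """Flag minions whose 'return' value is falsy ('' or false in the
    payloads above). Heuristic only: no real retcode is available."""
    result = job_info["info"][0]["Result"]
    return [minion for minion, data in result.items() if not data.get("return")]

# Shape taken from the service.restart response above
job = {"info": [{"Result": {"web1": {"return": False}}}]}
print(suspected_failures(job))  # ['web1']
```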
+1 How come there is still no way of knowing the exit status of a remote command (cmd.run)? salt-run jobs.print_job [jid] doesn't return anything status-related... :(
This not being addressed as a top priority is pure negligence. How do you think salt* usage can be scripted without this bug being resolved? How do you test your executables if your exit codes don't have a meaning?
+1 This issue has caused great pain in our normal deployment process; it has already caused a few production live-site incidents. We have to apply other ad-hoc detection of the deployment result, which is really unnecessary. For state.highstate we at least have the succeeded and failed summary, but for others such as state.apply or cmd.run we have to review all the output and check whether the command really succeeded or not.
I can't believe this is classified as low priority. It was opened 2 years ago and is still open.
This is a really serious compliance issue. How can I make sure, without parsing output, that all my states applied correctly or will apply correctly?
+1
:+1:
@blacked We are using serverspec on every salt run to make sure that the states are actually applied correctly.
@meggiebot Please include me when this is added to a sprint for Carbon. I would like to coordinate these changes with salt-api and Salt Enterprise so that all three are on the same page going forward.
+1
Isn't this what --retcode-passthrough is supposed to do?
I could be mistaken, but I believe this is what --retcode-passthrough is supposed to expose.
@whiteinge, somehow I'm just not understanding the nuances of your term "expose". I get the feeling I'm missing something.
My understanding is that when --retcode-passthrough is specified, the aggregate exit status of the command is returned rather than the status of the salt executable itself.
As far as compliance goes (@blacked), it would be reasonable to query the job cache to see if any of the retcodes were non-zero. (Yes, I get that it's somewhat awkward.)
retcode-passthrough is not generic for all salt tools:
[root@salt-master ~]# salt-cloud --retcode-passthrough
Usage: salt-cloud
salt-cloud: error: no such option: --retcode-passthrough
Right, --retcode-passthrough does this, and it does not really make a lot of sense for salt-cloud, because salt-cloud should just give us good retcodes.
The real issue here is that we need to create a standard retcode definition and location, then go through the execution modules and normalize the retcodes to the standard.
I would expect that there would be a range of retcodes for the specific tool and then another range for failures in anything that the tool manages. For example, 0-127 might represent status for the specific tool (viz. salt), where subprocesses/managed devices would get an automatic passthrough of their retcode+128. That way, if all you care about is pass/fail, you get it without the --retcode-passthrough option. If you want to distinguish between a failure in the invoked tool vs. the managed devices, you could test which range the retcode falls in. One thing to keep in mind is that 128+SIGNUM is conventionally used for termination by signal. Often on POSIX, exit status is only 8 bits rather than the full int, although newer calls that use waitid() have access to the full int (which most software uses by now).
I have no idea what Windows does.
We could certainly compress to a narrow subrange just for minion retcode pass-through.
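The range split described in that comment might look like the following; the numbers are only the suggestion from this thread, not anything Salt implements, and as noted the upper band collides with the shell's own 128+SIGNUM convention:

```python
def passthrough_code(minion_rc: int) -> int:
    """Shift a minion's retcode into the 128-255 band so it cannot collide
    with the 0-127 band reserved for the salt tool itself. POSIX shells
    usually see only 8 bits of exit status, hence the masking."""
    return 128 + (minion_rc & 0x7F)

def is_minion_failure(code: int) -> bool:
    """True if an exit code falls in the hypothetical pass-through band."""
    return code >= 128

print(passthrough_code(1))     # 129
print(is_minion_failure(129))  # True
print(is_minion_failure(2))    # False
```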
There is also an additional problem: several of the CLI tools have broken exit status handling where some of them always exit with status 0
even when there is an error. I have been working through a set of fixes as well as integration tests for many of the CLI tools. Some of them are submitted and merged and others are waiting acceptance.
Examples: * #33762 * 1001d2b9e7e630c005cb24d5c43211f5d47d1b11 * 48fc51cb7b39b2d02c5b4b7adcc904962180a54b
Similar problem with salt-call: it exits with 0 even when there's a serious error, e.g. "Unable to render top file". --retcode-passthrough does not remediate this.
The best available workaround is to tee the output to a file and then grep it for error messages.
Give me a working example and I will see what I can do
ZD-935
I fixed one of these bugs, #35964.
Is there a reason why test.* returns False but exits with a bash exit code of 0 instead of 1?