Enhancement: `set -e` is ignored in functions when used in the "if" context

mvo5 commented 4 years ago

First let me say THANK YOU for the wonderful shellcheck tool. I use it daily and love it.

For new checks and feature suggestions

[x] https://www.shellcheck.net/ (i.e. the latest commit) currently gives no useful warnings about this
[x] searched through https://github.com/koalaman/shellcheck/issues and didn't find anything related

I found https://github.com/koalaman/shellcheck/issues/1484 which is slightly similar but also different enough I feel to justify it's own issue. If I'm wrong please just close as duplicate.

Here's a snippet or screenshot that shows the problem:

#!/bin/sh
set -e

foo() {
    set -e
    false
    echo "output after failed command even with set -e"
}

if foo; then
    echo
fi

Here's what shellcheck currently says:

$ shellcheck myscript No issues detected!

Here's what I wanted or expected to see:

It would be nice if it would warn that if shell functions are used inside an "if" any set -e is disabled while the function runs. I added the "set -e" in foo() just for demonstration purposes, it's does not change anything.

marc-h38 commented 3 years ago

If the exit code of foo is sometimes tested with if foo and elsewhere not, would you warn or not?

kerolasa commented 3 years ago

I played with the issue a bit, and got myself more confused. set -e is not inherited it needs to be set again within function. Okay, But set -E does not seem to be inherited either. If I add a trap within function (uncomment below line to see effect) it will be active later outside function scope - where return is not right way to exit.

#!/bin/bash
set -e # exit on errors
set -E # inherit ERR trap
trap 'echo trap; exit 1' ERR

foo() {
        set -e
        #trap 'echo function trap; return 1' ERR
        false
        echo "output after failed command even with set -e"
        return 0
}

if foo; then
        echo "still going"
        false
fi

Tested using bash version 5.0.18(1)-release.

kurahaupo commented 3 years ago

Functions can be nested in un-obvious ways; consider:

set -e

f() {
    do_something  # is this exempt?
}
g() {
    echo starting "$1"
    "$@"
}

if g f
then echo Happy
else echo Sad
fi

or

set -e

f() {
    do_something  # is this exempt?
}
g() {
    echo starting "$1"
    if "$@"
    then r=$? # 0
    else r=$?
    fi
    echo finished "$1"
    return "$r"
}

g f

And functions aren't the only way this can happen, they're just the most obfuscating:

Any block with a test condition that's a LIST can exhibit this issue:

# shell equivalent of C's "do-while" loop:
while
 some_command        # unintentionally exempt from set -e and ERR trap
 ((some==test))      # condition for loop
do :; done

Anything inside a compound statement when the whole statement is tested:

a subshell:

foo=$( some_cmd     # un?intentionally exempt from set -e and ERR trap
   another_cmd  # intentionally exempt from set -e and ERR trap
  ) || return

My view is that set -e may be useful for linting and regression testing, but if you're relying on it in production, you're doing it wrong. Rather, critical operations should be checked explicitly:

abort() {
    r=$?
    printf >&2 'Unexpected return code %u when %s\n' "$r" "$1"
    exit "$r"
}

something_complicated                          || abort 'complicated thing failed'

If you're too busy/lazy/in-a-hurry to do anything else, simply preface any critical operations with a standard checker:

critical() {
    "$@" || abort "when attempting $( printf '[%s]' "$@" ) at $( caller 1 )"
}

critical cd "$dir"

marc-h38 commented 3 years ago

"Check every exit code" is nice on paper but never happens in practice; no one is happy to triple their code size. Even if you are very careful and never too busy, someone else will be eventually. If your code can absolutely not tolerate any failure then switch to a safer language: Makefile, Python, etc. Otherwise a combination of: 1. manually checking the most risky commands 2. errexit for an admittedly incomplete bit of extra safety; is not perfect but good enough and like it or not it's popular and successful.

kurahaupo commented 3 years ago

@marc-h38: My overarching point is set -e can sometimes be actively harmful, and the assumption that it's always benign is simply wrong. The inherent assumption that non-zero status means "critical failure" is simply wrong, and quite irrepairable.

The worst harm done by set -e is to the minds of users: it lulls them into a false sense of security.

Advocates for set -e never point out that using it can also cause failures, especially in unusual cases that may not show up during regression testing. Nor do they point out that it's not guaranteed to stop dangerous operations after critical prerequisites have failed.

I'm not recommending manually checking everywhere; I'm saying pay attention and do check where the exit status indicates failure, and don't check where the exit status does not indicate a critical failure.

Given a choice between marking the innocuous commands whose exit status can be ignored, and marking the critical ones, I prefer to err on the side of readability rather than writability: highlight the critical operations, and leave everything else as simple as possible.

marc-h38 commented 3 years ago

My overarching point is set -e can sometimes be actively harmful,

I believed this harmful myth for a long time. Fortunately for me, I eventually saw evidence of the opposite and then successfully used set -e in production for a couple of decades and it saved me and others a lot of time.

You're trying to convince me that something that others and I are using and that you are not is not possible. Good luck?

The inherent assumption that non-zero status means "critical failure" is simply wrong, and quite irrepairable.

The number of unix commands that return non-zero on success is surprisingly small and very easily dealt with.

Advocates for set -e never point out that using it can also cause failures, especially in unusual cases that may not show up during regression testing. Nor do they point out that it's not guaranteed to stop dangerous operations after critical prerequisites have failed.

Never saw anyone claiming that set -e works consistently; we have been reading very different documentations.

kurahaupo commented 3 years ago

We're veering a long way off the issue, but...

I believed this harmful myth for a long time. Fortunately for me, I eventually saw evidence of the opposite and then successfully used set -e in production for a couple of decades and it saved me and others a lot of time.

I believe the largest harm is to the mindset of the programmer. Conversely, knowing that one is working without guard rails on sharpens the mind.

I used to believe as you do now, that set -e would be more help than hindrance, but stopped using it after I experienced a number of quite catastrophic failures where error recover from some minor problem was broken by set -e, leading on one occasion to irreparably deleting over 10 million customer emails.

I've been scripting for a couple of decades since then, and avoiding set -e and applying alternative rigorous approaches has saved me and others a lot of time and money.

The point is not simply to stop using set -e. If the choice is between "sloppy programs without set -e" and "sloppy programs with set -e" then the later would obviously be an improvement. My point is to replace set -e with something better.

My non-exhaustive list of "something better" would include:

Use linting tools (such as Shellcheck) often, preferably every editor save, or every git commit;
- Don't expect Linting tools to solve the halting problem (which is what this feature request amounts to)
Apply skill testing: You have to be this tall to play.
- If you can't explain a script, you shouldn't be modifying it.
- If you don't understand expansions, assignments, quoting, word-splitting, and globbing, you shouldn't be modifying any shell script.
Have peers review code;
- If a script seems to use "quotes" to mean "strings", its author probably failed the previous point. The ability to use quotes selectively shows that they understand the quoting rules, and so they're actually less likely to get them wrong in ways that confuse the linting tools.
Avoid known anti-patterns, and apply good practices;
- Include comments explaining why you need to use an exception, to the satisfaction of the code reviewer.
- Rigorously identify which lines are prerequisite checking, actions, and cleanup, as they have different error-handling characteristics;
- Use || die "explanation" after critical code - or an equivalent that suits your style - so that users won't be left in the dark when things fail, and so that reading down the code shows the non-failure path.
- As far as reasonable, avoid mixing code & data, so that linting tools can work better, and so humans can read the code more easily.
Avoid prescriptive rules based on misunderstandings or overreach, as they will cause resentment and be ignored.
- For example, don't prohibit ((var=expr)) on the grounds that it will fail under set -e when the expression is zero. Instead, recommend that, like any other command whose non-zero status is innocuous, it should be prefaced by ! to avoid triggering an abort.
Do use set -e on sections of code where you know what "failure" means, and where the consequences of simply aborting the shell are acceptable. Inside certain subshells, for instance. Just don't set it blindly at the top of the script.

All that said, if set -e stops people writing code like this:

somecmd
if [ $? = 0 ] ; then ...

then maybe I should change my mind after all. :smiley:

marc-h38 commented 3 years ago

Conversely, knowing that one is working without guard rails on sharpens the mind.

Then great because set -e is far from a complete guard rail. Keeps my mind sharp.

error recover from some minor problem was broken by set -e, leading on one occasion to irreparably deleting over 10 million customer emails.

It's hard to comment without knowing any specific but... if deleting 10 million emails (without any backup?!) was just one set -e away, then it feels like either the wrong language was used, or some of your "better practices" had not been applied, or all of the above.

Long story short and trying to get back to shellcheck: some people are happier without set -e in production, others are happier with it: just run sudo grep -r 'set -e' /etc and see for yourself. Maybe try considering set -e as a different dialect that shellcheck has to make a (reasonable) effort to support because of how popular it is. There are so many shell dialects already anyway.

Not sure why people using set -e in production seem more tolerant of those who don't than the other way round. BTW ~none~ most of your very good practices and advice seem incompatible with set -e; always the case with people recommending "alternatives" to set -e.

kurahaupo commented 3 years ago

I like your idea of considering it a separate dialect, for the purposes of Shellcheck.

Then great because set -e is far from a complete guard rail. Keeps my mind sharp.

Sadly, that's not how some folk think of it. Management tells them to turn on set -e "to make scripts safe", and so having done so, they think they're forevermore "safe".

Not sure why people using set -e in production seem more tolerant of those who don't than the other way round.

Let me tell you (more of) a story:

As is always the case with such critical failures, our deletion event was the result of multiple failures.

There were backups, but the deletion was spread out over such a long period, and generated such little reaction by customers, that the relevant backups had expired before they could be called on.
We already knew that xargs without -0 is not the answer, but we learned the hard way that xargs without -r is never the answer either. :frowning_face:
We didn't have regression testing - in part because we didn't have any test infrastructure (what we'd used for pre-prod testing was now part of production), but mostly because we didn't really expect there to be any changes. (At least, not below the level of whole system replacement.)

There was detection and recovery code that would have stopped this from blowing up, but it was tested before someone in management decided that set -e must be used in new code, and some low-level tech mistook that as a directive to retrofit it to all code, and did so with insufficient (no?) testing, and a rubber-stamp peer review.

Perhaps that's why I now rail so strongly against unequivocal recommendations to use set -e.

(I like to keep this horror story up my sleeve precisely to prevent that misstep.)

In an ideal world, the effects of both set -e and set +e could be limited in scope, to a source file, a function, or an eval; then at least it would be safe -ish to intermix code that uses both approaches.

dharmab commented 3 years ago

set -e may be considered harmful because it has inconsistent behavior across different versions of a given shell:

See http://mywiki.wooledge.org/BashFAQ/105.G, https://fvue.nl/wiki/Bash:_Error_handling, https://www.in-ulm.de/~mascheck/various/set-e/ for real examples.

Personally, I'd love to be able to toggle between a set -e compatible mode for legacy scripts and a set -e disallowed mode for new scripts.

TinCanTech commented 3 years ago

May be shellcheck should just write your code for you ?

koalaman / shellcheck