Error handling in CLI apps

killercup commented 6 years ago

In the first meeting, we've discussed that we should have a best practice for error handling in CLI apps.

killercup commented 6 years ago

From the etherpad:

Not all errors are created equal. For example, I/O pipe errors.
- Providing some official guides on common footguns would go a long way (closed pipes, etc.)
stabilize ? in main (https://github.com/rust-lang/rust/issues/43301)
- help out implementing and testing this issue
have a good story/guidelines around which exit codes to return
failure is looking good but does not yet have clear messaging/documentation about how to architect/structure error handling in your app

TeXitoi commented 6 years ago

Giving a main template. I didn't find anything interesting when I decided to use the failure crate and thus reinvented a basic main with error handling:

fn main() {
    let opt = Opt::from_args();
    env_logger::init();
    if let Err(err) = run(opt) {
        for cause in err.causes() {
            eprintln!("{}", cause);
        }
        std::process::exit(1);
    }
}
fn run(opt: Opt) -> Result<()> {
    // ...
    Ok(())
}

Screwtapello commented 6 years ago

Does "error handling" include non-fatal errors i.e. warnings?

matthiasbeyer commented 6 years ago

@Screwtapello this is about runtime errors, not about compiletime errors, I suppose.

Screwtapello commented 6 years ago

Yes, runtime non-fatal errors, like "timed out connecting to host, retry 1 of 3" or "first line of input contains invalid values, assuming they're headers and skipping them".

matthiasbeyer commented 6 years ago

But there is no such concept as a "warning" in Rust. We are talking about errors in the sense of std::error::Error or Result::Err(_) and how to handle them. A "warning", as you describe it, would be a way to handle certain errors. So either I'm misunderstanding something or the answer is simply: Yes, "warnings" are part of this discussion, as they are basically errors handled in a special way.

TeXitoi commented 6 years ago

I think @Screwtapello talk about warning as "an Err, but that I don't want to propagate, but I don't want to silently discard". I have this use case too. For the moment, I use the log crate for that.

TeXitoi commented 6 years ago

Something like that: https://github.com/CanalTP/navitia_model/blob/88eec02bbdf15d980b34c4507f5331a34a663f21/src/ntfs/read.rs#L234-L244

killercup commented 6 years ago

I'd say yes, non-fatal errors, like "log the failure and ignore/try again", are something we should consider. (It's also something I'd like to improve this quicli guide which does something like that.)

jugglerchris commented 6 years ago

I use logging, but also have a utility to capture any warnings during the execution of a function/closure (I'm using using slog-scope); the caller then gets to decide whether to be strict (if any warnings then print/abort), lenient (ignore the warnings) or somewhere in between ("There were some issues during parsing; do you want to continue anyway?").

technetos commented 6 years ago

failure provides with_context and context. These methods extend rusts Result type and make it really easy to structure excellent error messages that chain together nicely when propagated upwards with ?. There is also the bail! and throw! macros provided by failure that let you propagate up either custom error messages using strings or custom error types.

I have built a decently sized cli app using failure and I would be interested in documenting how to structure a project using failure so that you can propagate errors up and fail fatally or capture them where they happen and act accordingly in a non-fatal way. Would that be done as a PR here or somewhere else? Or would that be considered part of the failure documentation?

Edit: I see that some of this is already underway in the failure repo. Ill look there.

Mark-Simulacrum commented 6 years ago

"Exit status should reflect success or failure" from https://github.com/rust-lang-nursery/cli-wg/issues/21 reflects on this too; I've never been confident using process::exit in some deeply nested function, and always wanted some way to unwind and exit, perhaps via Result, but haven't come up with a good pattern yet.

technetos commented 6 years ago

@Mark-Simulacrum I have found that having most of your functions return Result<T, SomeErrorTrait> combined with heavy use of ? gets you a safe predictable way to percolate errors up to main and exit cleanly.

Mark-Simulacrum commented 6 years ago

It'd be good to document a way to propagate up an exit code to the main function, though I suppose arguably the main function should itself decide what error code to return? I'm not sure what the typical pattern in C/C++ or other languages is as to what code should be responsible for determining which exit status to set.

luser commented 6 years ago

I'm not sure what the typical pattern in C/C++ or other languages is as to what code should be responsible for determining which exit status to set.

IME there's very little convention around this in the C/C++ space. Plenty of existing C programs just call exit(status) in arbitrary places (if not abort()!). C's dismal error handling likely contributes to this. The ? in main RFC has a section with some background on exit status code conventions. I think following the conventions proposed there is totally sensible, such that all you need to do is chain Result<T, Error> up to main.

That RFC also discusses a print_diagnostics_for_error method, so the best bet here might be to ensure that returning something that implements failure::Fail from main prints a useful error message so that doing the right thing is as simple as:

use failure::Error;

fn work() -> Result<(), Error> { ...}

fn main() -> Result<(), Error> {
  work()
}

kalefranz commented 6 years ago

IMHO, error handling in CLI apps shouldn't be all that much different than error handling in web apps. In web app frameworks, you typically insert a middleware-type error handler in your web request call stack. The error handler makes sure stack traces and such don't leak out when the app is running in production mode, but then can give various diagnostic information (django even puts you into a live debugger session) when exceptions occur.

CLI apps can also have exception handlers. For example, if you were to create a custom Error type in your application somewhere, you could give that error type an exit_code field that's respected by the error handler for you application.

Another use of error handlers in CLI apps is to determine what to output, what form, and where, when an exception occurs. Let's say my app is designed for other programs to subprocess to, and I support a --message-format=json flag. Instead of writing the more human-friendly error message to stderr, the flag would trigger a json-structured error written to stdout.

Yet another use of an error handler is a CLI application is to let your application upload details of unexpected runtime errors to some web endpoint for automatic error reporting (once your users opt-in, of course).

uazu commented 6 years ago

To me a CLI app is not intended to be long-lived. It is supposed to do its job and then exit. So just outputting an error message to stderr and returning a non-zero exit status is enough for very many apps. Mostly you can count on the OS to do cleanup. However, for bigger and more complicated apps that perhaps require more complicated cleanup, then yes it makes sense to propagate errors up to main.

Note that there are some valid reasons for using exit statuses apart from 0 and 1. An example might be differentiating soft and hard failures, which can be useful to recognise easily from a script, i.e. to check $? instead of parsing output.

artem-zinnatullin commented 6 years ago

Totally agree with https://github.com/rust-lang-nursery/cli-wg/issues/12#issuecomment-374811986 and few other comments pro-error propagating

Maybe we can document reasoning of why functions/methods other than fn main or ones used as helper functions in fn main shouldn't directly exit program so people naturally decide to do that:

It makes unit testing of such functions pretty much impossible since they terminate the process that runs test. And we should encourage unit testing to make sure we maintain high quality of CLI apps written in Rust 😸
It complicates re-usability of such functions. In some cases process exit might be ~ok for the caller, but in others caller function might want to handle error/exit differently and for instance switch to a fall-back but it's impossible if function terminates the process.

Possible solutions:

Error/exit can be popped to the caller function as a Result:

fn parse_args(rawArgs) -> Result<ParsedArgs, Error> {
  if args.bad {
       Error(…)
    } else {
       ParsedArgs(…)
    }
}

Object-oriented inversion can be used like that:

fn parse_args(rawArgs, &mut parsedArgs, exiter) { 
    if rawArgs.bad {
       exiter.exitWithError(code, message))
    } else {
       parsedArgs = …
    }
}

Of course, I'd vote for functional-style with Result as it keeps most of functions pure. But in some cases like logging a warning from deeply nested function can be easier with object-oriented style.

@uazu

So just outputting an error message to stderr and returning a non-zero exit status is enough for very many apps. … However, for bigger and more complicated apps that perhaps require more complicated cleanup, then yes it makes sense to propagate errors up to main.

Hope unit testing is a good reason to switch to error propagation even in small CLI apps :)

uazu commented 6 years ago

Hope unit testing is a good reason to switch to error propagation even in small CLI apps :)

No, it isn't a good reason. Really it would be best to test CLI apps as they are intended to be run, i.e. as separate processes. So make unit tests do fork/exec. There is a lot you'd have to fake up to run main() in the same process and not have the app-under-test notice, and to survive everything that might go wrong and not crash your test-runner.

However there are other reasons such as making valgrind output useful, which I could agree with. Sometimes, though, tearing down all your memory structures and closing everything is just completely unnecessary and a waste of processing time, and you just want to exit().

Edit: I hope I didn't put that too strongly. Yes, tearing down everything and returning to main does have its uses, but exit() does too. I've seen people massively overengineer stuff that should be ever so simple to code. So allow simple things to be simple. However if you're writing a huge app, then structure and rules are definitely going to be your friends rather than your enemy.

artem-zinnatullin commented 6 years ago

Why not both 😸

Really it would be best to test CLI apps as they are intended to be run, i.e. as separate processes.

Well, you can say that for all the code, even a library is intended to eventually be run as part of some process :) Doesn't mean it shouldn't have unit-tests.

Splitting code into small pure/-ish functions and unit-testing them greatly helps in development and maintaining quality.

Moreover, you can treat 99% of the code of a CLI app as a library(s) and fn main() as a user of that library(s).

However I totally agree that CLI apps should ideally have Integration/Functional tests as well 👍. As you said it is the original purpose of them — to be run as separate process.

Here is an example of CLI Rust project with both Unit and Integration/Functional tests: Mainframer

Unit tests in Rust src/
Integration/Functional tests in Bash test/
How we run them in Docker ci/build.sh

uazu commented 6 years ago

Okay, got it. Yes, anything that's coded like a library needs to follow library-like rules (like passing back errors, and having independent tests).

The scenario I'm working with is small (2000-line) CLI-style apps that run on the end of pipes that we use to isolate our main server process from hardware driver issues. On failure, they should report to their output and wait to be killed. If they are unable to report the error or anything else goes wrong they must die immediately, which will be recognised by the parent process. Doing a full teardown is risky. A close may block indefinitely. We want the OS to do the cleanup. In this case, whilst we may have some parts structured like libraries, other parts clearly aren't, and for very good reason will call exit() on failure.

This to me is closer to the UNIX concept of CLI apps as disposable components, e.g. you ^C a whole pipeline without ever thinking about how that is dealt with. Most will just exit immediately and the OS will do cleanup. This is instantaneous. An app that doesn't die immediately is the (irritating) exception to the rule.

epage commented 6 years ago

@BurntSushi

CLI tools should exit gracefully when a "broken pipe" error occurs.

From https://github.com/rust-lang-nursery/cli-wg/issues/21

Is there anything we need to do here to help with that?

BurntSushi commented 6 years ago

@epage It doesn't happen by default. If you treat it like any other error, then you're likely to exit with the wrong code. If you ignore it completely (which will happen if you're using println!), then your program will happily march along even though the consumer of your process's output has hung up and is no longer listening.

I don't know how to fix this necessarily, but at the very least, it could be documented somewhere that it is a footgun.

kamalmarhubi commented 6 years ago

I think something as simple as a library function that returns Read / Write wrappers that treat broken pipe as EOF could work?

uazu commented 6 years ago

For output to a broken pipe on UNIX, you'd normally get a SIGPIPE signal. Or does something disable that? (See: man 7 pipe)

BurntSushi commented 6 years ago

Yes. I believe std disables/masks it.

uazu commented 6 years ago

@BurntSushi Okay, I've found it: https://github.com/rust-lang/rust/blob/f5631d9ac7745dd6eaea2bc6c236d5f8e54e9a18/src/libstd/sys/unix/mod.rs#L71

Perhaps the CLI WG needs to make a decision about this, and other UNIX signals too. (I don't know how things work on Windows.) There are two cases as far as I can see:

CLI tools that don't need to clean up anything special. These can handle SIGINT, SIGTERM, SIGPIPE, etc by just letting the OS cleanup the process. So the OS will close files, pipes, streams etc and release all memory -- but unflushed data in software buffers won't get written. This is fine for very many CLI apps.
CLI tools that need to do a little bit of network communication before terminating, or that need to flush to files or databases, or do other cleanup. These need to catch and handle SIGINT, SIGTERM, etc, and recognise and handle errors (e.g. errno=EPIPE) everywhere they may occur

I can imagine that for consistency, you might want to recommend using the second option everywhere, for example for those people who want to see all errors propagate back to main. This causes a bit more work. In that case handling signals and EPIPE needs to be made consistent (e.g. println! mentioned above). Also you need to consider how to handle SIGINT (^C). The signal handler could set an internal flag that some loop could recognise, and then propagate an error back to main. However probably you'd need to make ^C^C do an exit() to give the user a way out just in case nothing notices that the flag has been set. So really trying to catch the signals makes more work, but if you need it to do cleanup then it is necessary, and if you want control to always come back to main it is also necessary.

Does anyone have any thoughts or ideas about this?

uazu commented 6 years ago

I had a quick look at errors getting silently dropped. Testing here on Linux, println! doesn't drop the EPIPE, but panics instead, which is better than ignoring it. But there's at least one other case of silently dropping errors: FileDesc::drop(). Perhaps there are other examples. (Would it be worth having a ticket to build up a list of these?)

So there may be various sources of errors which are outside the normal error propagation path:

SIGINT or SIGTERM for the case where we don't want the OS to terminate us automatically (i.e. user presses ^C or another process does a kill to our PID)
Unreportable errors, e.g. FileDesc::drop
Maybe SIGPIPE too

So these could be handled by a single mechanism:

Have a short queue of errors stored in a unsafe global
Add something to the queue on ^C or unreportable error or whatever
Have a function to check whether there was an error queued and return it
Call that function in the main loop of the CLI app somewhere, and propagate errors back to main

This isn't very nice really, but what other mechanism do we have to deal with stuff like this?

(Also, you're quite limited in what you can call in a signal handler (see man signal-safety on Linux). So maybe a global flag may be the only way for SIGINT and SIGTERM, rather than trying to lock an unsafe global array.)

BurntSushi commented 6 years ago

I had something simpler in mind. Basically, either of these behaviors are undesirable:

Ignore all errors when writing to stdout.
Panic when an error occurs while writing to stdout.

The problem is that (2) happens by default if you use println!, and (1) is an easy thing to slide into if you replace println!(...); with let _ = writeln!(io::stdout(), ...);. I think it is really unfortunate that println! does the wrong thing here, which basically means that one almost never wants to use println! in a CLI tool. You might be able to get away with it if you're only printing a small amount of output, but otherwise, it's a footgun that can appear any time you pipe the output of your tool into head, for example.

There are two "simple" solutions to this:

Either propagate all errors back up to main. Every CLI tool I write mostly does this automatically, because it is actually the path of least friction IMO. If the error is a pipe error, then quit gracefully, otherwise exit with an error code and show the error.
Reinstall the default signal handling for SIGPIPE, e.g., libc::signal(libc::SIGPIPE, libc::SIG_DFL). This will cause the process to terminate immediately upon receipt of a SIGPIPE. No actual signal handler is needed.

Either of these two solutions resolve the actual footgun here. Going beyond that to clean-up logic and all that is a totally separate issue. And yeah, I do think signal handlers deserve their own issue as well, because the state of those in Rust isn't great at the moment.

FWIW, I can't remember the motivation for changing the default signal handler for SIGPIPE. @alexcrichton Do you know?

alexcrichton commented 6 years ago

FWIW, I can't remember the motivation for changing the default signal handler for SIGPIPE. @alexcrichton Do you know?

This was initially done long long ago for a reason that may or may not still be relevant. I think that if you do a syscall and would otherwise get EPIPE a program will raise SIGPIPE first. For example if you write to a closed socket it'll by default kill the program. We weren't too happy about that (we'd rather socket writes returned an error on errors), so we changed the default signal handler.

Nowdays though for sockets at least we use things like MSG_NOSIGNAL and other techinques to work even when SIGPIPE is still raised by default (aka writing to a socket should always return an error). This may still affect things like writing to child pipes of processes, though.

This also mirrored what libuv did long long ago but it may no longer be true today

alexcrichton commented 6 years ago

For the pipe issue, this program only prints one error because of SIGPIPE rather than two

uazu commented 6 years ago

The topic here is error-handling in general. Perhaps it needs splitting out into separate issues. I've added #27 and #28 for signal handling and silently dropped errors. (Please someone close these if they're inappropriate, but it seems necessary to split things.)

What options are there for fixing println! to return errors, or is this completely impossible due to backwards compatibility? If all errors were caught and returned, then that seems much better than using a handler for SIGPIPE, because you get more information and it's local.

uazu commented 6 years ago

There seem to be various not-directly-reportable errors: SIGINT, SIGTERM, and FileDesc::drop() for a start. Maybe also println! failures if it's not possible to change the return value or common usage of println! and we want to avoid panics on EPIPE. A general mechanism to queue these errors internally within std and then query them from user code would help with all of these cases. I'm bringing this idea back here because #28 doesn't seem the right place for it.

rust-cli / team

Error handling in CLI apps #12