We clobber error codes far too often

ethomson commented 11 years ago

We use a pattern like this far too often:

if (git_func_call() < 0)
    return -1;

This ends up clobbering the actual error code returned. While playing with some of the http transport stuff, I recently attempted to fail from a user callback, expecting GIT_EUSER to get returned to the calling function. It did not, due to a chain of calls using the above pattern. I propose that we audit the code base for error handling pre-1.0. We should:

Ensure that we propagate error codes, replacing the aforementioned pattern with:
```
if ((error = git_func_call()) < 0)
   return error;
```
Ensure that we giterr_set() any place that we return a new error code (instead of propagating an existing one, above.)
Ensure that we giterr_clear() any place that we ignore errors. In a perfect world this would be unnecessary since we would always set a new one when we returned one. In practice, little mistakes can cause strange errors - when trying to fail in a custom auth callback, I got an error -1 returned and a message telling me that there was a configuration value not found.

Should we:

Should we geterr_set() when we return GIT_EUSER, just to have something in the buffer? Or more importantly, so that we don't have a lingering message from point 3 above?
Think about profiling our error handling? That probably sounds crazy but I'm a little curious about how non-error cases are being error-handled. Stick a breakpoint in giterr_set and see how many times we're sprintfing something like The configuration value '%s' was not found or File not found: %s from stat calls when running the tests. Obviously we have a config value cache which mitigates a lot of these sorts of things in production, but I wonder if anybody has already looked at this more in depth and / or thinks that we should.

ethomson commented 11 years ago

If we get consensus here, I will make a list of the files to audit fancy github-flavored checkbox style and hopefully I can get some help on this. ;)

ben commented 11 years ago

I'll help. :+1: From a pedantically SemVer standpoint, this doesn't have to happen before 1.0, but it should definitely happen.

arrbee commented 11 years ago

When we re-did the error handling infrastructure about 18 months ago, we tried to move away from item 2 - i.e. higher level functions should not overwrite the error messages from lower-level functions and should not re-invoke giterr_set. There are a small number of exceptions that we've allowed in, mostly in cases where it has simplified the error handling logic in the parent function (i.e. some code paths are reporting the error for the first time and some others may be overwriting a lower-level giterr_set), but I don't think it should become policy to do so.

That said, I'm all for items 1 and 3 (i.e. propagate error codes upward and clear the error if you ignore an error code).

For the sake of code cleanliness and the keep this project somewhat smaller, I'd like to make an exception for error codes returned by our low level data structures (i.e. git_buf, git_vector, git_array, khash, etc.) where the only possible errors are -1 for out-of-memory or in some few circumstances GIT_ENOTFOUND. I think we should allow you to say:

if (git_buf_printf(&buf, "foo %s", strvar) < 0)
    return -1;

and not have to worry about returning the exact error code. This is actually the policy that I follow in most of the code that I write - propagate errors from anything that deals with Git objects or any sort of user-configured system, and use simple code for anything that is purely dealing with simple internal data structures.

What do you think about that?

ethomson commented 11 years ago

Re item 2: I'm not suggesting that we giterr_set to overwrite, I'm suggesting that the originator of the error always calls giterr_set. There are places where we just return -1 without setting any error and I would like to avoid that. There are a lot of times where we end up with customers reporting errors like "libgit2 returned -1, no error message was set" or whatever that gets reported as.

Re git_bufs and the like, I tend to agree that going out of our way to preserve an error code that could only ever been an OOM is not real useful. I think it probably makes some sense to create a list of funcs that we are pretty sure will never return anything but -1 on an OOM failure and not sweat those too much.

arrbee commented 11 years ago

I agree with you regarding returning -1 to the user without setting an error. Any instance of that should be considered a bug. Do you know any places that still do that?

One exception to the "never return -1 without setting an error" is in the POSIX p_ portability wrappers. Those functions should not set an error because on some platforms they are just going to be macros wrapping the native version. The caller of p_ functions should always assume that they must check and report the error themselves, I think. There may be a couple places that break this rule right now, and I think we should fix them.

As for identifying functions that can be called without propagating the exact error, I'd rather avoid a single monolithic list. Since we use compiler options requiring prototypes for all functions, maybe we can just comment on the function prototypes re the possible returned error codes (with no comment == anything might be returned so you must propagate)?

ethomson commented 11 years ago

I think that there are a bunch of return -1 and not-set-error in the remote and transport code.

Comments make sense to identify this, indeed.

arrbee commented 11 years ago

:sob:

vmg commented 11 years ago

Man, error handling is hard.

ethomson commented 11 years ago

For what it's worth, I may be wrong and that code is doing a very good job. I started beating on it in the aforementioned PR, then stopped and opened this issue instead because I didn't want to creep the scope too much.

ethomson commented 11 years ago

Let me ask a probably contentious question: can we get rid of string errors entirely and just use the entire negative part of a 32 bit int to hold our errors codes? Similar to errno, windows error codes, GSSAPI error codes, etc...

That is to say, instead of:

if (p_unlink("file") < 0) {
    giterr_set(GITERR_SOMETHING, "Could not delete  'file'");
    return -1;
}

we do something like:

if (p_unlink("file") < 0)
    return GIT_ERR_COULD_NOT_DELETE_SOMETHING;

With a great many very expressive error codes.

Pluses:

No more string manipulation just to set an error code. It's especially nice that we don't have to sprintf into a buffer every time a config variable is not found.
No more TLS for error values.
Easily localizable

Minuses:

At the very top of the stack, it may be hard to isolate the problem. GIT_ERR_INDEX_LOCKED is pretty self-explanatory, but GIT_ERR_COULD_NOT_WRITE_FILE when you're doing a pull may not be.

arrbee commented 11 years ago

I guess the question is whether the specific path or URL or SHA or what-have-you is ever of value to us as developers in understanding what went wrong. I think it is often is. That said, there are three audiences for errors and we should talk about addressing the needs of both audiences:

End-users of applications that use libgit2 internally - these users don't want to know about libgit2 at all, for the most part, and just want to know what went wrong. The application will have to understand the context of what the user was trying to accomplish and provide a meaningful error for that context - libgit2 cannot possibly know that. The existing error messages from libgit2 are presumably of no value to this group. But this really leads to audience 2...
Developers of end-user applications looking to present an error to the end-user - this audience wants to get enough data from libgit2 to understand what went wrong and translate the error into something meaningful to the end-user. They are unlikely to use the text of an error message that we provide, save for extracting meaningful bits of data that there is no other way to get.
Developers trying to debug libgit2 usage - of course, this audience overlaps with group 2. This is the case where the actual text of the libgit2 error message is useful and the more details that it can give the better. For this use case, it would probably be nice if we could even provide the file and line where the error originated along with our explanatory text and relevant values of variables that produced the failure.

When a user encounters an error, the text of the libgit2 error is not very useful, but when a developer is diagnosing a user-reported bug in the tool that uses libgit2, that text is often useful (and could be made even more so).

ethomson commented 11 years ago

Unfortunately, we are basically always displaying the libgit2 error message, though. We try to string-match the error messages so that we can turn them into something useful (and localized), but this is a giant hack and our string matching is totally incomplete. So we often provide error messages like "Error occurred in libgit2: -1, some crazy internal message that nobody cares about."

I would argue that we should set discrete error codes as results and use git_trace that we're not currently using in the library for people trying to debug libgit2 usage. That allows free-form messages to go back to callers in a way that they have configured, and use error codes.

arrbee commented 11 years ago

So, I think the git_trace discussion is probably the one we should have.

I'm open to having error details come through two channels - a git_trace channel with detailed parameter data and a richer set of error codes. In cases where the string formatted error code only contains data that the immediate caller if guaranteed to know (e.g. you cannot call git_config_get without knowing the name of the key you are trying to get, so there is no point in sprintf'ing it into the message, at least for the immediate parent caller).

Unfortunately, if we just replace giterr_set(...) with git_trace__write_fmt(GIT_TRACE_ERROR, ...) we lose the "no more string manipulation" advantage, we complicate code that wants to ignore errors (because now would you have the higher-level code suppress the trace callback or what to indicate that it doesn't care about the error?), and we have introduced a new maintenance hassle with keeping an expanded registry of error scenarios across the code base.

Thinking about this some more, it seems like there are two orthogonal solutions at play here:

Add more error return codes!
- + gives more detail to caller without parsing error message
- + helps with localization
- + not dependent on TLS storage
- - has to be maintained and documented somewhere
- - still doesn't give granular detail about files/urls/refs/etc. that caused error
Replace giterr_set with something else!
- + callback can ignore TLS
- + might be possible to avoid string formatting
- + depending on design, might be possible to dynamically disable many error / trace messages
- - complex to provide sufficient context to actually avoid string formatting (e.g. how does error site know that error will be ignored)
- - still needs to operate in threaded environment, so some thread context will probably be needed

I think we could do the first without taking on the second. There is no requirement to call giterr_last and if we just increase the richness of the errors then we expand the number of circumstances where the actual text of the error message can be safely ignored.

Some data: I count just over 500 places in the code base where giterr_set is called currently and about 120 .c files. The file with the most calls to giterr_set is currently src/transports/winhttp.c with 47 separate calls (35 unique).

If we wanted to encode each giterr_set call as a unique error code, that would easily fit into the return code. We could also use some bits for the location and some bits for the error code itself as there is a significant amount of duplication, but it becomes trickier to manage the organization and maintenance of the data.

Alternatively, we could group the error codes into a hierarchy - e.g. -1 to -99 as basic error codes, -100 to -199 for various parse errors (expected int, bool, oid, etc, in context of config, attr, object, refspec, etc.), -200 to -299 for file system errors, -300 to -399 for network errors, and so on.

Anyhow, these are just ideas. If we go back to first principles, I think we're trying to solve for:

Ensure error codes are propagated back to the caller
Make it easier to provide a localized end-user relevant error message
Make is possible to ignore the thread-local data without losing too much detail
Reduce the cost of ignoring an error

1 seems indisputable. 2 is relevant to almost all users and definitely seems worth addressing. 3 is beneficial to some bindings, I know, and probably will depend on the application and audience. 4 can be addressed in a number of ways, either through configurable trace levels or just through simple optimization, and we should probably benchmark before pursuing it all too far.

jamill commented 11 years ago

We do not perform string matching on the error in determining whether or not show our own error message. Unfortunately we do show the error string from libgit2 quite often and this is a cause for some pain.

As mentioned by @arrbee above, there are times when the error string provided by libgit2 does present useful granular information (e.g. if there is a problem accessing a folder, the error message will indicate the path that is causing issues). This information seems like it would be useful to an end-user and would be nice to preserve in some way.

A consistent mechanism to translate error codes into accurate error messages that can be localized would be ideal (numbers 1 and 2 in the list of first principles).

arrbee commented 11 years ago

Doing a quick scan of the sources, I see the following entities being passed for formatting in error messages:

paths
branch / reference / submodule names
branch type (i.e. "local" vs "remote")
OIDs
various numeric values (e.g. invalid index of nth parent of commit, number of files with conflicts in an attempted checkout, line number at which a parse error was encountered)
config setting keys and values (as strings) when they fail to pass some validation

I think the most complex giterr_set call that found was in config_file.c that takes a message, a path, a line number, and a column.

Okay, so I'm not saying the following is a good idea, but I would like to walk through an idea to eliminate sprintf and replace error messages with error codes.

If we say that every error has a primary message which is captured in a table and a "subject" which contains a string, an OID, and a value (ssize_t maybe?). The interpretation of the subject would be dependent on the specific error. The primary error message could be externalized into a separate table. Error reporting APIs and constants could look like:

enum {
    GIT_OK = 0,
    GIT_ERROR = -1,
    ...

    GIT_ERR_BLOB = -100,
    GIT_ERR_BLOB_CANT_READ_FILE_FOR_STREAM = -101,
    GIT_ERR_BLOB_CANT_READ_SYMLINK = -102,

    GIT_ERR_BRANCH = -200,
    ...
};

/* The old "magic" behavior of GITERR_OS could be replaced by an explicit
 * boolean indicating if the error should also capture the current errno
 * setting or if the error is independent of the errno value
 */

extern int giterr_set(int error_code, bool use_errno);
extern int giterr_set_s(int error_code, bool use_errno, const char *strvalue);
extern int giterr_set_v(int error_code, bool use_errno, ssize_t numvalue);
extern int giterr_set_id(int error_code, bool use_errno, const git_oid *oid);
extern int giterr_set_details(
    int error_code, bool use_errno, const char *strvalue, ssize_t numvalue, const git_oid *oid);

typedef struct {
    int     error_code;
    int     errno;
    char   *extra_str;
    ssize_t extra_val;
    git_oid extra_oid;
} git_error;

The error codes would be separately mapped to strings via a simple translation file:

GIT_ERR_BLOB_CANT_READ_FILE_FOR_STREAM "Failed to read file into stream"
GIT_ERR_BLOB_CANT_READ_SYMLINK "Failed to create blob.  Can't read symlink"

The error code would be returned from the giterr_set call as well, so you could do things like:

    /* from src/blob.c */

    read_len = p_readlink(path, link_data, link_size);
    if (read_len != (ssize_t)link_size) {
        git__free(link_data);
        return giterr_set_s(GIT_ERR_BLOB_CANT_READ_SYMLINK, true, path);
    }

    /* from src/config.c */

    if (pos == -1)
        return giterr_set_v(
            GIT_ERR_CONFIG_INVALID_LEVEL, false, (ssize_t)level);

This last example points out one problem with this approach. The old code returned GIT_ENOTFOUND - there we just a small number of error values so when writing code to use the API it is fairly easy to say "only one of a small number of values will appear so I can easily write my code to handle those values." With this proposal, there is an explosion of values and so each call to a libgit2 API has to be written with special handling just for the various error return values that are unique to that particular API. I'm not sure how much this will improve usability.

carlosmn commented 11 years ago

This puts me in mind of a piece of code from Subversion that a friend once showed me, where they have about ten lines of comparing a return code against all the possible error codes which have a different name and value depending on whether APR set it or one of the modules that actually did the work; but all of which end up meaning the same. It probably wouldn't get that bad here as we don't have that many different components.

If we were to keep the error class separate instead of integrating it with the , we could achieve a similar level of specificity while making it easier to find out what kind of error we have and we'd have to deal with less values even if we do add a bunch more specific ones.

On the subject of replacing the TLS errors with something else, I'd like to propose having the error object as an output parameter of the function (I might have proposed this in the past). This is how I worked around it in libgit2/git2go#43 though in that case it unfortunately had to be with some macros. This would eliminate global storage, allow us to chain errors in debug mode (we could do this now in theory, but it would be more magical), eliminate reporting of left-over errors and highlight wherever we don't clear these errors we're ignoring. The allocation might be more expensive for the errors we end up ignoring, though.

ben commented 11 years ago

Error Categories?

What if the return code from public APIs was a broad error category? Each specific error code should map to only one category, so giterr_set_* could handle the mapping and produce the right return value. So:

if (pos == -1)
  return giterr_set_v(
    GIT_ERR_CONFIG_INVALID_LEVEL, false, (ssize_t)level);

…would return GIT_ENOTFOUND.

Rich Returns

@cmn, I'm assuming your suggestion would look something like this:

const git_error *err;
if (git_external_api(&err, foo, "bar")) {
  // Do something with the error
  git_error_free(err);
}

I sort of like this for all the reasons @cmn mentioned. It's better than just returning a pointer to an error structure that the caller has to free, because at least they can ask to not get it (by passing NULL), and the error setters could avoid allocating anything at all in that case. We can work around the allocation overhead by having a resizable pool of them, with free flags.

carlosmn commented 10 years ago

Keeping the mapping for the error categories could get cumbersome, but it would keep the general error codes which make it easier to decide.

I'd put the error as the last param, as it's something that's always there, but yeah, that's what I was thinking. The cool thing you can do with this is

git_error *err;
if (git_doit(repo, "yolo", &err) < 0) {
    return git_rethrow(err_out, err, "No yolo for you");
}

which in debug mode could keep a chain of errors but only keep the last one for release mode.

arrbee commented 10 years ago

So, I threw my earlier comments out as a thought experiment. I'm happy to see some other approaches.

To be honest, none of these appeal to me at all. They all seem to add complexity with some interesting but relatively mild benefits.

I think we should fix propagation errors and potentially add a new config API that doesn't set an error for a missing key since that is by far the most commonly ignored error.

We can also look at using a string localization / externalization library to extract messages. Particularly if that layer is pluggable, that could solve most of the need to capture & catalog error scenarios without adding undue complexity to development or library usage.

ethomson commented 10 years ago

I am in violent agreement with you here, @arrbee .

vmg commented 10 years ago

Sorry it took me so long to get to this thread! I was busy tagging releases. :p

I agree that most of the solutions

The only way I could get behind a major restructuring of the error handling APIs is if (and only if) we could get rid of TLS completely. All the presented solutions have some minor advantages, but they don't really get us rid of TLS, so I'm having a hard time backing those up.

@arrbee's point that config makes up for 99% of the cases where we set errors for no reason is very solid. I think we definitely need an internal set of config APIs for querying.

Regarding internationalization, giterr_ could be made a macro, and it would be trivial to localize the strings. Personally, I'm not too sure of the value of intl'd strings, but if people need it, we can implement it.

Either way, I still eagerly wait for solutions that would get us rid of TLS.

ethomson commented 10 years ago

To be pedantic, I think two of the solutions (return ints or giterrs, neither need TLS) do get rid of a thread local error: it's just that they also suck for other reasons (less specificity, freeing an error code).

carlosmn commented 10 years ago

Returning the highly-specific ints removes the need for TLS if we do away with the details of the error message (such as line number in config.c). If we want a message and no TLS, then we need to carry the error message up the stack manually.

IIRC some of the config functions didn't use to set an error message, but we added it because everything that returns an error is meant to set the error as well. We could make the backend not set the error (and wrap that with .e.g git_config_get_string_gently() and have git_config_get_string() wrap that and set the error message. Same for looking up objects in the DB, where we set the error for each object we can't find for a backend.

ethomson commented 10 years ago

To clarify one point: not setting error strings as aggressively in the config layer is not very high-pri for me. I haven't actually profiled this and don't any suggestion that it's particularly bad, I only noticed it because I set a breakpoint in giterr_set trying to figure out who was setting a totally unrelated error message and was annoyed by the number of non-error errors getting set and my continue-button-pressing finger got tired.

Obviously I'd rather not do it, but I can think of about 100 higher pri things for me.

carlosmn commented 10 years ago

It's probably not that bad in the larger scheme of things, but it is very annoying when braking on giterr_set() when none of the first 20 calls are actually errors, but just config values that don't exist or objects which are in the packfiles.

arrbee commented 10 years ago

I've started playing around with a branch that adds a "do not error on missing entry" config API for internal usage. I don't want to spend a lot of time on it, but I'd like to do a small benchmark of the number of calls to giterr_set doing a simple set of actions with and without the changes (like maybe testing with the "status" example on a typical workload). Mostly just a little experiment.

arrbee commented 10 years ago

I also started looking at our use of GIT_EUSER and how consistently we call giterr_clear() before returning to the user. That is fraught with problems - namely, there are places we use some of these functions internally and calling giterr_clear() is actually resulting in a loss of error data. As a result, some places are returning GIT_EUSER without being as careful about clearing the error prior to returning. I'm thinking about what will be a consistent behavior we can implement internally that will not be hard to code but will give the expects end-user results...

arrbee commented 10 years ago

Okay, so I started to address some of this in PR #1986. It's just a start, but I think it has some merit.

arrbee commented 10 years ago

I feel like we've made big strides in this area and continue to improve on reliable error propagation. Do you want to keep this issue open as a reminder to do a more thorough audit of how we're doing or should be close it?

arrbee commented 10 years ago

Closing this - feel free to reopen if there are specific actions to take next.

libgit2 / libgit2

We clobber error codes far too often #1959

Error Categories?

Rich Returns