Closed ethomson closed 10 years ago
If we get consensus here, I will make a list of the files to audit fancy github-flavored checkbox style and hopefully I can get some help on this. ;)
I'll help. :+1: From a pedantically SemVer standpoint, this doesn't have to happen before 1.0, but it should definitely happen.
When we re-did the error handling infrastructure about 18 months ago, we tried to move away from item 2 - i.e. higher level functions should not overwrite the error messages from lower-level functions and should not re-invoke giterr_set
. There are a small number of exceptions that we've allowed in, mostly in cases where it has simplified the error handling logic in the parent function (i.e. some code paths are reporting the error for the first time and some others may be overwriting a lower-level giterr_set
), but I don't think it should become policy to do so.
That said, I'm all for items 1 and 3 (i.e. propagate error codes upward and clear the error if you ignore an error code).
For the sake of code cleanliness and the keep this project somewhat smaller, I'd like to make an exception for error codes returned by our low level data structures (i.e. git_buf
, git_vector
, git_array
, khash
, etc.) where the only possible errors are -1
for out-of-memory or in some few circumstances GIT_ENOTFOUND
. I think we should allow you to say:
if (git_buf_printf(&buf, "foo %s", strvar) < 0)
return -1;
and not have to worry about returning the exact error code. This is actually the policy that I follow in most of the code that I write - propagate errors from anything that deals with Git objects or any sort of user-configured system, and use simple code for anything that is purely dealing with simple internal data structures.
What do you think about that?
Re item 2: I'm not suggesting that we giterr_set
to overwrite, I'm suggesting that the originator of the error always calls giterr_set
. There are places where we just return -1
without setting any error and I would like to avoid that. There are a lot of times where we end up with customers reporting errors like "libgit2 returned -1, no error message was set" or whatever that gets reported as.
Re git_buf
s and the like, I tend to agree that going out of our way to preserve an error code that could only ever been an OOM is not real useful. I think it probably makes some sense to create a list of funcs that we are pretty sure will never return anything but -1
on an OOM failure and not sweat those too much.
I agree with you regarding returning -1
to the user without setting an error. Any instance of that should be considered a bug. Do you know any places that still do that?
One exception to the "never return -1
without setting an error" is in the POSIX p_
portability wrappers. Those functions should not set an error because on some platforms they are just going to be macros wrapping the native version. The caller of p_
functions should always assume that they must check and report the error themselves, I think. There may be a couple places that break this rule right now, and I think we should fix them.
As for identifying functions that can be called without propagating the exact error, I'd rather avoid a single monolithic list. Since we use compiler options requiring prototypes for all functions, maybe we can just comment on the function prototypes re the possible returned error codes (with no comment == anything might be returned so you must propagate)?
I think that there are a bunch of return -1 and not-set-error in the remote and transport code.
Comments make sense to identify this, indeed.
:sob:
Man, error handling is hard.
For what it's worth, I may be wrong and that code is doing a very good job. I started beating on it in the aforementioned PR, then stopped and opened this issue instead because I didn't want to creep the scope too much.
Let me ask a probably contentious question: can we get rid of string errors entirely and just use the entire negative part of a 32 bit int to hold our errors codes? Similar to errno, windows error codes, GSSAPI error codes, etc...
That is to say, instead of:
if (p_unlink("file") < 0) {
giterr_set(GITERR_SOMETHING, "Could not delete 'file'");
return -1;
}
we do something like:
if (p_unlink("file") < 0)
return GIT_ERR_COULD_NOT_DELETE_SOMETHING;
With a great many very expressive error codes.
Pluses:
sprintf
into a buffer every time a config variable is not found.Minuses:
GIT_ERR_INDEX_LOCKED
is pretty self-explanatory, but GIT_ERR_COULD_NOT_WRITE_FILE
when you're doing a pull may not be.I guess the question is whether the specific path or URL or SHA or what-have-you is ever of value to us as developers in understanding what went wrong. I think it is often is. That said, there are three audiences for errors and we should talk about addressing the needs of both audiences:
When a user encounters an error, the text of the libgit2 error is not very useful, but when a developer is diagnosing a user-reported bug in the tool that uses libgit2, that text is often useful (and could be made even more so).
Unfortunately, we are basically always displaying the libgit2 error message, though. We try to string-match the error messages so that we can turn them into something useful (and localized), but this is a giant hack and our string matching is totally incomplete. So we often provide error messages like "Error occurred in libgit2: -1, some crazy internal message that nobody cares about."
I would argue that we should set discrete error codes as results and use git_trace
that we're not currently using in the library for people trying to debug libgit2 usage. That allows free-form messages to go back to callers in a way that they have configured, and use error codes.
So, I think the git_trace
discussion is probably the one we should have.
I'm open to having error details come through two channels - a git_trace
channel with detailed parameter data and a richer set of error codes. In cases where the string formatted error code only contains data that the immediate caller if guaranteed to know (e.g. you cannot call git_config_get
without knowing the name of the key you are trying to get, so there is no point in sprintf'ing it into the message, at least for the immediate parent caller).
Unfortunately, if we just replace giterr_set(...)
with git_trace__write_fmt(GIT_TRACE_ERROR, ...)
we lose the "no more string manipulation" advantage, we complicate code that wants to ignore errors (because now would you have the higher-level code suppress the trace callback or what to indicate that it doesn't care about the error?), and we have introduced a new maintenance hassle with keeping an expanded registry of error scenarios across the code base.
Thinking about this some more, it seems like there are two orthogonal solutions at play here:
+
gives more detail to caller without parsing error message+
helps with localization+
not dependent on TLS storage-
has to be maintained and documented somewhere-
still doesn't give granular detail about files/urls/refs/etc. that caused errorgiterr_set
with something else!
+
callback can ignore TLS+
might be possible to avoid string formatting+
depending on design, might be possible to dynamically disable many error / trace messages-
complex to provide sufficient context to actually avoid string formatting (e.g. how does error site know that error will be ignored)-
still needs to operate in threaded environment, so some thread context will probably be neededI think we could do the first without taking on the second. There is no requirement to call giterr_last
and if we just increase the richness of the errors then we expand the number of circumstances where the actual text of the error message can be safely ignored.
Some data: I count just over 500 places in the code base where giterr_set
is called currently and about 120 .c
files. The file with the most calls to giterr_set
is currently src/transports/winhttp.c
with 47 separate calls (35 unique).
If we wanted to encode each giterr_set
call as a unique error code, that would easily fit into the return code. We could also use some bits for the location and some bits for the error code itself as there is a significant amount of duplication, but it becomes trickier to manage the organization and maintenance of the data.
Alternatively, we could group the error codes into a hierarchy - e.g. -1 to -99 as basic error codes, -100 to -199 for various parse errors (expected int, bool, oid, etc, in context of config, attr, object, refspec, etc.), -200 to -299 for file system errors, -300 to -399 for network errors, and so on.
Anyhow, these are just ideas. If we go back to first principles, I think we're trying to solve for:
1 seems indisputable. 2 is relevant to almost all users and definitely seems worth addressing. 3 is beneficial to some bindings, I know, and probably will depend on the application and audience. 4 can be addressed in a number of ways, either through configurable trace levels or just through simple optimization, and we should probably benchmark before pursuing it all too far.
We do not perform string matching on the error in determining whether or not show our own error message. Unfortunately we do show the error string from libgit2 quite often and this is a cause for some pain.
As mentioned by @arrbee above, there are times when the error string provided by libgit2 does present useful granular information (e.g. if there is a problem accessing a folder, the error message will indicate the path that is causing issues). This information seems like it would be useful to an end-user and would be nice to preserve in some way.
A consistent mechanism to translate error codes into accurate error messages that can be localized would be ideal (numbers 1 and 2 in the list of first principles).
Doing a quick scan of the sources, I see the following entities being passed for formatting in error messages:
I think the most complex giterr_set
call that found was in config_file.c that takes a message, a path, a line number, and a column.
Okay, so I'm not saying the following is a good idea, but I would like to walk through an idea to eliminate sprintf
and replace error messages with error codes.
If we say that every error has a primary message which is captured in a table and a "subject" which contains a string, an OID, and a value (ssize_t
maybe?). The interpretation of the subject would be dependent on the specific error. The primary error message could be externalized into a separate table. Error reporting APIs and constants could look like:
enum {
GIT_OK = 0,
GIT_ERROR = -1,
...
GIT_ERR_BLOB = -100,
GIT_ERR_BLOB_CANT_READ_FILE_FOR_STREAM = -101,
GIT_ERR_BLOB_CANT_READ_SYMLINK = -102,
GIT_ERR_BRANCH = -200,
...
};
/* The old "magic" behavior of GITERR_OS could be replaced by an explicit
* boolean indicating if the error should also capture the current errno
* setting or if the error is independent of the errno value
*/
extern int giterr_set(int error_code, bool use_errno);
extern int giterr_set_s(int error_code, bool use_errno, const char *strvalue);
extern int giterr_set_v(int error_code, bool use_errno, ssize_t numvalue);
extern int giterr_set_id(int error_code, bool use_errno, const git_oid *oid);
extern int giterr_set_details(
int error_code, bool use_errno, const char *strvalue, ssize_t numvalue, const git_oid *oid);
typedef struct {
int error_code;
int errno;
char *extra_str;
ssize_t extra_val;
git_oid extra_oid;
} git_error;
The error codes would be separately mapped to strings via a simple translation file:
GIT_ERR_BLOB_CANT_READ_FILE_FOR_STREAM "Failed to read file into stream"
GIT_ERR_BLOB_CANT_READ_SYMLINK "Failed to create blob. Can't read symlink"
The error code would be returned from the giterr_set
call as well, so you could do things like:
/* from src/blob.c */
read_len = p_readlink(path, link_data, link_size);
if (read_len != (ssize_t)link_size) {
git__free(link_data);
return giterr_set_s(GIT_ERR_BLOB_CANT_READ_SYMLINK, true, path);
}
/* from src/config.c */
if (pos == -1)
return giterr_set_v(
GIT_ERR_CONFIG_INVALID_LEVEL, false, (ssize_t)level);
This last example points out one problem with this approach. The old code returned GIT_ENOTFOUND
- there we just a small number of error values so when writing code to use the API it is fairly easy to say "only one of a small number of values will appear so I can easily write my code to handle those values." With this proposal, there is an explosion of values and so each call to a libgit2 API has to be written with special handling just for the various error return values that are unique to that particular API. I'm not sure how much this will improve usability.
This puts me in mind of a piece of code from Subversion that a friend once showed me, where they have about ten lines of comparing a return code against all the possible error codes which have a different name and value depending on whether APR set it or one of the modules that actually did the work; but all of which end up meaning the same. It probably wouldn't get that bad here as we don't have that many different components.
If we were to keep the error class separate instead of integrating it with the , we could achieve a similar level of specificity while making it easier to find out what kind of error we have and we'd have to deal with less values even if we do add a bunch more specific ones.
On the subject of replacing the TLS errors with something else, I'd like to propose having the error object as an output parameter of the function (I might have proposed this in the past). This is how I worked around it in libgit2/git2go#43 though in that case it unfortunately had to be with some macros. This would eliminate global storage, allow us to chain errors in debug mode (we could do this now in theory, but it would be more magical), eliminate reporting of left-over errors and highlight wherever we don't clear these errors we're ignoring. The allocation might be more expensive for the errors we end up ignoring, though.
What if the return code from public APIs was a broad error category? Each specific error code should map to only one category, so giterr_set_*
could handle the mapping and produce the right return value. So:
if (pos == -1)
return giterr_set_v(
GIT_ERR_CONFIG_INVALID_LEVEL, false, (ssize_t)level);
…would return GIT_ENOTFOUND
.
@cmn, I'm assuming your suggestion would look something like this:
const git_error *err;
if (git_external_api(&err, foo, "bar")) {
// Do something with the error
git_error_free(err);
}
I sort of like this for all the reasons @cmn mentioned. It's better than just returning a pointer to an error structure that the caller has to free, because at least they can ask to not get it (by passing NULL), and the error setters could avoid allocating anything at all in that case. We can work around the allocation overhead by having a resizable pool of them, with free
flags.
Keeping the mapping for the error categories could get cumbersome, but it would keep the general error codes which make it easier to decide.
I'd put the error as the last param, as it's something that's always there, but yeah, that's what I was thinking. The cool thing you can do with this is
git_error *err;
if (git_doit(repo, "yolo", &err) < 0) {
return git_rethrow(err_out, err, "No yolo for you");
}
which in debug mode could keep a chain of errors but only keep the last one for release mode.
So, I threw my earlier comments out as a thought experiment. I'm happy to see some other approaches.
To be honest, none of these appeal to me at all. They all seem to add complexity with some interesting but relatively mild benefits.
I think we should fix propagation errors and potentially add a new config API that doesn't set an error for a missing key since that is by far the most commonly ignored error.
We can also look at using a string localization / externalization library to extract messages. Particularly if that layer is pluggable, that could solve most of the need to capture & catalog error scenarios without adding undue complexity to development or library usage.
I am in violent agreement with you here, @arrbee .
Sorry it took me so long to get to this thread! I was busy tagging releases. :p
I agree that most of the solutions
The only way I could get behind a major restructuring of the error handling APIs is if (and only if) we could get rid of TLS completely. All the presented solutions have some minor advantages, but they don't really get us rid of TLS, so I'm having a hard time backing those up.
@arrbee's point that config makes up for 99% of the cases where we set errors for no reason is very solid. I think we definitely need an internal set of config APIs for querying.
Regarding internationalization, giterr_
could be made a macro, and it would be trivial to localize the strings. Personally, I'm not too sure of the value of intl'd strings, but if people need it, we can implement it.
Either way, I still eagerly wait for solutions that would get us rid of TLS.
To be pedantic, I think two of the solutions (return int
s or giterr
s, neither need TLS) do get rid of a thread local error: it's just that they also suck for other reasons (less specificity, freeing an error code).
Returning the highly-specific int
s removes the need for TLS if we do away with the details of the error message (such as line number in config.c
). If we want a message and no TLS, then we need to carry the error message up the stack manually.
IIRC some of the config functions didn't use to set an error message, but we added it because everything that returns an error is meant to set the error as well. We could make the backend not set the error (and wrap that with .e.g git_config_get_string_gently()
and have git_config_get_string()
wrap that and set the error message. Same for looking up objects in the DB, where we set the error for each object we can't find for a backend.
To clarify one point: not setting error strings as aggressively in the config layer is not very high-pri for me. I haven't actually profiled this and don't any suggestion that it's particularly bad, I only noticed it because I set a breakpoint in giterr_set
trying to figure out who was setting a totally unrelated error message and was annoyed by the number of non-error errors getting set and my continue-button-pressing finger got tired.
Obviously I'd rather not do it, but I can think of about 100 higher pri things for me.
It's probably not that bad in the larger scheme of things, but it is very annoying when braking on giterr_set()
when none of the first 20 calls are actually errors, but just config values that don't exist or objects which are in the packfiles.
I've started playing around with a branch that adds a "do not error on missing entry" config API for internal usage. I don't want to spend a lot of time on it, but I'd like to do a small benchmark of the number of calls to giterr_set doing a simple set of actions with and without the changes (like maybe testing with the "status" example on a typical workload). Mostly just a little experiment.
I also started looking at our use of GIT_EUSER and how consistently we call giterr_clear()
before returning to the user. That is fraught with problems - namely, there are places we use some of these functions internally and calling giterr_clear()
is actually resulting in a loss of error data. As a result, some places are returning GIT_EUSER without being as careful about clearing the error prior to returning. I'm thinking about what will be a consistent behavior we can implement internally that will not be hard to code but will give the expects end-user results...
Okay, so I started to address some of this in PR #1986. It's just a start, but I think it has some merit.
I feel like we've made big strides in this area and continue to improve on reliable error propagation. Do you want to keep this issue open as a reminder to do a more thorough audit of how we're doing or should be close it?
Closing this - feel free to reopen if there are specific actions to take next.
We use a pattern like this far too often:
This ends up clobbering the actual error code returned. While playing with some of the http transport stuff, I recently attempted to fail from a user callback, expecting
GIT_EUSER
to get returned to the calling function. It did not, due to a chain of calls using the above pattern. I propose that we audit the code base for error handling pre-1.0. We should:Ensure that we propagate error codes, replacing the aforementioned pattern with:
giterr_set()
any place that we return a new error code (instead of propagating an existing one, above.)giterr_clear()
any place that we ignore errors. In a perfect world this would be unnecessary since we would always set a new one when we returned one. In practice, little mistakes can cause strange errors - when trying to fail in a custom auth callback, I got an error-1
returned and a message telling me that there was a configuration value not found.Should we:
geterr_set()
when we returnGIT_EUSER
, just to have something in the buffer? Or more importantly, so that we don't have a lingering message from point 3 above?giterr_set
and see how many times we'resprintf
ing something likeThe configuration value '%s' was not found
orFile not found: %s
fromstat
calls when running the tests. Obviously we have a config value cache which mitigates a lot of these sorts of things in production, but I wonder if anybody has already looked at this more in depth and / or thinks that we should.