ianlancetaylor opened 11 months ago
The general idea of this resonates with me, but it feels strange to me that this would be a runtime-changeable setting rather than a compile-time-changeable setting. In my application it would mean that setting this environment variable would expose the detail of which end-user-facing error messages are being derived from error.Error()
results and which are hand-written specifically for our application.
I tend to bristle a bit each time the Go runtime gains another way to globally change some runtime/library behavior out from under my application, since that adds yet another dimension of variation to consider when someone reports some unexpected behavior, but I suppose I am more sensitive to that than most Go application developers given that I use Go to develop CLI tools for which it should be irrelevant to the user what language they are written in. Probably it's more common to use Go to write server processes that are designed to run only in a particular execution environment and where everybody knows well that the program is written in Go.
This is honestly a broader piece of feedback than just on this proposal, and so I should probably record it somewhere else :thinking: but I would enjoy some way to compile an application such that it does not respond to GODEBUG
feature-switching at all, but have a way to explicitly turn on some of these features programmatically based on mechanisms that make more sense for my application, such as the application's configuration file. Then I would expose only the ones I know about and can see a reason for a user to turn on, and exclude any that have questionable value to my application.
Since I've already contributed a large number of words to this issue across two comments I'm going to extend this comment to include a response to Ian's later comment about it being an advantage to extend the error string, even though that makes this response appear out of order, just because it's an extension of my existing point above.
I consider it to be a disadvantage that it would affect the behavior of existing programs that have not been modified to request or expect stack traces, since some error messages are written for users and turning this on would significantly change those error messages.
Even though it's opt-in for a user, there are some applications (such as my own) where the user is not the person that ought to be making this decision, or if it is the user making the decision then it would be better to make it in a way that is more natural for the application.
Note that the real reason I want this is to report errors to some error reporting services like Sentry. Thus the stack trace must be retrievable in some programmatic way: either via a public function in `errors`, or via unwrapping to an exported type, i.e. not string parsing.
This type of functionality is common for long-running services where we want to report exceptions / unexpected errors. As a single data point, here's how common it is in our monorepo...
```
$ ag -i '(fmt.Errorf|errors.New|errors.Wrap)' pkg | wc -l
10573
```
Once we encounter an unexpected error, we have both grpc/http middlewares auto-magically extracting the stack trace and submitting it to sentry. As a success criteria of this change, the previous line count should be drastically reduced. In other words, most callsites should be changed from
```go
if err := myFn(...); err != nil {
    return errors.Wrap(err, "myFn")
}
```
To simply
```go
if err := myFn(...); err != nil {
    return err
}
```
The above comment from @fsaintjacques makes me consider a different design, which also happens to be a solution to my earlier feedback. :smile:
Building on the design approach for `errors.Is` and `errors.As`, with special optional methods that error implementations could implement, add `errors.StackTrace(err error) *Stacktrace` which:

- checks whether `err` has a method `ErrorStackTrace() *Stacktrace`; if not, immediately returns nil
- otherwise, returns `err.ErrorStackTrace()`
The default situation is that no errors have backtraces.
When GODEBUG=errstacktrace
is enabled, the errors.New
and fmt.Errorf
functions include a stack trace in the generated error value. The types returned by both implement the ErrorStackTrace
method as returning the saved stack trace, if present. Whether they do that by always returning a type containing a *Stacktrace
field that will often be nil
or by returning a different type depending on whether the stack trace functionality is enabled is an implementation detail up to the implementer, as far as I can tell.
The Error()
function for each of these types does not change at all: it just returns the specified message, as before.
If nothing in the program ever passes an error to errors.StackTrace
then the stack trace collection is just some wasted time collecting a stack trace nobody will refer to. But a program that does want to capture a stack trace, for example to send it to Sentry as described above, can try to pass the error to errors.StackTrace
and make use of the result whenever it's non-nil.
In the above I've assumed there's a material cost to capturing a stack trace and thus it would be too expensive to enable by default, but I've not actually measured that. If capturing a stack trace is relatively cheap then it would be nicer to not require the extra environment variable at runtime, but I don't know exactly how the Go runtime tracks stack traces internally.
As the author of gitlab.com/tozd/go/errors, a stack-trace-attaching errors package, I think having opt-in stack trace recording for standard errors would be great. But the current proposal has some issues.

I do not think it is reasonable or even workable to attach the stack trace to the Error string: how does `fmt.Errorf("%w: extra", err)` behave when `err` already contains a stack trace? It attaches `extra` to the end of the stack trace, while without `GODEBUG` it attaches it to the end of the Error line.

I think instead:

- a `StackTrace() []uintptr` interface method which returns `nil` without the `GODEBUG` flag, and what `callers()` provides when `GODEBUG` is set
- `[]uintptr`, to not have another type defined
- formatting with `%+v`, which is kinda standard for this these days

How exactly recording of stack traces is enabled, using `GODEBUG` or some other mechanism, I will leave to others to discuss. But I feel strongly that my proposal above nicely combines with existing approaches while not breaking anything.
The advantage of adding the stack trace to the Error
string is that it works in a program that has not already been designed to work with error stack traces. If you have to change the code to print or otherwise handle the stack trace, then it seems to me that you may as well use an existing package that manages stack traces for you.
That said, I agree that it makes sense to add a way to retrieve just the stack trace from the error, if there is one.
I also agree that we should arrange that using %v
or %w
in a fmt.Errorf
doesn't blindly copy the stack trace string. We should fetch the error string without the stack trace, format it, and then add in the stack trace. Right now I think it should just preserve the original stack trace, and not add a new one. Perhaps some more clever option is available.
I also agree with @mitar that the ideal signature is []uintptr
, that's what we do internally.
> In the above I've assumed there's a material cost to capturing a stack trace and thus it would be too expensive to enable by default, but I've not actually measured that. If capturing a stack trace is relatively cheap then it would be nicer to not require the extra environment variable at runtime, but I don't know exactly how the Go runtime tracks stack traces internally.
I would expect that this cost is paid in the (hopefully) exceptional case; very rarely will you see an error frequently returned in a hot loop.
@ianlancetaylor OK, what about this then: we do not change the `Error` string at all; we only unconditionally add the stack trace when the error is formatted, when the flag is enabled (with whichever verb)? That might be the right trade-off, because otherwise I can expect some tests might fail if they call `err.Error()` and expect it to have some structure (yes, bad programming, you should use `errors.Is`, etc., but still). But if you are doing `fmt.Printf("error: %s", err)` it is probably safer to include a stack trace?
This proposal is to add stack traces to errors.
The Go 1.13 errors proposal was to capture a single stack frame, and allow error wrapping to identify the trace of locations producing the error. This approach reduces the size of error values and can represent a trace that passes across goroutine boundaries, but produces less information in the simplest case.
I think any proposal to add stack information to errors needs to consider full stacks vs. frames, and present an argument for which approach is used.
@mitar Let me restate your suggestion. We keep the Error
method as is, but change the fmt package (other than fmt.Errorf
) so that in any case where we call the Error
method, we also check for a `Stacktrace` method and, if it is present, call it and also print the stack trace. We change `fmt.Errorf` so that if it wraps an error it does something intelligent with any stack trace. Does that sound right?
@neild This proposal suggests an optional feature that we expect to only be enabled while debugging. As such, I am thinking it should capture a complete stack trace.
@ianlancetaylor: Sounds good. And for the "so that if it wraps an error it does something intelligent with any stack trace" bit, from my experience: if it wraps more than one error (or no errors), you should record a new stack trace; if it is just one error, you should inherit the stack trace. But good luck reasonably dealing with stack traces when joining more than one error. :-) (In my package, I had to resort to indentation when formatting such errors, to visually show which stack traces belong where, since more than one stack trace is involved then.)
This proposal seems to have a gap where sentinel errors come into the picture. That is, things like:
```go
var ErrFoo = errors.New("blah blah blah")

func Bar() error {
    return ErrFoo
}
```
For one, that call to errors.New
probably cannot attach a stack trace, at least not in any meaningful sense. But also, I kind of want the return
to implicitly attach a stack trace, which is a thorny matter in legacy cases where people are using ==
instead of errors.Is
.
It seems like under this proposal, I'd have to instead do:
```go
var ErrFoo = errors.New("blah blah blah")

func Bar() error {
    return fmt.Errorf("%w", ErrFoo)
}
```
which seems kind of awkward. (Yes I probably want to do that anyway to force the use of errors.Is
but still.)
@rittneje I had this specific gotcha in mind as something that would be required, but I failed (I didn't really put in the effort) to see how it could be easily avoided. And to be clear, our internal library does something like this:
```go
pcs := ...
if strings.HasSuffix(runtime.FuncForPC(pcs[0]).Name(), ".init") {
    // Don't wrap when using sentinel errors defined in global vars
    return err
}
return wrapWithStackTrace(err, pcs)
```
which means that any sentinel defined errors don't have frames attached.
Couldn't error stack traces be somehow implemented by the compiler itself? I would imagine having a build tag of some sort (or GOEXPERIMENT?) that makes the error interface more 'heavyweight', i.e. it would have more fields internally used for stack traces. Then the stack trace would be available through some built-in or a function in the errors package.
And then each time a function returns a new error (detected as a conversion between concrete type to an error interface, or returning a global variable), the compiler would populate the internal error fields with the stack trace.
It would be nice if this would also somehow detect wrapping errors. Not sure how.
I don't know whether this is possible in Go, but Zig has Error Return Traces; maybe Go can grab some ideas from Zig?
Errors in Go are rarely exceptional; mostly they are values. It would be overkill to add a trace to all errors, like io.EOF, sql.ErrNoRows, custom errors/values... That's why I like the idea of `%w`, where we decide at each step in the code itself whether we need to keep something or not, and not globally.
Hi. To sum up what I understood the objectives to be from the conversations:
(hopefully I got it more or less right :sweat_smile:)
My opinions:
Possible solutions:
As I experimented here https://github.com/golang-exp/errors-poc, configuration can happen at runtime with an `ErrorLevel` akin to a `LogLevel`, using methods similar to `log/slog`.

- `var String = func(err error) string`, and this can be picked up by fmt, logger, ... when handling an error (??)
- a `log/slog`-style setup would allow a simple line or two in the main function to have the same outcome as the `GODEBUG=errorstacktrace=1` environment variable, but with this configuration being dynamic (see here)
- `var ErrFoo = errors.New("blah blah blah"); func Bar() error { return fmt.NewErrorFormatter(LEVEL_NO_STACK).Errorf("%w", ErrFoo) }`
Opaque: I think I have seen the `Opaque` function being implemented in the future `errors` (or maybe I have seen it somewhere else); that's nice.

Pop: It would be nice to have a `Pop` function extracting the last error struct{}, whatever that may be (string, programCounter, ..).
ErrorLevelVar with offsets: moreover, different `errorer` or `errorformatter` instances could be initialized with an offset; this is to allow a single `ErrorLevelVar` to configure the defaults as well as other instances.

example:

- an `errorer` with the default error level (i.e. LEVEL_NO_STACK = 0), thus behaving as current errors
- an `errorer` instance with an offset of +1 from the default, thus generating stack traces when `ErrorLevelVar` is 0 or more
- an `errorer` instance with an offset of +2 from the default, thus generating stack traces when `ErrorLevelVar` is -1 or more

PS:
> This proposal seems to have a gap where sentinel errors come into the picture.
Yea. In gitlab.com/tozd/go/errors I had to introduce errors.Base
for sentinel errors (without a stack trace) vs. errors.New
(which get a stack trace) for this exact reason. I could not find a better solution.
Maybe it could differentiate between top-level (or init) call to errors.New
and deeper call? Would anyone instantiate sentinel errors elsewhere?
> And then each time a function returns a new error (detected as a conversion between concrete type to an error interface, or returning a global variable), the compiler would populate the internal error fields with the stack trace.
That is an interesting approach which otherwise cannot be done. Maybe something which makes sense with the expensive GODEBUG
flag. I must admit that this issue made me hope that I could just run GODEBUG
in production and get stack traces everywhere. But one of the initial motivations was that you use this only at exceptional cases, when you are debugging. Maybe then it is OK to have an overhead of attaching stack traces at every return if it does not already have it?
I think it has already been said, but providing a machine-readable format (like JSON) from `[]uintptr` would be much more beneficial and could allow better automation for observability, incident response...
I think 3rd party libraries can provide tooling around `[]uintptr`: both stack trace formatting, JSON marshaling, etc. I do not think we have to bake this into stdlib. But you are right that standard errors (if they contain stack traces for fmt formatting) could also embed stack traces when they are JSON marshaled. Not everyone formats and prints it out; some might send it as JSON. But the question then is: should this feature be used in production? But yeah, having a standard JSON structure for stack traces could make things interoperable.
I want to emphasize that the goal of this proposal is to permit developers to easily enable a way for errors to include stack traces. In particular, developers can ask users to do this when running on their own systems with their own data. Ideas that rely on rebuilding the program should be a different proposal. Ideas that rely on using a different set of calls, or a different set of packages, to get a stack trace should be a different proposal. Thanks.
Changing the return value of Error()
will break anything doing manual string matching. It's bad and you shouldn't do that, but some programs do it anyway, especially when you're introspecting a third party error you don't control. Adding a new method seems much less likely to create unexpected breakages.
I like @mateusz834's suggested approach with Zig Error Return Traces.
From a developer's perspective, I'd prefer to have the option to switch to errors with a full stack trace, even if it comes with a slight performance trade-off, as it enhances code readability. Let's consider the example of the CopyFile function. In doing so, we can observe that adding a stack trace to just one function would require developers to use `fmt.Errorf` five times. If I have to implement a full stack trace in my entire application, it would necessitate modifications in thousands of places where I'd need to wrap err with `fmt.Errorf`.
```go
func Copy(dst, src string) error {
    if dst == src {
        return fmt.Errorf("%w", ErrInvalid)
    }
    srcF, err := Open(src)
    if err != nil {
        return fmt.Errorf("%w", err)
    }
    defer srcF.Close()
    info, err := srcF.Stat()
    if err != nil {
        return fmt.Errorf("%w", err)
    }
    dstF, err := OpenFile(dst, O_RDWR|O_CREATE|O_TRUNC, info.Mode())
    if err != nil {
        return fmt.Errorf("%w", err)
    }
    defer dstF.Close()
    if _, err := io.Copy(dstF, srcF); err != nil {
        return fmt.Errorf("%w", err)
    }
    return nil
}
```
@krhubert
```go
func Copy(dst, src string) (rerr error) {
    defer func() {
        if rerr != nil {
            rerr = fmt.Errorf("%w", rerr)
        }
    }()
    if dst == src {
        return ErrInvalid
    }
    srcF, err := Open(src)
    if err != nil {
        return err
    }
    defer srcF.Close()
    info, err := srcF.Stat()
    if err != nil {
        return err
    }
    dstF, err := OpenFile(dst, O_RDWR|O_CREATE|O_TRUNC, info.Mode())
    if err != nil {
        return err
    }
    defer dstF.Close()
    if _, err := io.Copy(dstF, srcF); err != nil {
        return err
    }
    return nil
}
```
but would be nice to have this automatically added.
UPDATE: this would not capture return location
One side idea. Currently, when you call panic(err)
, that calls err.Error() when formatting an unrecovered panic. But some of those errors do carry additional data (e.g., stack traces). @ianlancetaylor Do you think there is some room to do some improvements here? E.g., maybe there could be optional PanicError() string
method on errors so that they could control what is printed then? Probably doing whole fmt.Sprintf("%+v", err)
is something to avoid? Or could panic handler check if error implements fmt.Formatter
interface and only then call fmt.Sprintf
?
I am writing this here because I was first thinking that fmt.Sprintf
could be used when GODEBUG=errstacktrace
is enabled, but now I am thinking that it should probably be something always available. So PanicError() string
seems to me the best approach. Should I open a separate proposal for that?
Edit: I made the full proposal #63455.
I want to echo @carlmjohnson's point about the risk with changing the return of Error()
even on an opt-in basis.
Yes, matching on Error()
is bad but it exists in codebases—usually in tests, but sometimes in business logic.
There tends to be a mix of `strings.Contains(err.Error(), ...)`, which won't be affected by this change, and `err.Error() == ...`, which will break.
A search across GitHub for instances of exact error string matches outside test code:
https://github.com/search?q=path%3A.go+%22err.Error%28%29+%3D%3D%22+-path%3A*_test.go&type=code
Some popular Go projects from that search: chaosmonkey, GitLab, nomad, istio. (My intent with those examples is not to call those projects out. I'm trying to highlight that this is a thing that happens even in well-established Go projects.)
Re: Zig Error Return Traces, which tracks the code path that an error took to get to the user.
Stacktraces are useful in figuring out problems with how an unexpected error happened, but Error Return Traces are useful in figuring out problems in how an expected error was handled (flow control) and surfaced to the end user.
Stack traces have a large initial cost, while error return traces scale with each frame through which an error is returned.
There’s a library for Error Return Traces in Go: https://github.com/bracesdev/errtrace
It seems to me that an ideal hybrid solution would have non-sentinel New()
errors attach a StackTrace, but any subsequent error wrapping would add a single frame reference for a Return Error Trace.
Similarly, initial wrapping of a sentinel error would attach a StackTrace, but subsequent wrapping would add a single frame reference for a Return Error Trace.
@StevenACoffman was #60873 from this same idea?
We should find a way to distinguish errors that are pure values, like EOF, which don't need any frame attached, from errors that need to be analysed with the actual place where they were returned.
I also believe that it's where we return an error that we know whether we need to attach/annotate this error with the frame and nothing more (not the whole stack). We can already do this: `return fmt.Errorf("I'm in file xxx.go line:5 and return this error: %v", err)`.
@flibustenet @StevenACoffman
I agree, the full stack trace is really useful only with the first error in the stack. And one may not always want to include a frame at all.
On another note: I do not think the frame should ever be in the error message (except for ugly patches :)). What if, for example, I want to share the error with an external service? To obfuscate the frame I would need to manipulate the string (not very nice).
Also, good compatibility with previous versions is a big strength of Go, therefore the error string should remain unchanged to avoid a lot of refactoring in a lot of projects, and other reasons I guess (this was already mentioned somewhere in this thread I think).
If anything, changing the error message could be an optional thing, to avoid said refactoring. I don't see much value in this. I think it should be the logging library that outputs the information I want/need (i.e. frame, ..).
Personally my favourite decisions would be:
My favourite questions would be:
Adding an environment variable that enables error traces would eventually lead to it becoming the default, without the environment variable.
`GODEBUG=errstacktrace=1`
Error objects are not free; they cause memory allocations, and with tracebacks even more. Also, the useful depth depends on the use case: some are OK with 1 or 2, I am OK with 3 to 7 in my web applications, and some Java programmers might want 200. With variable depth it cannot be allocated on the stack.
It might be better to create another package in the standard library, like `stackerror` (stackerror.New, stackerror.Wrap, etc...).
And the current `fmt.Errorf` behaviour should not change, since `fmt.Errorf("something: %w", stackerror.Wrap(err))` is enough.
I propose serrors
, like there is log
and slog
, we could have errors
and serrors
. :-)
Errors in Go serve two main purposes: reporting problems to the user of the program, and reporting problems to the developer of the program. The user essentially never wants to see stack traces. For a developer, they can be helpful. Thus it is reasonable for developers to use packages like https://pkg.go.dev/github.com/go-errors/errors to collect developer errors--errors that the user is not expected to see.
That said, while developing a program, there are times when it is helpful for a user error to include a stack trace. For example, this is useful when some program bug is causing an unexpected user error. Getting a stack trace may help speed up debugging. This can also be useful when a user reports an inexplicable user error that is repeatable with proprietary user data; here again a stack trace may help the developer debug the problem.
Therefore, I propose a new `GODEBUG` setting: `GODEBUG=errstacktrace=1`. When this `GODEBUG` entry is set in the environment, the `errors.New` and `fmt.Errorf` functions will change to incorporate a stack trace into the message. This stack trace will be part of the string returned by the `Error` method (which will now be a lengthy, multi-line string). This will permit developers debugging an obscure but repeatable problem to set an environment variable to request a stack trace for the error that they are seeing.
We could also consider providing a simple mechanism for other error generation packages to check whether this `GODEBUG` setting is turned on.
This was inspired by a comment by @fsaintjacques: https://github.com/golang/go/issues/60873#issuecomment-1745646649.