Tracing error returns in Go

In https://github.com/sql-machine-learning/sqlflow/issues/2169, @Yancey1989 summarized three categories of error messaging challenges in SQLFlow. This issue focus on the first category.

Currently, we do error messaging in our Go codebase in the following style.

https://play.golang.org/p/7FsNjJBIsUm

It logs/returns an error message as the following

2009/11/10 23:00:00 c called b: b called a: Here happened an error

A slight adjustment looks like the following

https://play.golang.org/p/ZJLq9kGPFaj

The error message consists of multiple lines

2009/11/10 23:00:00 Here happened an error
    /tmp/sandbox203385592/prog.go:18 main.a
    /tmp/sandbox203385592/prog.go:23 main.b
    /tmp/sandbox203385592/prog.go:28 main.c

where the first line is the error itself.

We can log the full text for diagnosis, and we can present the first line to the user.

The proposed improvement in the above comment only records the error but doesn't provide solutions/suggestions. However, what users need are solutions/suggestions.

The invocation of function a is the closest to the happening of the error. Sometimes, the author of function a has the best idea about the suggestions.

The invocation of function c is at the highest abstraction level. Sometimes, the author of function c has the best suggestions.

So, it seems that we need the following kind of message:

2009/11/10 23:00:00 Here happened an error
    /tmp/sandbox203385592/prog.go:18 main.a  You might want to do this.
    /tmp/sandbox203385592/prog.go:23 main.b  You might want to do that.
    /tmp/sandbox203385592/prog.go:28 main.c  You might want to try ...

The proposed improvement in the above comment only records the error but doesn't provide solutions/suggestions. However, what users need are solutions/suggestions.

The invocation of function a is the closest to the happening of the error. Sometimes, the author of function a has the best idea about the suggestions.

The invocation of function c is at the highest abstraction level. Sometimes, the author of function c has the best suggestions.

So, it seems that we need the following kind of message:
2009/11/10 23:00:00 Here happened an error
  /tmp/sandbox203385592/prog.go:18 main.a  You might want to do this.
  /tmp/sandbox203385592/prog.go:23 main.b  You might want to do that.
  /tmp/sandbox203385592/prog.go:28 main.c  You might want to try ...

I think we must find a way to summarize a, b, c to a single diagnostics message. Or it's too distracting to users.

For an example error caused by traning table is empty:

SELECT * FROM iris.train1 TO TRAIN DNNClassifier WITH model.
n_classes = 3 LABEL class INTO sqlflow_models.my_model;

SQLFlow would raise an error:

workflow step failed: runSQLProgram error: fillFieldDesc: empty dataset

The full backtrace of the raised error with would be like:

step.go: workflow step run failed
executor_ir.go: run SQLProgram failed
ir_generator.go:
derivation.go:  infer feature columns failed on `SELECT * FROM iris.train1`
derivation.go: fillFieldDesc: empty dataset

But it's also hardly to understand for users.

Panic-Driven Solution

The above backtrace is confusion for users, talked with @shendiaomo , try to use panic-driven error handling to make the error message clearly.

https://play.golang.com/p/CigZal4Fc7e

func c() error {
    return fmt.Errorf("dataset is empty")
}
func f() error {
    if e := c(); e != nil {
        panic("infer feature columns failed on SELECT ... because dataset is empty, please check the select result")
    }
    return nil
}

panic: infer feature columns failed on SELECT ... because dataset is empty, please check the select result

goroutine 1 [running]:
main.f(...)
    /tmp/sandbox249911169/prog.go:12
main.main()
    /tmp/sandbox249911169/prog.go:17 +0x75

we can pipe the panic message to GUI and logs the panic stack trace in the step logs.

Using ErrType to Help to Diagnose

Let's issue a complex case, g() can raise two error types, how we can do the diagnose:

https://play.golang.com/p/zM6n2gdEwEW

func f() error {
    if e := g(); e != nil {
        switch e.(type) {
        case CError:
            panic("c failed, please try ...")
        case DError:
            panic("d failed, please try...")
        }
    }
    fmt.Println("hello")
    return nil
}

That we can use do the diagnoses according to the different error types.

A more Go style way to bubble up errors and return the stack trace: https://play.golang.com/p/64EaurB2mdb

package main

import (
    "fmt"
    "runtime/debug"
)

type SFError struct {
    stackString string
    msg         string
}

func (e *SFError) Print() {
    fmt.Println(e.stackString)
    fmt.Println("===============")
    fmt.Println(e.msg)
}

func MyError(msg string) *SFError {
    return &SFError{
        stackString: string(debug.Stack()),
        msg:         msg,
    }
}

func level2() *SFError {
    return MyError("error in level2, suggestion: do something else")

}

func level1(test bool) *SFError {
    // do something at level 1 may cause other errors
    if test {
        return MyError("error in level1, suggestion: do something")
    }
    err := level2()
    if err != nil {
        // level1 have better suggestion than level2, but the stack trace is not updated.
        err.msg = err.msg + ", you may also try something else"
        return err

    }
    // or level1 can just return the error
    return err
}

func main() {
    e := level1(false)
    e.Print()

}

In this example, function level1 calls level2, and they both return errors. As mentioned by @wangkuiyi, level1 may know better suggestions than level2, yet there are still cases that level1 can generate a more useful suggestion, so when level1 got the error, it can choose either to return it or add/modify suggestions from level2.

A real case for your information:

https://github.com/sql-machine-learning/sqlflow/blob/3cabfcfdad0b82fa3a3b93aa0552c0c0813280f4/pkg/sql/executor_ir.go#L105-L124

runSQLProgram calls runSingleSQLFlowStatement, then calls generateTrainStmt. generateTrainStmt currently manually parse the attributes and column clause's expressions, it may also generate parsing errors, so, the caller runSingleSQLFlowStatement may need to update the error message generated by generateTrainStmt to some suggestion for the user to fix his statement input.

I believe that we‘ve agreed on the following issues:

We need a layered Error hierarchy.
We need a special Error struct or function that issues a detailed diagnostics message.
We need a more careful checking style than plain if err != nil.

The controversy is about the 2nd issue above:

How to express the Diagnostics semantic.

Option 1: Panic-Driven diagnostics

// IssueDiagnostics will panic and send a diagnostics message to the topmost function on the call stack
func IssueDiagnostics(e error, diagnostics string) {
    panic(Diagnostics{"error: " + e.Error() + ": " + diagnostics})
}

// CheckAndPrintDiagnostics will recover and collect the diagnostics messages
func CheckAndPrintDiagnostics() bool {
    if r := recover(); r != nil {
        switch r.(type) {
        case Diagnostics:
            responseDiagnostics(r)
        default:
            log.Fatal("unexpected panic", r)
        }
    }
}

// MyFunction does something cool or return an error
func MyFunction() error {
    _, e := AnotherFunction()
    switch e.(type) {
    case AnErrorICanDiagnose:
        IssueDiagnostics(e, "please try my solution")
    default:
        return e
    }
}

// runSQLProgram is the topmost caller
func runSQLProgram() error {
    defer CheckAndPrintDiagnostics()
    _, e := MyFunction()
    if e != nil {
        // ...
    }
    _, e := HerFunction()
    if e != nil {
        // ...
    }

}

Option 2: Return the Diagnostics call by call manually

// NewDiagnostics will return a Diagnostics object that hold a diagnostics message
// Actually, there's no explicit mechanism to express `issue a diagnostics message`  without `panic`
func NewDiagnostics(e error, diagnostics string) error {
    return Diagnostics{"error: " + e.Error() + ": " + diagnostics}
}

// MyFunction does something cool or return an error
func MyFunction() error {
    _, e := AnotherFunction()
    switch e.(type) {
    case AnErrorICanDiagnose:
        return NewDiagnostics(e, "please try my solution")
    default:
        return e
    }
}

// runSQLProgram is the topmost caller
func runSQLProgram() {
    _, e := MyFunction()
    if e != nil {
        switch e.(type) {
        case Diagnostics:
            responseDiagnostics(e)
        default:
            // ...
        }
    }
    _, e := HerFunction()
    if e != nil {
        switch e.(type) {
        case Diagnostics:
            responseDiagnostics(e)
        default:
            // ...
        }
    }
}

I prefer the panic-driven solution because it is a more explicit and aloud expression that distinguishes the semantics of "print diagnostics messages and quit asap" from normal error handling semantics of "Fix if possible or else kick it to the superiors".

And see https://golang.org/src/encoding/json/encode.go for a real-world example of panic-driven error handling.

@shendiaomo https://blog.golang.org/defer-panic-and-recover describes how panic works:

Panic is a built-in function that stops the ordinary flow of control and begins panicking. When the function F calls panic, execution of F stops, any deferred functions in F are executed normally, and then F returns to its caller. To the caller, F then behaves like a call to panic. The process continues up the stack until all functions in the current goroutine have returned, at which point the program crashes. Panics can be initiated by invoking panic directly. They can also be caused by runtime errors, such as out-of-bounds array accesses.

If we write some code like:

func train() {
    createTmpTables()
    Run(Codegen(parser.Parse("SELECT ... TO TRAIN ..."))) // may panic here
    dropTmpTables()
    cleanUpModelDirectory()
    ...
}

func main() {
    train()
    // do recover in the server loop
    r := recover()
}

The function call dropTmpTables() and cleanUpModelDirectory() will not be executed as expected.

Yes, we can wrap these calls in one function and use defer then it should be executed by Go, yet we can not assume we always do cleanups or code after panic() are all useless.

And I searched the common use cases of panics and errors like: https://eli.thegreenplace.net/2018/on-the-uses-and-misuses-of-panics-in-go/ and https://stackoverflow.com/questions/44504354/should-i-use-panic-or-return-error, panics normally should be used when the program is indeed not recoverable.

And I searched the common use cases of panics and errors like: https://eli.thegreenplace.net/2018/on-the-uses-and-misuses-of-panics-in-go/ and https://stackoverflow.com/questions/44504354/should-i-use-panic-or-return-error, panics normally should be used when the program is indeed not recoverable.

However, several examples in https://eli.thegreenplace.net/2018/on-the-uses-and-misuses-of-panics-in-go/ are panicking when the program is recoverable.

fmt/scan.go

func (s *ss) scanInt(verb rune, bitSize int) int64 {
  if verb == 'c' {
    return s.scanRune(bitSize)
  }
  s.SkipSpace() // panic
  s.notEOF() // panic
  base, digits := s.getBase(verb) // panic
  // ... other code
}

After demoing this example from the standard library of Go, the blog says: A panic from each of these will be recovered in the public API (like the Token method) and converted to an error.

The usage of panic in json/encode.go and text/template/parse/parser.go is similar to fmt/scan.go. To avoid repetition, I omit the details here.

An example in Effective Go

func server(workChan <-chan *Work) {
    for work := range workChan {
        go safelyDo(work)
    }
}

func safelyDo(work *Work) {
    defer func() {
        if err := recover(); err != nil {
            log.Println("work failed:", err)
        }
    }()
    do(work)
}

It's a server, it panicked, it has recovered.

Diagnosis Error Type

From @wangkuiyi's comment: https://github.com/sql-machine-learning/sqlflow/issues/2171#issuecomment-620999543

The proposed improvement in the above comment only records the error but doesn't provide solutions/suggestions. However, what users need are solutions/suggestions.

The invocation of function a is the closest to the happening of the error. Sometimes, the author of function a has the best idea about the suggestions.

The invocation of function c is at the highest abstraction level. Sometimes, the author of function c has the best suggestions.

So, it seems that we need the following kind of message:
2009/11/10 23:00:00 Here happened an error
  /tmp/sandbox203385592/prog.go:18 main.a  You might want to do this.
  /tmp/sandbox203385592/prog.go:23 main.b  You might want to do that.
  /tmp/sandbox203385592/prog.go:28 main.c  You might want to try ...

I agree with SQLFlow should return more useful messages to users e.g. a suggestion/solution. In my opinion, a meaningful message includes two parts:

for users, we should tell them what happened and how to solve the error.
for the programmer, we should log the stack trace to help to debug.

I prefer users can see one indicated error message with a suggestion:

Here happened an error, you might want to do that.

you might want to do that. can be given by function c/b/c. If the author of function c believes that he can give a better suggestion than function b, than place the diagnostic message from function b.

From https://yourbasic.org/golang/recover-from-panic/#return-a-value

Stack traces can be very useful for debugging:

not only do you see where the error happened,

but also how the program arrived in this place.

We can logs the stack trace on server logs to help developers to debut the error:

/tmp/sandbox203385592/prog.go:18 main.a  
/tmp/sandbox203385592/prog.go:23 main.b 
/tmp/sandbox203385592/prog.go:28 main.c

To summarize, I prefer using Diagnosis struct to record the trace stack and diagnostic message.

type Diagnosis struct {
  message string
  StackStrace string
  err  error
}
func (e *Diagnosis) Error() string {
  return e.message
}
func (e *Diagnosis) Unwrap() error {
  return e.err
}

Diagnosis implements the error interface, so the function can return an error type as usual.

Unwrap is the latest feature in Go 1.13 , Diagnosis can wrap other error type, the author can use Is/As function to help diagnose and give a better suggestion.

func f1() error {
  ...
  return diagnosis.New(ErrEmptyDataset, "dataset empty, please check the dataset result")
}
func f2() error{
  e := f1()
  // Is/As will examine the entire chain in a single call
  if errors.Is(e, ErrEmptyDataset) {
    // func2 can give a better suggession than f1
    return diagnosis.New(ErrEmptyDataset, "fetch an empty dataset from query: %s ")
  } else {
    // no better suggession, pop-up the error
    return e
  }
}

Panic-driven or Not

There are two options to raise the diagnosis message: using panic or not.

I quoted two passages from Dave Cheney's blog about panic

Go solves the exception problem by not having exceptions. Instead, Go allows functions to return an error type in addition to a result via its support for multiple return values

When you panic in Go, you’re freaking out, it’s not someone else problem, it’s game over man.

Go using panic to express the exception, and return an error type to express something went wrong. And Rob Pike coined the Don’t Panic proverb, which is self-explanatory: you should probably avoid it, and return errors instead, ref: https://go-proverbs.github.io .

From https://github.com/golang/go/wiki/PanicAndRecover

By convention, no explicit panic() should be allowed to cross a package boundary. Indicating error conditions to callers should be done by returning an error value

The exposed function of a package should recover and return an error. We can also find some cases, e.g. json/encode.go

To summaries:

We should use panic cautiously, only panic on the exception that causes the program goes to die, for the most case, Diagnosis is not part of it.
The exposed function of a package should use recover to catch the panic and return the Diagnosis error type as much as possible.

Can Do More in The Future

Statistics the diagnosis rate: diagnosis error / all error so that we can continuously replenish diagnostic logic.
Using Unwrap, we can get the whole error chain, and maybe we can summarize some patterns from the error chain to the diagnosis/solution/suggestion.

We only have to check Diagnostics in the top-most function, if we do want to keep it clean without panic, a monadic error type can simplify the error-handling code

sql-machine-learning / sqlflow