runtime: CL 561635 changes relative PCs printed in tracebacks

aclements commented 8 months ago

The intent of CL 561635 was to add absolute pc=0x%xs to inlined frames in tracebacks when GOTRACEBACK is set to system or higher. The ultimate goal of this was to be able to compute function-relative line numbers for telemetry reports.

Unfortunately, it had the side-effect of changing the relative PCs printed in all traceback modes. Specifically, if f calls g calls h and g and h are inlined into f, we used to print the "innermost" relative PC on the f frame, even though that technically points into h's body, and not print relative PCs on g or h, like this:

pkg/h(...)
    pkg/x.go:266
pkg/g(...)
    pkg/x.go:267
pkg/f()
    pkg/x.go:268 +0x26

where 0x26 points to an instruction in the inlined copy of h.

With @adonovan 's change, we now print the "outermost" relative PC on the f frame, which is a NOP marker for where g was inlined. This is, in some sense, more consistent with the absolute PCs that can be printed, but in the common case it loses information.

There are a few hard constraints here:

Printing sym+offset only makes sense when sym is a symbol with an entry point that offset can be relative to. For an inlined stack nest, that means we should only print this on the outermost frame.
If we print both an offset and an absolute PC, they should agree.

And several nice-to-haves:

Panics must include relative PCs. I don't think most users care about this at all, but people often paste tracebacks in Go issues and these relative PCs can be critical for us. It's just enough information that we rarely need to ask for a GOTRACEBACK=system traceback.
runtime.Stack must agree with printed tracebacks at the default GOTRACEBACK level.
Include PCs for every frame if GOTRACEBACK=system.
Don't lose information about the PC. That means, if we only print one PC (whether relative or absolute), it should point at the "innermost" inlined frame because you can derive the other PCs from that. This is especially true if we wind up moving the NOPs to the end of the function, in which case pointing at one of those will be next to useless.
Attach PCs to the frames they logically represent.

I think @adonovan 's changes to traceback do satisfy 1–5 and 7, but violate 6.

Here are the options that come to my mind:

We could revert the change. Then we satisfy 1–4 and 6, but violate 5 and 7. Then we're back at the original problem that telemetry had. We have a few alternatives for telemetry to get relative line numbers:

a. We just start printing relative line numbers in the traceback.

b. 1a but gated by a GOTRACEBACK setting (orthogonal to its current settings)

c. An option for structured tracebacks.
We include meaningful relative PCs for every frame, including inlined frames. Something like

pkg/h(...)
    pkg/x.go:266 f+0x26
pkg/g(...)
    pkg/x.go:267 f+0x16
pkg/f()
    pkg/x.go:268 +0x13

To reduce line noise, I've stripped the package path from the outer symbol printed for the inlined frames. We could maybe reduce it even further by just saying ...+0x26 or parent+0x26 or something. I think this satisfies all of the constraints and nice-to-haves.

Options 1a and 1b are pretty straightforward in the runtime.

Option 1c is probably a good idea, but is also a whole endeavor.

Option 2 is a heavier lift but arguably more principled.

adonovan commented 8 months ago

Thanks for the throrough analysis.

Yes, I (actually x/telemetry) need PCs only to compute func-relative line numbers, so option 2 would work. But I expect other telemetry systems will surely want as much information as they can get, especially if it relieves them of the need to (a) fork a child process with the same pclntab to decode the message printed via SetCrashOutput and (b) subtract the relative text segment offset from all addresses. A structured complete dump would address that. One concern: GOTRACEBACK=foo can increase the level but not decrease it, so a crash monitor would want "structured" to be the maximum value, perpetually, in case a user inadvertently lowers the value, causing the crash reports to be useless to the monitor. Or perhaps we should change SetCrashOutput to always emit structured output before the API freezes. I think this is my favorite option, but it's definitely more work on both sides. I'd be willing to implement it. I don't think it's that hard if we can agree on the schema.

aclements commented 8 months ago

@adonovan , I think I'm not clear on how it's using even the absolute PCs to get relative line numbers. Doesn't it still have to parse the functab to get function start lines? Or is it assuming the symbolic information for the entry PC of a function represents its first line?

What do you think of options 1a or 1b? If we're considering option 2, then we're already open to tweaking the traceback text format, and 1a/1b avoid a lot of gross work on the telemetry side.

Structured tracebacks will be a fair amount of work. We may want to make sure the SetCrashOutput API is able to support that in the future, but I'm not sure we want to otherwise block this on designing that.

adonovan commented 8 months ago

@adonovan , I think I'm not clear on how it's using even the absolute PCs to get relative line numbers. Doesn't it still have to parse the functab to get function start lines? Or is it assuming the symbolic information for the entry PC of a function represents its first line?

It parses the absolute PCs out of the SetCrashOutput text, subtracts the text segment offset (difference between parent and child process, same executable), then formats the stack using the same runtime.CallersFrames-based function that is used in the ordinary "grab current stack and increment counter" workflow. The function-relative line number is computed usingfr.Line - fr.Func.FileLine(fr.Entry) on the runtime.Frame.

What do you think of options 1a or 1b? If we're considering option 2, then we're already open to tweaking the traceback text format, and 1a/1b avoid a lot of gross work on the telemetry side.

I suspect most users won't want to see relative line numbers in the traceback, and will want to see absolute ones, so there's some redundancy there. (Relative numbers can easily be confused for absolute ones, though I recently changed x/telemetry to include the explicit sign "foo:+3" to ameliorate that.)

But I think options 1a and 1b may be overfitting to the current rather ad hoc needs of x/telemetry, and that other telemetry systems (including future variants of ours) may want more information, including (de-ASLR'd) program counters plus whatever else can be scraped out of CallersFrames--in essence, the full structured dump.

aclements commented 7 months ago

Adding release-blocker because the change as it stands has lost information about the PC.

adonovan commented 7 months ago

Here's a starting point for a schema for a structured dump. JSON seems like the obvious syntax; I don't think it should be that hard to encode in a single pass to a file descriptor.

I chose the schema below to avoid a commitment that every non-leaf frame has a PC, since (as you've mentioned), it's a fiction, and the non-leaf PCs are not all CALLs. The non-leaf PCs are in most cases just a means to obtain File/Line/etc information, but since that's provided in the dump, the issue is moot.

[Edit: the diff shows the evolution of the first draft during subsequent discssion.]

// A Crash describes a runtime crash.
type Crash struct {
    Message    string       // fatal message or panic value
    PanicType  string       // operand type of panic(x) call; "" => not a panic
-   TextOffset uintptr      // ASLR offset of text segment
+       RuntimeText uintptr     // address of runtime.text symbol (for un-applying ASLR)
    Goroutines []Goroutines // set of all goroutines
}

// A Goroutine describes a goroutine that was live at the moment of the crash.
type Goroutine struct {
    ID    int     // unique identifier
    State string  // running, runnable, wait, etc.
    Stack []Frame // stack of active Go function calls, outermost first
}

// A Frame is a logical stack frame entry.
// The last frame in the Stack is the "leaf".
type Frame struct {
    // information about the logical function
    Function  string // canonical name of function (e.g "pkg.T.f"); "" if unknown
    Filename  string // name of file declaring function; "" if unknown
    EntryPC   uintptr // text segment address of function entrypoint  (is this meaningful if Inlined??)
    StartLine int  // line number of declaration's 'func' token; 0 if unknown

    // information about the goroutine's position within this function
    PC        uintptr // text segment address of current PC (if leaf), active call (if callee is not inlined), or 0 if callee inlined
    Line      int     // line number of PC (if leaf) or active call (non-leaf)
    Inlined   bool    // this function was inlined (=> parent Frame has no PC)
}

We needn't gate the above issue on implementing a full structured dump, but I'd be willing to invest a day to get a prototype together to see if this looks like the right approach. Let me know what you think.

cherrymui commented 7 months ago

TextOffset uintptr // ASLR offset of text segment

The address of text segment is pretty hard to get in a portable way. Maybe we could use the address of the runtime.text symbol, which is "the start address of Go text" (which is essentially the text segment address, but in cgo programs there could be C functions before Go text and it is pretty hard to know the segment start)?

EntryPC uintptr // text segment address of function entrypoint (is this meaningful if Inlined??)

For inlined frames, the entry PC is probably not meaningful. As the PC would be 0, EntryPC probably should be 0 as well.

On the other hand, how would one compute relative PC (assuming it is useful)? Say, G is inlined to F, G crashes, we'd have F:

EntryPC: F's start address
PC:      0
Inlined: false

G

EntryPC: 0??
PC:      faulting address
Inlined: true

So we'd need FrameG.PC - FrameF.EntryPC? Which seems a bit awkward... Using F's start address as G's EntryPC is also somewhat weird, though...

adonovan commented 7 months ago

Maybe we could use the address of the runtime.text symbol, which is "the start address of Go text".

Sounds good.

RuntimeText uintptr // address of runtime.text symbol (for un-applying ASLR)

For inlined frames, the entry PC is probably not meaningful. As the PC would be 0, EntryPC probably should be 0 as well.

On the other hand, how would one compute relative PC (assuming it is useful)?

It's not obvious to me that it is useful. The virtual call (NOP) PCs were a means to an end (file/line/etc) that would now be served directly.

adonovan commented 7 months ago

This comment in traceback makes me wonder whether JSON is the right encoding, since a truncated JSON document is not a JSON document:

    // - We'd prefer to walk the stack once because in really bad situations
    //   traceback may crash (and we want as much output as possible) or the stack
    //   may be changing.

Is this rare enough in practice not to matter?

cherrymui commented 7 months ago

Traceback crashing is usually a runtime bug or memory corruption e.g. due to misuse of unsafe or cgo. If we say that this feature is mostly for debugging user code, not runtime bugs, then maybe we could say it is rare.

gopherbot commented 7 months ago

Change https://go.dev/cl/571676 mentions this issue: Revert "runtime: traceback: include pc=0x%x for inline frames"

gopherbot commented 7 months ago

Change https://go.dev/cl/571797 mentions this issue: runtime: revert "traceback: include pc=0x%x for inline frames"

rsc commented 7 months ago

The JSON should use the same field names as exported runtime data structures and methods whenever possible. This means s/Filename/File/ and s/EntryPC/Entry/.

aclements commented 7 months ago

I like the general shape of this, but have two big comments, both of which take inspiration from go test -json:

Rather than producing one long JSON object, which as you noted can be a problem if it gets truncated, I'd much rather we produce a stream of newline-separated JSON objects. That's still easy to parse (especially if we can really guarantee it's all JSON, unlike go test -json), but much more robust. It's also generally easier to process in a streaming manner if that matters, because you know you'll get high-level information before lower-level information. We could have a few different types of records for "starting a panic trace", "starting a goroutine", and "frame".
Consider including everything that would appear on stderr in these records. E.g., in the go test -json format, you can just concatenate the Output fields of all the records to reconstitute the original text format. That leads to some redundancy, but can be incredibly useful.

adonovan commented 7 months ago

I'd much rather we produce a stream of newline-separated JSON objects.

@prattmic and I were just discussing this; like @cherrymui he thought truncation would arise only very rarely due to obscure runtime bugs (e.g. signal race with vdso), and that losing those records wouldn't stop application developers from getting good coverage of their bugs from field reports. (Of course, the runtime team might have a different valuation of exactly those bug reports.)

Consider including everything that would appear on stderr in these records.

By "everything", do you mean all the strings printed during the traceback operation---the unstructured crash report? That would be redundant with the structured report, though perhaps slightly easier to read. (I can't imagine you are proposing that the entire process's stderr, even before the crash, would be JSON-framed and emitted to the crash fd.)

adonovan commented 7 months ago

Consider including everything that would appear on stderr in these records.

Also, that would require buffering, and we can't be allocating at this point.

aclements commented 7 months ago

By "everything", do you mean all the strings printed during the traceback operation

Sorry, yes, I mean just the traceback. I'm thinking, if we tie this to SetCrashOutput, I strongly suspect there are going to be tools that want to reconstitute the stderr text report.

Also, that would require buffering, and we can't be allocating at this point.

It's not straightforward, but it is doable. One way I can think of (there may be simpler ways) is that traceback would construct up a packet of information representing the frame to be printed and that can be printed as many times and different ways as necessary. There's a strictly bounded amount of information at that level. It only becomes unbounded and difficult to work with once you actually start emitting the strings.

adonovan commented 7 months ago

if we tie this to SetCrashOutput, I strongly suspect there are going to be tools that want to reconstitute the stderr text report.

It would be much simpler to define a String method on Crash that formats the (decoded) crash report in a similar (but not identical) form. Would that suffice?

traceback would construct up a packet of information representing the frame to be printed and that can be printed as many times and different ways as necessary.

That's essentially the approach I was thinking of in any case, to avoid the need to duplicate the sprinkling of two kinds of print statements throughout the traceback algorithm: each packet would be (a) print()ed to stderr, plus optionally (b) JSON-encoded to crashfd. I suppose it would be possible to repeat the formatting done by (a), but outputting into a JSON string, without buffering or code duplication, by replacing all the print statements by an abstract print function (with string/int variants).

gopherbot commented 7 months ago

Change https://go.dev/cl/571656 mentions this issue: internal/crashmonitor: update assertions after CL 561635 revert

aclements commented 7 months ago

It would be much simpler to define a String method on Crash that formats the (decoded) crash report in a similar (but not identical) form. Would that suffice?

I'm not sure I understand what this would mean. What's Crash? Are you proposing that we define an exported type somewhere with the JSON definition (which would then have a String method)? We could do that, though one of the nice things about JSON is how easily it can evolve, especially if we say it's not part of the compatibility guarantee just like the text format isn't. With a stderr field, tools can continue to reconstitute the text even if the details change or if they're built against a different version of Go than the one producing the crash output.

adonovan commented 7 months ago

What's Crash? Are you proposing that we define an exported type somewhere with the JSON definition (which would then have a String method)?

Crash as defined here. Yes, that's what I had in mind.

We could do that, though one of the nice things about JSON is how easily it can evolve, especially if we say it's not part of the compatibility guarantee just like the text format isn't. With a stderr field, tools can continue to reconstitute the text even if the details change or if they're built against a different version of Go than the one producing the crash output.

Isn't the purpose of a structured dump to make well-documented promises about the schema of what the reader can expect, just as we do for (say) go list -json? I don't understand why we would exempt the schema from the compatibility promise, especially because, as you say, JSON is so evolvable. (As with go list, it's not necessary to actually export JSON declarations, but we would need to commit to them.)

cherrymui commented 7 months ago

With Russ's CLs the original breaking change is resolved, so I think this doesn't need to be a release blocker now.

aclements commented 5 months ago

The issue with relative PCs has long since been resolved by backing out CL 561635 and tweaking runtime.CallersFrames to work with the PCs that are printed in the traceback. #67182 is going to add an options struct to debug.SetCrashOutput that we can use in the future to add structured crash output. Closing this as completed. We'll have a follow-on issue to propose structured crash output.

golang / go

runtime: CL 561635 changes relative PCs printed in tracebacks #65761