Closed aclements closed 5 months ago
Thanks for the throrough analysis.
Yes, I (actually x/telemetry) need PCs only to compute func-relative line numbers, so option 2 would work. But I expect other telemetry systems will surely want as much information as they can get, especially if it relieves them of the need to (a) fork a child process with the same pclntab to decode the message printed via SetCrashOutput and (b) subtract the relative text segment offset from all addresses. A structured complete dump would address that. One concern: GOTRACEBACK=foo can increase the level but not decrease it, so a crash monitor would want "structured" to be the maximum value, perpetually, in case a user inadvertently lowers the value, causing the crash reports to be useless to the monitor. Or perhaps we should change SetCrashOutput to always emit structured output before the API freezes. I think this is my favorite option, but it's definitely more work on both sides. I'd be willing to implement it. I don't think it's that hard if we can agree on the schema.
@adonovan , I think I'm not clear on how it's using even the absolute PCs to get relative line numbers. Doesn't it still have to parse the functab to get function start lines? Or is it assuming the symbolic information for the entry PC of a function represents its first line?
What do you think of options 1a or 1b? If we're considering option 2, then we're already open to tweaking the traceback text format, and 1a/1b avoid a lot of gross work on the telemetry side.
Structured tracebacks will be a fair amount of work. We may want to make sure the SetCrashOutput
API is able to support that in the future, but I'm not sure we want to otherwise block this on designing that.
@adonovan , I think I'm not clear on how it's using even the absolute PCs to get relative line numbers. Doesn't it still have to parse the functab to get function start lines? Or is it assuming the symbolic information for the entry PC of a function represents its first line?
It parses the absolute PCs out of the SetCrashOutput text, subtracts the text segment offset (difference between parent and child process, same executable), then formats the stack using the same runtime.CallersFrames-based function that is used in the ordinary "grab current stack and increment counter" workflow. The function-relative line number is computed usingfr.Line - fr.Func.FileLine(fr.Entry)
on the runtime.Frame.
What do you think of options 1a or 1b? If we're considering option 2, then we're already open to tweaking the traceback text format, and 1a/1b avoid a lot of gross work on the telemetry side.
I suspect most users won't want to see relative line numbers in the traceback, and will want to see absolute ones, so there's some redundancy there. (Relative numbers can easily be confused for absolute ones, though I recently changed x/telemetry to include the explicit sign "foo:+3"
to ameliorate that.)
But I think options 1a and 1b may be overfitting to the current rather ad hoc needs of x/telemetry, and that other telemetry systems (including future variants of ours) may want more information, including (de-ASLR'd) program counters plus whatever else can be scraped out of CallersFrames--in essence, the full structured dump.
Adding release-blocker because the change as it stands has lost information about the PC.
Here's a starting point for a schema for a structured dump. JSON seems like the obvious syntax; I don't think it should be that hard to encode in a single pass to a file descriptor.
I chose the schema below to avoid a commitment that every non-leaf frame has a PC, since (as you've mentioned), it's a fiction, and the non-leaf PCs are not all CALLs. The non-leaf PCs are in most cases just a means to obtain File/Line/etc information, but since that's provided in the dump, the issue is moot.
[Edit: the diff shows the evolution of the first draft during subsequent discssion.]
// A Crash describes a runtime crash.
type Crash struct {
Message string // fatal message or panic value
PanicType string // operand type of panic(x) call; "" => not a panic
- TextOffset uintptr // ASLR offset of text segment
+ RuntimeText uintptr // address of runtime.text symbol (for un-applying ASLR)
Goroutines []Goroutines // set of all goroutines
}
// A Goroutine describes a goroutine that was live at the moment of the crash.
type Goroutine struct {
ID int // unique identifier
State string // running, runnable, wait, etc.
Stack []Frame // stack of active Go function calls, outermost first
}
// A Frame is a logical stack frame entry.
// The last frame in the Stack is the "leaf".
type Frame struct {
// information about the logical function
Function string // canonical name of function (e.g "pkg.T.f"); "" if unknown
Filename string // name of file declaring function; "" if unknown
EntryPC uintptr // text segment address of function entrypoint (is this meaningful if Inlined??)
StartLine int // line number of declaration's 'func' token; 0 if unknown
// information about the goroutine's position within this function
PC uintptr // text segment address of current PC (if leaf), active call (if callee is not inlined), or 0 if callee inlined
Line int // line number of PC (if leaf) or active call (non-leaf)
Inlined bool // this function was inlined (=> parent Frame has no PC)
}
We needn't gate the above issue on implementing a full structured dump, but I'd be willing to invest a day to get a prototype together to see if this looks like the right approach. Let me know what you think.
TextOffset uintptr // ASLR offset of text segment
The address of text segment is pretty hard to get in a portable way. Maybe we could use the address of the runtime.text
symbol, which is "the start address of Go text" (which is essentially the text segment address, but in cgo programs there could be C functions before Go text and it is pretty hard to know the segment start)?
EntryPC uintptr // text segment address of function entrypoint (is this meaningful if Inlined??)
For inlined frames, the entry PC is probably not meaningful. As the PC
would be 0, EntryPC
probably should be 0 as well.
On the other hand, how would one compute relative PC (assuming it is useful)? Say, G is inlined to F, G crashes, we'd have F:
EntryPC: F's start address
PC: 0
Inlined: false
G
EntryPC: 0??
PC: faulting address
Inlined: true
So we'd need FrameG.PC - FrameF.EntryPC
? Which seems a bit awkward... Using F's start address as G's EntryPC
is also somewhat weird, though...
Maybe we could use the address of the
runtime.text
symbol, which is "the start address of Go text".
Sounds good.
RuntimeText uintptr // address of runtime.text symbol (for un-applying ASLR)
For inlined frames, the entry PC is probably not meaningful. As the
PC
would be 0,EntryPC
probably should be 0 as well.On the other hand, how would one compute relative PC (assuming it is useful)?
It's not obvious to me that it is useful. The virtual call (NOP) PCs were a means to an end (file/line/etc) that would now be served directly.
This comment in traceback makes me wonder whether JSON is the right encoding, since a truncated JSON document is not a JSON document:
// - We'd prefer to walk the stack once because in really bad situations
// traceback may crash (and we want as much output as possible) or the stack
// may be changing.
Is this rare enough in practice not to matter?
Traceback crashing is usually a runtime bug or memory corruption e.g. due to misuse of unsafe or cgo. If we say that this feature is mostly for debugging user code, not runtime bugs, then maybe we could say it is rare.
Change https://go.dev/cl/571676 mentions this issue: Revert "runtime: traceback: include pc=0x%x for inline frames"
Change https://go.dev/cl/571797 mentions this issue: runtime: revert "traceback: include pc=0x%x for inline frames"
The JSON should use the same field names as exported runtime data structures and methods whenever possible. This means s/Filename/File/ and s/EntryPC/Entry/.
I like the general shape of this, but have two big comments, both of which take inspiration from go test -json
:
go test -json
), but much more robust. It's also generally easier to process in a streaming manner if that matters, because you know you'll get high-level information before lower-level information. We could have a few different types of records for "starting a panic trace", "starting a goroutine", and "frame".go test -json
format, you can just concatenate the Output
fields of all the records to reconstitute the original text format. That leads to some redundancy, but can be incredibly useful.I'd much rather we produce a stream of newline-separated JSON objects.
@prattmic and I were just discussing this; like @cherrymui he thought truncation would arise only very rarely due to obscure runtime bugs (e.g. signal race with vdso), and that losing those records wouldn't stop application developers from getting good coverage of their bugs from field reports. (Of course, the runtime team might have a different valuation of exactly those bug reports.)
Consider including everything that would appear on stderr in these records.
By "everything", do you mean all the strings printed during the traceback operation---the unstructured crash report? That would be redundant with the structured report, though perhaps slightly easier to read. (I can't imagine you are proposing that the entire process's stderr, even before the crash, would be JSON-framed and emitted to the crash fd.)
Consider including everything that would appear on stderr in these records.
Also, that would require buffering, and we can't be allocating at this point.
By "everything", do you mean all the strings printed during the traceback operation
Sorry, yes, I mean just the traceback. I'm thinking, if we tie this to SetCrashOutput
, I strongly suspect there are going to be tools that want to reconstitute the stderr text report.
Also, that would require buffering, and we can't be allocating at this point.
It's not straightforward, but it is doable. One way I can think of (there may be simpler ways) is that traceback would construct up a packet of information representing the frame to be printed and that can be printed as many times and different ways as necessary. There's a strictly bounded amount of information at that level. It only becomes unbounded and difficult to work with once you actually start emitting the strings.
if we tie this to SetCrashOutput, I strongly suspect there are going to be tools that want to reconstitute the stderr text report.
It would be much simpler to define a String method on Crash that formats the (decoded) crash report in a similar (but not identical) form. Would that suffice?
traceback would construct up a packet of information representing the frame to be printed and that can be printed as many times and different ways as necessary.
That's essentially the approach I was thinking of in any case, to avoid the need to duplicate the sprinkling of two kinds of print statements throughout the traceback algorithm: each packet would be (a) print()
ed to stderr, plus optionally (b) JSON-encoded to crashfd. I suppose it would be possible to repeat the formatting done by (a), but outputting into a JSON string, without buffering or code duplication, by replacing all the print
statements by an abstract print function (with string/int variants).
Change https://go.dev/cl/571656 mentions this issue: internal/crashmonitor: update assertions after CL 561635 revert
It would be much simpler to define a String method on Crash that formats the (decoded) crash report in a similar (but not identical) form. Would that suffice?
I'm not sure I understand what this would mean. What's Crash
? Are you proposing that we define an exported type somewhere with the JSON definition (which would then have a String method)? We could do that, though one of the nice things about JSON is how easily it can evolve, especially if we say it's not part of the compatibility guarantee just like the text format isn't. With a stderr field, tools can continue to reconstitute the text even if the details change or if they're built against a different version of Go than the one producing the crash output.
What's Crash? Are you proposing that we define an exported type somewhere with the JSON definition (which would then have a String method)?
Crash as defined here. Yes, that's what I had in mind.
We could do that, though one of the nice things about JSON is how easily it can evolve, especially if we say it's not part of the compatibility guarantee just like the text format isn't. With a stderr field, tools can continue to reconstitute the text even if the details change or if they're built against a different version of Go than the one producing the crash output.
Isn't the purpose of a structured dump to make well-documented promises about the schema of what the reader can expect, just as we do for (say) go list -json
? I don't understand why we would exempt the schema from the compatibility promise, especially because, as you say, JSON is so evolvable. (As with go list, it's not necessary to actually export JSON declarations, but we would need to commit to them.)
With Russ's CLs the original breaking change is resolved, so I think this doesn't need to be a release blocker now.
The issue with relative PCs has long since been resolved by backing out CL 561635 and tweaking runtime.CallersFrames to work with the PCs that are printed in the traceback. #67182 is going to add an options struct to debug.SetCrashOutput that we can use in the future to add structured crash output. Closing this as completed. We'll have a follow-on issue to propose structured crash output.
The intent of CL 561635 was to add absolute
pc=0x%x
s to inlined frames in tracebacks when GOTRACEBACK is set tosystem
or higher. The ultimate goal of this was to be able to compute function-relative line numbers for telemetry reports.Unfortunately, it had the side-effect of changing the relative PCs printed in all traceback modes. Specifically, if f calls g calls h and g and h are inlined into f, we used to print the "innermost" relative PC on the f frame, even though that technically points into h's body, and not print relative PCs on g or h, like this:
where 0x26 points to an instruction in the inlined copy of h.
With @adonovan 's change, we now print the "outermost" relative PC on the f frame, which is a NOP marker for where g was inlined. This is, in some sense, more consistent with the absolute PCs that can be printed, but in the common case it loses information.
There are a few hard constraints here:
And several nice-to-haves:
GOTRACEBACK=system
traceback.GOTRACEBACK=system
.I think @adonovan 's changes to traceback do satisfy 1–5 and 7, but violate 6.
Here are the options that come to my mind:
We could revert the change. Then we satisfy 1–4 and 6, but violate 5 and 7. Then we're back at the original problem that telemetry had. We have a few alternatives for telemetry to get relative line numbers:
a. We just start printing relative line numbers in the traceback.
b. 1a but gated by a GOTRACEBACK setting (orthogonal to its current settings)
c. An option for structured tracebacks.
We include meaningful relative PCs for every frame, including inlined frames. Something like
To reduce line noise, I've stripped the package path from the outer symbol printed for the inlined frames. We could maybe reduce it even further by just saying
...+0x26
orparent+0x26
or something. I think this satisfies all of the constraints and nice-to-haves.Options 1a and 1b are pretty straightforward in the runtime.
Option 1c is probably a good idea, but is also a whole endeavor.
Option 2 is a heavier lift but arguably more principled.