Open hyangah opened 1 year ago
We will see if the increased max len and compression mechanism to be added for #61181 alleviate the issue.
Change https://go.dev/cl/509435 mentions this issue: internal/counter: compress stack counter names
With a longer limit on stack counter names, this is probably not urgent, and may not even be worth the complexity.
For example, there are many redundant frames for bug.Errorf here, but on the other hand it doesn't affect the viability of the data, and is somewhat useful to know that these are all bug reports.
Per discussion today, if we want to have any automation (such as @adonovan's new x/tools/gopls/internal/telemetry/cmd/stacks) that detects e.g. the most significant frame in the call stack, it could be useful to make that frame the top of the call stack.
That's a newly compelling reason to do this.
The new compression of stack counter names replaces the "path" part of a symbol with a ditto mark ("
) if it is the same as the previous line. I dislike it for several reasons.
I propose we revert it.
Add it to the agenda.
On Mon, Jan 29, 2024 at 5:19 PM Alan Donovan @.***> wrote:
The new compression of stack counter names replaces the "path" part of a symbol with a ditto mark (") if it is the same as the previous line. I dislike it for several reasons.
- The ditto mark (an unbalanced quote) is surprising and confusing. While it's common in penmanship, it is not a software convention. The first time I saw it I started debugging my program to find out why it was emitting garbage.
- The compression makes the stack harder to use. Grep doesn't work. My tests in CL 558256 don't work. It requires a stateful decoder to read it.
- The split function is incorrect for generic symbols, which may contain periods, making the result even more confusing.
I propose we revert it.
— Reply to this email directly, view it on GitHub https://github.com/golang/go/issues/61234#issuecomment-1915675228, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJIAI3I5PKILX2XVA42L23YRAN63AVCNFSM6AAAAAA2CL7UGWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJVGY3TKMRSHA . You are receiving this because you are on a team that was mentioned.Message ID: @.***>
@adonovan it looks like that compression was added for a reason. We shouldn't revert without addressing that reason in a different way. Increasing the max name length seems reasonable, as does @hyangah's proposal in this issue to allow skips (which we want to do anyway: https://github.com/golang/go/issues/61234#issuecomment-1821845528).
More generally, it feels like the real problem here is the loose coupling between x/telemetry and the new functionality added in https://go.dev/cl/558256. The counter file format is not intended as an externally facing API, and so it is not surprising that it doesn't make for a good API. Since we're going eventually move the watchdog functionality into x/telemetry, that concern is only temporary. So I don't find (1) or (2) particularly compelling reasons to revert.
But the bugginess of (3) is a real problem. Since fixing that feels like it will involve significant complexity, reverting and instead increasing the max name length sounds like a reasonable path forward. Keep it simple.
Why is (3) a problem if the raw counter name is not meant to be interpreted directly and running decoder can recover it to the original stack trace?
We shouldn't revert without addressing that reason in a different way.
Of course; increasing the limit seems like a good solution. I can't imagine the extra size of these messages has a significant dollar cost compared the their immense value for debugging.
More generally, it feels like the real problem here is the loose coupling between x/telemetry and the new functionality added in https://go.dev/cl/558256. The counter file format is not intended as an externally facing API, and so it is not surprising that it doesn't make for a good API. Since we're going eventually move the watchdog functionality into x/telemetry, that concern is only temporary. So I don't find (1) or (2) particularly compelling reasons to revert.
I agree that we should move watchdog into x/telemetry at some point, and that would justify writing a more elaborate decoder for the watchdog test, but I don't think that changes the thrust of my argument.
The counter names are not an API, but they are an interface, and they are exposed via json files such as https://storage.googleapis.com/prod-telemetry-merged/2023-12-05.json to tools that operate on them such as stacks. The stacks command parses the stacks to extract the likely culprit symbol, and (as you know) I would like it to additionally mark up each frame to the correct file@version:line in CodeSearch.
Perhaps you're saying the intent was that counter names are merely unstructured strings, but that seems like a lost opportunity to give stacks the special treatment that they--the primary data structure for post-mortem analysis--surely deserve. We should either specify the grammar of these strings, or provide a package that parses them. I think the existing uncompressed format is already pretty good.
@adonovan sorry, I thought the uploads already parsed the counter file, and therefore included the expanded name. Is that not the case?
@adonovan sorry, I thought the uploads already parsed the counter file, and therefore included the expanded name. Is that not the case?
Ah, I didn't realize that was the case (if it is indeed the case). So it sounds like the compressed form is supposed to be an internal (but stable and precisely specified) detail and you already have routines for compressing and decompressing it that my test could take advantage of, especially when it is moved into x/telemetry.
@hyangah
Why is (3) a problem if the raw counter name is not meant to be interpreted directly and running decoder can recover it to the original stack trace?
You're right, it's not a big problem, since the compression/expansion is deterministic. Just a bit sloppy.
@adonovan yes, that is the point I was trying to make: when in x/telemetry you wouldn't only interact with parsed counter files. I confirmed that this is the case, and the uploaded payloads are expanded.
Currently
StackCounter.Inc
captures theInc
caller's stack usingruntime.Callers(2, ...)
. One use case we want to explore is to have a StackCounter to track calls togopls/internal/bug.report
.If we call
StackCounter.Inc
from thisreport
function (wrapper ofStackCounter.Inc
), the first couple of entries are always the functions in thisgopls/internal/bug
package, which isn't very interesting while taking up a good fraction of the counter name which already has pretty small max limit (https://github.com/golang/go/issues/61181).Can we allow the api to configure custom skip depth like
runtime.Callers
orlog.Logger.Output
?Or alternatively, can we have
StackCounter.Inc
takepcs
?cc @golang/tools-team @rsc