Closed by kolesnikovae 11 months ago
We actually don't need the SDK to do this.
@aleks-p recently added this to our usagestats; the same discovery process could be used at ingest time.
The big question, though, is how to store it and introduce it in the API. We could store it in tsdb (using hidden labels) or in parquet. For the pprof API we can use sample labels, but I'm not sure about the SelectMergeStacktraces API.
> We actually don't need the SDK to do this.
Knowledge of the exact client SDK version might be very valuable when it comes to parsing the payload (example). Over time we will inevitably want to deprecate some old versions, which is much easier when we know the version rather than guessing it.
Moreover, I believe we should be able to identify the profile origin, not just the SDK. For example, we could introduce an __origin__ label that follows the well-known User-Agent semantics, or, better yet, use multiple labels to make it more structured.
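For illustration only, a minimal sketch of what that could look like; the value format and the multi-label names below are assumptions, not part of any existing API:

```go
package main

import "fmt"

func main() {
	// Hypothetical value only: the format is an assumption, loosely
	// following User-Agent (RFC 7231) semantics.
	origin := map[string]string{
		"__origin__": "pyroscope-go/1.7.0 (go1.21; linux/arm64)",
	}
	// The structured multi-label variant; these label names are
	// illustrative, not part of any existing API.
	structured := map[string]string{
		"__sdk_name__":    "pyroscope-go",
		"__sdk_version__": "1.7.0",
		"__platform__":    "linux/arm64",
	}
	fmt.Println(origin, structured)
}
```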
I'd like to hear your opinion, @korniltsev.
> @aleks-p recently added this to our usagestats; the same discovery process could be used at ingest time.
This issue is about optimising that process: it consumes up to 8-12% of CPU time in distributors (6 CPUs in our biggest cluster alone). I'm not sure I get your idea of using this at ingest time. If you mean that we should propagate the value (e.g., to the spy_name label, which we already have for this purpose), then I'm not sure how we would use it later. That said, I think it makes sense to perform the detection earlier, at parsing time, because in some cases we rely on the spyName API parameter, and it may be missing.
> The big question, though, is how to store it and introduce it in the API. We could store it in tsdb (using hidden labels) or in parquet. For the pprof API we can use sample labels, but I'm not sure about the SelectMergeStacktraces API.
For me, the biggest question is how this information is supposed to be used. The only use case I'm aware of is flamegraph colouring and language-specific decoration of function names. But for that we need to know the language of each function (call site), because a single stack trace may include frames from different runtimes (userspace/kernel, FFI, etc.). I'd say that "language" is an attribute of the mapping (in pprof terms; typically, the binary object/image the code is loaded from). I'd even say that the very idea of a "profile-wise language" is somewhat limited: it works well for statistical purposes, but not beyond.
Although we could potentially inform the frontend about the "major" language, I'm not sure it's worth it: the frontend can easily do this on its own. Randomly sample some functions from the profile, check them against regexes, and voilà: you know the "major" language of the profile.
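A minimal sketch of that idea, written in Go simply to match the rest of this thread; the patterns, names, and sample size are all illustrative:

```go
package main

import (
	"fmt"
	"math/rand"
	"regexp"
)

// Illustrative patterns only; a real detector would need a curated set.
var languageHints = map[string]*regexp.Regexp{
	"go":     regexp.MustCompile(`^runtime\.|\.go$`),
	"python": regexp.MustCompile(`\.py$`),
	"ruby":   regexp.MustCompile(`\.rb$`),
	"java":   regexp.MustCompile(`^(java|javax|jdk)[./]`),
}

// majorLanguage samples n random function names and returns the language
// with the most hits, or "" if nothing matched.
func majorLanguage(functions []string, n int) string {
	if len(functions) == 0 {
		return ""
	}
	hits := make(map[string]int)
	for i := 0; i < n; i++ {
		name := functions[rand.Intn(len(functions))]
		for lang, re := range languageHints {
			if re.MatchString(name) {
				hits[lang]++
			}
		}
	}
	best, bestHits := "", 0
	for lang, c := range hits {
		if c > bestHits {
			best, bestHits = lang, c
		}
	}
	return best
}

func main() {
	fmt.Println(majorLanguage([]string{"runtime.gcBgMarkWorker", "main.main", "java/lang/Thread.run"}, 8))
}
```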
I think this topic intersects with another discussion of ours about displaying application versions (function breakdown by version) in the flamegraph.
@cyriltovena perhaps you misunderstood, this issue is primarily about optimizing the work done for the usage stats, as it looks like it is using quite a bit of CPU right now. Sampling, substring matching, and skipping the work altogether if we can are all good ways to do this.
> perhaps you misunderstood,
Sorry about that. Yes, it makes total sense to try to do this early, then add sampling.
Regexp:

```
go.cpu.labels.pprof-10    330040   10853 ns/op
heap-10                   195819   18407 ns/op
dotnet.labels.pprof-10    143966   24949 ns/op
profile_java-10            61840   58016 ns/op
profile_nodejs-10         109437   32871 ns/op
profile_python-10         391418    9156 ns/op
profile_ruby-10           307896   11783 ns/op
profile_rust-10           636633    5668 ns/op
```
String matching (prefix or suffix):

```
go.cpu.labels.pprof-10   3418387    1060 ns/op
heap-10                  2446440    1483 ns/op
dotnet.labels.pprof-10   1751536    2057 ns/op
profile_java-10          2333244    1541 ns/op
profile_nodejs-10        1722144    2083 ns/op
profile_python-10        4395427     821 ns/op
profile_ruby-10          4546932     793 ns/op
profile_rust-10          2960469    1221 ns/op
```
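For reference, a hedged sketch of what such prefix/suffix matching could look like; this is an assumed shape, not the actual code from the PR, and the markers are examples:

```go
package langdetect

import "strings"

// detectLanguage scans the pprof string table with plain prefix/suffix
// checks, avoiding the regexp engine entirely. The markers below are
// examples only.
func detectLanguage(stringTable []string) string {
	for _, s := range stringTable {
		switch {
		case strings.HasPrefix(s, "runtime."):
			return "go"
		case strings.HasSuffix(s, ".py"):
			return "python"
		case strings.HasSuffix(s, ".rb"):
			return "ruby"
		case strings.HasPrefix(s, "java/"):
			return "java"
		}
	}
	return "unknown"
}
```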
String matching + sampling:

```
go.cpu.labels.pprof-10   2535682    1415 ns/op
heap-10                  7298155     493 ns/op
dotnet.labels.pprof-10   1748709    2064 ns/op
profile_java-10          3828352     940 ns/op
profile_nodejs-10        1721338    2087 ns/op
profile_python-10        7062268     513 ns/op
profile_ruby-10          9947100     364 ns/op
profile_rust-10          2941592    1225 ns/op
```
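And a sketch of the sampled variant, inspecting only every k-th entry; the stride and the matchOne helper are assumptions for illustration, not the PR's actual strategy:

```go
package langdetect

import "strings"

// matchOne applies the same per-string checks as detectLanguage above
// (hypothetical helper, trimmed to two markers for brevity).
func matchOne(s string) string {
	switch {
	case strings.HasPrefix(s, "runtime."):
		return "go"
	case strings.HasSuffix(s, ".py"):
		return "python"
	}
	return ""
}

// detectLanguageSampled inspects only every k-th entry of the string
// table: cheaper on large tables, but a sparse marker can be skipped,
// which is exactly the failure mode discussed below.
func detectLanguageSampled(stringTable []string, k int) string {
	if k < 1 {
		k = 1
	}
	for i := 0; i < len(stringTable); i += k {
		if lang := matchOne(stringTable[i]); lang != "" {
			return lang
		}
	}
	return "unknown"
}
```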
I'll go with option 2 (no sampling) for now. Sampling makes things better at times but it increases the likelihood of failing to detect a language.
Looks good!
Sampling benchmark results vastly depend on the input data. In the worst case, when none of the patterns match, the benefit will be much more pronounced.
```
go.cpu.labels.pprof-10   3418387    1060 ns/op   # String matching (prefix or suffix)
go.cpu.labels.pprof-10   2535682    1415 ns/op   # String matching + sampling
```
I'm curious how that's possible. Probably the string table is bloated with label values; I guess we should only test the relevant strings (function members).
Anyway, I like your PR, and I'm pretty sure that GetLanguage will be barely noticeable. Otherwise, there are ways to improve it further:
- We inspect up to N random functions, e.g., 3%, and run the detection on each selected function.
- Optionally, if we're not satisfied with the false-positive rate, we could inspect all the sampled functions and make the decision at the end, based on the confidence levels (see the sketch after this list).
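A sketch of that confidence-based variant; the sampling fraction, the threshold, and the matchOne helper are all made up for illustration:

```go
package langdetect

import "math/rand"

// guessWithConfidence samples a fraction of all functions, tallies
// per-language hits, and reports a language only when it clearly
// dominates the matched samples. matchOne is the hypothetical
// per-string check from the sketches above.
func guessWithConfidence(functions []string, frac, minShare float64) string {
	n := int(float64(len(functions)) * frac)
	if n == 0 || len(functions) == 0 {
		return ""
	}
	hits := make(map[string]int)
	matched := 0
	for i := 0; i < n; i++ {
		if lang := matchOne(functions[rand.Intn(len(functions))]); lang != "" {
			hits[lang]++
			matched++
		}
	}
	for lang, c := range hits {
		// With minShare = 0.8, 80% of the matched samples must agree.
		if matched > 0 && float64(c) >= minShare*float64(matched) {
			return lang
		}
	}
	return ""
}
```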
We're currently using a set of regular expressions to derive the source code language from the string table. This works well, but the overhead can be quite substantial.
We should probably consider optimizing the procedure. Some ideas:
Nevertheless, we should update all our SDKs to send some metadata (such as language, version, platform, etc.).