grafana / pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code
https://grafana.com/oss/pyroscope/
GNU Affero General Public License v3.0

Optimize language detection #2777

Closed kolesnikovae closed 11 months ago

kolesnikovae commented 11 months ago

We're currently using a set of regular expressions to derive the source code language based on the string table. This works well but the overhead might be quite substantial:

*(screenshot: CPU profile showing the overhead of language detection in the distributor)*
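The current approach boils down to running a set of regular expressions over every string-table entry. A minimal sketch of the idea — the patterns and function names here are illustrative, not the actual Pyroscope code:

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns only; the real rule set lives in the Pyroscope
// repository and is considerably more elaborate.
var langPatterns = []struct {
	lang string
	re   *regexp.Regexp
}{
	{"go", regexp.MustCompile(`\.go$|^runtime\.`)},
	{"python", regexp.MustCompile(`\.py$`)},
	{"ruby", regexp.MustCompile(`\.rb$`)},
	{"java", regexp.MustCompile(`^(java|jdk|sun)/`)},
}

// detectLanguage runs every pattern against every string-table entry and
// returns the first match — O(len(table) * len(patterns)) regex calls,
// which is where the CPU overhead comes from.
func detectLanguage(stringTable []string) string {
	for _, s := range stringTable {
		for _, p := range langPatterns {
			if p.re.MatchString(s) {
				return p.lang
			}
		}
	}
	return "unknown"
}

func main() {
	fmt.Println(detectLanguage([]string{"main.main", "/app/main.go"})) // prints "go"
}
```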

We should probably consider optimizing the procedure. Some ideas:

Nevertheless, we should update all our SDKs to send some meta (such as language, version, platform, etc)

cyriltovena commented 11 months ago

We actually don't need the SDK to do this.

@aleks-p recently added this into our usagestats; the same discovery process could be used at ingest time.

The big question, though, is how to store it and introduce it in the API.

We could store it in the TSDB (using hidden labels) or in Parquet.

For the pprof API we can use sample labels, but I'm not sure about the SelectMergeStacktraces API.

kolesnikovae commented 11 months ago

> We actually don't need the SDK to do this.

Knowledge of the exact client SDK version might be very valuable when it comes to parsing the payload (example). Over time we will inevitably want to deprecate some old versions, which is far easier when we know the version rather than guessing it.

Moreover, I believe we should be able to identify the profile origin, not just the SDK. For example, we could introduce an `__origin__` label that follows well-known User-Agent semantics, or better yet use multiple labels, to make it more structured.

I'd like to hear your opinion @korniltsev

> @aleks-p Recently added this into our usagestats, the same discovery process could be used at ingest time.

The issue is about optimisation of this process: it consumes up to 8–12% of CPU time in distributors (6 CPUs solely in our biggest cluster). I'm not sure I get your idea of using this at ingest time. If you mean that we should propagate the value (e.g., to the `spy_name` label, which we already have for this purpose), then I'm not sure how we're going to use it later. That said, I think it makes sense to perform the detection earlier, at parsing time, because in some cases we rely on the `spyName` API parameter but it may be missing.

> The big question though is how store and introduce it in the API ? We could store it in tsdb (using a hidden labels) or in parquet. For pprof API we can use sample labels, but not sure for the SelectMergeStacktraces API

For me the biggest question is how this information is supposed to be used. The only use case I'm aware of is flamegraph colouring and language-specific decoration of function names. But for that we need to know the language of each function (call-site) because each stack trace may include frames from different runtimes (userspace/kernel, FFI, etc). I'd say that "language" is an attribute of the mapping (in pprof terms; typically, the binary object/image the code is loaded from). I'd even say that the very idea of "profile-wise language" is somewhat limited; it works well for statistical purposes but not beyond.

Although we could potentially inform the frontend about the "major" language, I'm not sure it's worth it: the frontend can do this on its own easily. Randomly sample some functions from the profile, check the regexes, and voilà – you know the "major" language of the profile.
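That frontend-side idea could be sketched roughly as follows. Everything here is hypothetical: `majorLanguage` and `classify` are names invented for illustration, and `classify` stands in for whatever per-function detector is used:

```go
package main

import (
	"fmt"
	"math/rand"
)

// majorLanguage samples up to n function names at random and returns the
// language the classifier matched most often. classify is any per-name
// detector (hypothetical here); it returns "" when nothing matches.
func majorLanguage(names []string, n int, classify func(string) string) string {
	// Shuffle a copy so the caller's slice is left untouched.
	sampled := append([]string(nil), names...)
	rand.Shuffle(len(sampled), func(i, j int) {
		sampled[i], sampled[j] = sampled[j], sampled[i]
	})
	if n > len(sampled) {
		n = len(sampled)
	}
	counts := make(map[string]int)
	for _, name := range sampled[:n] {
		if lang := classify(name); lang != "" {
			counts[lang]++
		}
	}
	best, bestCount := "unknown", 0
	for lang, c := range counts {
		if c > bestCount {
			best, bestCount = lang, c
		}
	}
	return best
}

func main() {
	classify := func(s string) string {
		if len(s) > 3 && s[len(s)-3:] == ".py" {
			return "python"
		}
		return ""
	}
	fmt.Println(majorLanguage([]string{"a.py", "b.py", "c.py"}, 2, classify)) // prints "python"
}
```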

I think this topic intersects with another discussion of ours about displaying application versions (function breakdown by version) in the flamegraph.

aleks-p commented 11 months ago

@cyriltovena perhaps you misunderstood – this issue is primarily about optimizing the work done for the usage stats, as it looks like it uses quite a bit of CPU right now. Sampling, substring matching, and skipping it altogether where we can are all good ways to do this.

cyriltovena commented 11 months ago

> perhaps you misunderstood,

Sorry about that. Yes, it makes total sense to try to do this early, then add sampling.

aleks-p commented 11 months ago

Regexp:

```
go.cpu.labels.pprof-10            330040         10853 ns/op
heap-10                           195819         18407 ns/op
dotnet.labels.pprof-10            143966         24949 ns/op
profile_java-10                    61840         58016 ns/op
profile_nodejs-10                 109437         32871 ns/op
profile_python-10                 391418          9156 ns/op
profile_ruby-10                   307896         11783 ns/op
profile_rust-10                   636633          5668 ns/op
```

String matching (prefix or suffix):

```
go.cpu.labels.pprof-10           3418387          1060 ns/op
heap-10                          2446440          1483 ns/op
dotnet.labels.pprof-10           1751536          2057 ns/op
profile_java-10                  2333244          1541 ns/op
profile_nodejs-10                1722144          2083 ns/op
profile_python-10                4395427           821 ns/op
profile_ruby-10                  4546932           793 ns/op
profile_rust-10                  2960469          1221 ns/op
```

String matching + sampling:

```
go.cpu.labels.pprof-10           2535682          1415 ns/op
heap-10                          7298155           493 ns/op
dotnet.labels.pprof-10           1748709          2064 ns/op
profile_java-10                  3828352           940 ns/op
profile_nodejs-10                1721338          2087 ns/op
profile_python-10                7062268           513 ns/op
profile_ruby-10                  9947100           364 ns/op
profile_rust-10                  2941592          1225 ns/op
```

I'll go with option 2 (no sampling) for now. Sampling makes things better at times but it increases the likelihood of failing to detect a language.
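Option 2 replaces the regex scan with plain prefix/suffix comparisons. A minimal sketch of the technique — the function name and the rules below are illustrative, not the exact set from the PR:

```go
package main

import (
	"fmt"
	"strings"
)

// languageByAffix checks cheap prefix/suffix rules instead of regexes.
// strings.HasPrefix/HasSuffix are simple byte comparisons, which is why
// the benchmark shows roughly an order of magnitude improvement over the
// regex version. Rules here are illustrative only.
func languageByAffix(s string) string {
	switch {
	case strings.HasSuffix(s, ".go"):
		return "go"
	case strings.HasSuffix(s, ".py"):
		return "python"
	case strings.HasSuffix(s, ".rb"):
		return "ruby"
	case strings.HasPrefix(s, "java/") || strings.HasPrefix(s, "jdk/"):
		return "java"
	}
	return ""
}

func main() {
	fmt.Println(languageByAffix("/app/server.py")) // prints "python"
}
```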

kolesnikovae commented 11 months ago

Looks good!

The sampling benchmark results vastly depend on the input data. In the worst case, when none of the patterns matches, the benefit will be far more pronounced.

```
go.cpu.labels.pprof-10           3418387          1060 ns/op # String matching (prefix or suffix)
go.cpu.labels.pprof-10           2535682          1415 ns/op # String matching + sampling
```

I'm curious how that's possible – probably because the string table is bloated with label values – I guess we should only test relevant strings (function members).

Anyway, I like your PR, and I'm pretty sure that `GetLanguage` will be barely noticeable. Otherwise, there are ways to improve it further:

We inspect up to N random functions, e.g., 3% of them. For each selected function:

Optionally, if we're not satisfied with the false-positive rate, we could inspect all the sampled functions and make a decision at the end, based on the confidence levels.
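That confidence-based variant could look roughly like this. This is a sketch under assumptions: `decideWithConfidence`, `classify`, and the threshold are all hypothetical names and values, not anything from the PR:

```go
package main

import "fmt"

// decideWithConfidence inspects all sampled functions, tallies per-language
// matches, and only reports a language when the winner accounts for at
// least minShare of all matches — so a single stray match no longer
// decides the outcome. classify is a hypothetical per-name detector.
func decideWithConfidence(samples []string, classify func(string) string, minShare float64) string {
	counts := make(map[string]int)
	total := 0
	for _, s := range samples {
		if lang := classify(s); lang != "" {
			counts[lang]++
			total++
		}
	}
	best, bestCount := "unknown", 0
	for lang, c := range counts {
		if c > bestCount {
			best, bestCount = lang, c
		}
	}
	if total == 0 || float64(bestCount)/float64(total) < minShare {
		return "unknown"
	}
	return best
}

func main() {
	classify := func(s string) string {
		switch {
		case len(s) > 3 && s[len(s)-3:] == ".py":
			return "python"
		case len(s) > 3 && s[len(s)-3:] == ".go":
			return "go"
		}
		return ""
	}
	// Three Python hits vs. one Go hit: a 75% share clears a 0.6 threshold.
	fmt.Println(decideWithConfidence([]string{"a.py", "b.py", "c.py", "d.go"}, classify, 0.6)) // prints "python"
}
```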