prometheus / client_golang

Prometheus instrumentation library for Go applications
https://pkg.go.dev/github.com/prometheus/client_golang

summary type metric: pathological performance degradation when observing lots of zeros #1412

Open seebs opened 10 months ago

seebs commented 10 months ago

using client_golang v1.17.0

This is probably one of the strangest things I've ever encountered, and I struggled quite a bit to reproduce it. Under some workloads, we were seeing the flush in a summary metric consume >60% of total program CPU time, which is obviously insane. With slightly different workloads, though, we didn't just see that drop to maybe 40% or 50%; we saw so little CPU that we had to search the profile to find it at all.

I've put up a small test program on the go playground: https://go.dev/play/p/VUpA2pLNQtS

To summarize, here's what the provided test program does on my laptop:

$ ./summary -procs 8 -first -value 2 -rate 2080000 -secs 10
8 procs, 2080000 samples per second for 10 secs, 2080/ms, 260/proc/ms
1000 rounds, 999.968101ms elapsed
2000 rounds, 1.000117226s elapsed
3000 rounds, 999.884346ms elapsed
4000 rounds, 1.000144866s elapsed
5000 rounds, 999.953162ms elapsed
6000 rounds, 999.951624ms elapsed
7000 rounds, 999.975637ms elapsed
8000 rounds, 1.000083589s elapsed
9000 rounds, 999.938046ms elapsed
9.999977264s elapsed
$ ./summary -procs 8 -first -value 0 -rate 2080000 -secs 10
8 procs, 2080000 samples per second for 10 secs, 2080/ms, 260/proc/ms
1000 rounds, 1.000091221s elapsed
2000 rounds, 1.028908993s elapsed
3000 rounds, 1.373470493s elapsed
4000 rounds, 1.997987326s elapsed
5000 rounds, 3.407858654s elapsed
6000 rounds, 4.003588481s elapsed
7000 rounds, 4.748205114s elapsed
8000 rounds, 5.465365788s elapsed
9000 rounds, 6.129958022s elapsed
35.764543742s elapsed
$ ./summary -procs 8 -value 0 -rate 2080000 -secs 10
8 procs, 2080000 samples per second for 10 secs, 2080/ms, 260/proc/ms
1000 rounds, 1.000075348s elapsed
2000 rounds, 999.975518ms elapsed
3000 rounds, 1.00003006s elapsed
4000 rounds, 999.994831ms elapsed
5000 rounds, 1.000009216s elapsed
6000 rounds, 999.987056ms elapsed
7000 rounds, 999.998651ms elapsed
8000 rounds, 1.0000061s elapsed
9000 rounds, 999.992415ms elapsed
10.000071504s elapsed

It's particularly notable to me that it's specific to the value 0. If we write an occasional 1 into a stream that's otherwise all 2s, that's fine, but if the stream is otherwise all 0s, we get this pathological behavior.

I doubt you actually need 2M values a second to trigger this at all, but it's certainly easier to see at that rate.
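For concreteness, the shape of workload that hits the bad case is roughly the following (a minimal, single-goroutine sketch against the public client_golang Summary API; the objectives, loop count, and every-Nth cadence are placeholders, not the playground program's exact parameters):

package main

import (
    "fmt"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // Placeholder objectives; the playground program may use different ones.
    sum := prometheus.NewSummary(prometheus.SummaryOpts{
        Name:       "repro_summary",
        Help:       "Summary fed mostly zeros with an occasional larger value.",
        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
    })

    start := time.Now()
    for i := 0; i < 5_000_000; i++ {
        if i%10_000 == 0 {
            sum.Observe(1) // the occasional non-zero observation
        } else {
            sum.Observe(0) // the flood of zeros
        }
    }
    fmt.Println("elapsed:", time.Since(start))
}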

We're not in any urgent need of a fix: in our actual use case, the events generating the flood of 0 observations were amenable to "well, don't do that then" as a solution. But "a stream of data that is mostly, but not quite entirely, zeros" seems like a case that must come up occasionally. (This might also be a partial explanation for the other issue suggesting a linked list instead of a slice, since that looks like roughly the spot in the code where a profile showed this acting up for me, too.)

seebs commented 10 months ago

Someone in Gophers Slack #performance pointed out that, in fact, it's not zero that's magical; it's anti-sorted input. For instance, a 2 followed by many 1s hits the same failure mode as a 1 followed by many 0s.

From the profile, essentially all the CPU time is in copies inside an insert. It looks like something is searching for the right position in a sorted slice and shifting the tail to make room, so if inputs arrive in descending order, every insert lands at the front and the whole thing goes quadratic, while unsorted inputs are fine. The "occasional non-zero value followed by lots of zeros" case just happens to match that descending pattern.
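To illustrate the suspected shape of the problem (a standalone sketch, not client_golang's actual code): keeping a slice sorted by searching for the insertion point and shifting the tail means descending input always inserts at the front, so every insert copies the whole slice and n inserts cost O(n^2), whereas ascending input effectively appends and stays cheap.

package main

import (
    "fmt"
    "sort"
    "time"
)

// insertSorted keeps xs sorted by finding the insertion point and shifting
// the tail over by one slot. Each call is O(len(xs)) in the worst case.
func insertSorted(xs []float64, v float64) []float64 {
    i := sort.SearchFloat64s(xs, v) // first index with xs[i] >= v
    xs = append(xs, 0)
    copy(xs[i+1:], xs[i:])
    xs[i] = v
    return xs
}

func main() {
    const n = 100_000

    // Ascending input: every value lands at the end, so the copy moves nothing.
    start := time.Now()
    var asc []float64
    for i := 0; i < n; i++ {
        asc = insertSorted(asc, float64(i))
    }
    fmt.Println("ascending: ", time.Since(start))

    // Descending input: every value lands at index 0, so every insert copies
    // the entire slice -- quadratic overall.
    start = time.Now()
    var desc []float64
    for i := n; i > 0; i-- {
        desc = insertSorted(desc, float64(i))
    }
    fmt.Println("descending:", time.Since(start))
}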

bwplotka commented 10 months ago

Thanks for the extensive investigation! I also saw the discussion on the Go performance channel, good job! I think this is enough information for us to try to optimize things, or at least understand the limitations.

I've never looked into the summary code, but I'm happy to have fun checking it out in January. In the meantime, help wanted (: Perhaps @beorn7 knows straight away, too.

PS: In my experience, the summary type has very specific use cases, and they become even narrower now that native histograms exist (more efficient, with automated bucketing). The client-side computation is a big downside of summaries no matter how much we optimize things here. In other words, I wonder if you've considered moving to native histograms for your use case 🙈 (:
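For illustration, opting in is mostly a matter of setting the native-histogram fields on HistogramOpts; a rough sketch (the metric name and parameter values here are arbitrary examples, not a recommendation for your workload):

package main

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // A native histogram uses sparse, exponentially spaced buckets; quantiles
    // are computed server-side (e.g. via histogram_quantile) instead of on
    // the client, which is where summaries pay their cost.
    hist := prometheus.NewHistogram(prometheus.HistogramOpts{
        Name: "example_duration_seconds",
        Help: "Example native histogram (name and parameters are arbitrary).",
        // Consecutive buckets grow by roughly 10%.
        NativeHistogramBucketFactor: 1.1,
        // Optional guard rails against unbounded bucket growth.
        NativeHistogramMaxBucketNumber:  100,
        NativeHistogramMinResetDuration: time.Hour,
    })
    prometheus.MustRegister(hist)

    hist.Observe(0.42)
}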

beorn7 commented 10 months ago

The summaries use the classical algorithm often called CKMS (after its authors' initials); see https://grafana.com/blog/2022/03/01/how-summary-metrics-work-in-prometheus/ for some detail.

The Go implementation was originally by bmizerany. At some point we moved to a fork of mine because merging bug fixes upstream was delayed, and then, somehow, the fork evolved more and more, including some quite serious fixes, so we have used my fork ever since.

Given that somewhat wild history of the implementation, I wouldn't be surprised if there are still errors in it. Fixes are welcome (see https://github.com/beorn7/perks). It's equally possible that the problems are inherent in the CKMS algorithm itself and there is nothing we can do except move everything to a different algorithm. (There was such an attempt in the past, but it didn't go anywhere.)
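For anyone digging in, the quantile stream from that fork is used roughly like this (a sketch of the package's public API, not client_golang's exact wiring; the objectives and counts here are arbitrary):

package main

import (
    "fmt"

    "github.com/beorn7/perks/quantile"
)

func main() {
    // Targeted quantiles with their allowed errors, mirroring the shape of
    // SummaryOpts.Objectives in client_golang.
    q := quantile.NewTargeted(map[float64]float64{
        0.5:  0.05,
        0.9:  0.01,
        0.99: 0.001,
    })

    // client_golang buffers observations and feeds them into a stream like
    // this one; the suspected slice-insertion cost sits behind Insert.
    q.Insert(1)
    for i := 0; i < 100_000; i++ {
        q.Insert(0)
    }

    fmt.Println("p99 estimate:", q.Query(0.99))
}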

seebs commented 10 months ago

I could try out the histogram approach. For our use case it's slightly more convenient if the quantile is present in the reported data itself, but we could probably adapt if we had to. We reduced the number of items being reported a bit, and it stopped being an urgent problem for us.