Open seebs opened 11 months ago
Someone in the Gophers Slack #performance channel pointed out that, in fact, it's not zero that's magical; it's anti-sorted behavior. For instance, a 2 followed by many 1s hits the same failure mode as a 1 followed by many 0s.
Looking at a profile, all the CPU time is going into copies inside an insert, which suggests something is finding the right place in a slice to insert each value. So if inputs arrive in descending order, it probably ends up going quadratic, while unsorted inputs are fine. The case of "an occasional non-zero value followed by lots of zeros" just happens to match that behavior.
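To make that concrete, here's a minimal sketch of the suspected pattern (not the actual perks code, just the general shape): keep a slice sorted by inserting each new value at its search position. With descending input, every insert lands at the front, so every insert copies the whole tail of the slice, and the total work goes quadratic.

```go
package main

import (
	"fmt"
	"sort"
)

// insertSorted keeps s sorted by inserting v at its search position.
// The copy shifts everything after the insertion point one slot to the
// right, which is the expensive part when v belongs near the front.
func insertSorted(s []float64, v float64) []float64 {
	i := sort.SearchFloat64s(s, v) // first index with s[i] >= v
	s = append(s, 0)               // grow by one
	copy(s[i+1:], s[i:])           // shift the tail right: O(len(s)-i)
	s[i] = v
	return s
}

func main() {
	var s []float64
	// "A 1 followed by many 0s": every 0 inserts at index 0 and copies
	// the entire slice built so far -- O(n) per insert, O(n^2) overall.
	s = insertSorted(s, 1)
	for i := 0; i < 5; i++ {
		s = insertSorted(s, 0)
	}
	fmt.Println(s) // [0 0 0 0 0 1]
}
```

Ascending or random input mostly inserts near the end, so the copies stay short; that would explain why only the anti-sorted pattern hurts.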
Thanks for the extensive investigation! Also saw the discussion on the Go performance channel, good job! I think that's enough information for us to try to optimize things, or at least understand the limitations.
I never looked into the summary code, but happy to have fun checking it out in January. In the meantime help wanted (: Perhaps @beorn7 might know straight away too.
PS: In my experience, the summary type has very specific use cases, which are generally even less applicable now that native histograms exist (more efficient, with automated bucketing). The client-side computation is a big downside of summaries indeed, no matter how we optimize things here. In other words, I wonder if you have considered moving to native histograms for your use case 🙈 (:
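For reference, switching the instrumentation side over is mostly a matter of setting NativeHistogramBucketFactor on a histogram; quantiles are then computed server-side with histogram_quantile(). This is just a sketch (the metric name is made up), and it assumes client_golang >= v1.14 plus a Prometheus server with the experimental native-histograms feature enabled:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// NativeHistogramBucketFactor turns this into a native histogram;
	// 1.1 means roughly 10% relative resolution per bucket.
	latency := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:                        "request_duration_seconds",
		Help:                        "Request latency.",
		NativeHistogramBucketFactor: 1.1,
	})
	prometheus.MustRegister(latency)
	latency.Observe(0.042)
}
```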
The summaries use the classical algorithm often called CKMS (after the initials of its authors); see https://grafana.com/blog/2022/03/01/how-summary-metrics-work-in-prometheus/ for some detail.
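For context, that stream only exists if the summary is configured with Objectives: each key is a quantile to track and each value is the allowed error for it, and those targets drive the underlying quantile stream. A minimal example (the metric name is illustrative):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Each Objectives entry is quantile -> allowed absolute error,
	// e.g. the 0.99 quantile reported with at most 0.001 error.
	s := prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "request_duration_seconds",
		Help:       "Request latency.",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		MaxAge:     10 * time.Minute, // sliding window the quantiles cover
	})
	prometheus.MustRegister(s)
	s.Observe(0.042)
}
```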
The Go implementation was originally by bmizerany. At some point, we moved to a fork of mine because merging bug fixes upstream was delayed, and then the fork somehow evolved more and more, including quite serious fixes, so we have used my fork ever since.
Given that somewhat wild history of the implementation, I wouldn't be surprised if there are still errors in it. Fixes are welcome (see https://github.com/beorn7/perks). It's equally possible that the problems are part of the CKMS algorithm itself, and there is nothing we can do except move everything to a different algorithm. (There was such an attempt in the past, but it didn't go anywhere.)
I could try out the histogram thing, I think for our use case it's slightly more convenient if the quantile is present in the reported data but we could probably adapt if we had to. We reduced the number of items being reported a bit and it stopped being an urgent problem for us.
I was recently running a benchmark and captured this high CPU and memory consumption during a sudden flood of traffic. It looked to me like it was related to what is reported here regarding summary metrics, especially the quantile stream.
using client_golang v1.17.0
This is probably one of the strangest things I've ever encountered, and I struggled quite a bit trying to reproduce it. Under some workloads, we were seeing the flush in a summary metric consume >60% of total program CPU time, which is obviously insane. With slightly different workloads, though, it's not that we saw maybe 40% or 50% instead; we saw so little CPU that we had to search for it in the profile to see it at all.
I've put up a small test program on the go playground: https://go.dev/play/p/VUpA2pLNQtS
To summarize:
So, for instance, with the provided test program, on my laptop:
It's particularly notable to me that it's specific to the value 0. If we write an occasional `1` in a stream that's otherwise all `2`s, that's fine, but if the stream is otherwise all `0`s, we get this pathological behavior. I doubt that you actually need the 2M values a second to trigger this at all, but it's certainly easier to see at that point.
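A rough sketch of that comparison (the playground link above has the actual test program; the metric name, objectives, and exact mix of values below are illustrative rather than copied from it):

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// run observes mostly `filler` with an occasional 1 mixed in and returns
// how long the whole loop took.
func run(filler float64) time.Duration {
	s := prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "repro_duration_seconds",
		Help:       "Reproducer for the summary flush pathology.",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})
	start := time.Now()
	for i := 0; i < 2_000_000; i++ { // roughly the 2M values/second from the report
		if i%1000 == 0 {
			s.Observe(1) // the occasional non-zero value
		} else {
			s.Observe(filler)
		}
	}
	return time.Since(start)
}

func main() {
	fmt.Println("mostly 0s:", run(0)) // pathological case: the 1s sort above the bulk
	fmt.Println("mostly 2s:", run(2)) // fine: the 1s sort below the bulk
}
```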
We're not in any urgent need of a fix, since in the actual use case in question, the events that were generating the flood of 0 metrics were amenable to "well, don't do that then" as a solution. But I feel like "a stream of data where a whole lot of it is 0s, but not quite all" must be a case that happens occasionally. (This might be a partial explanation for the other issue I've seen suggesting a linked list instead of a slice, since that looks like about where in the code a profile showed this acting up for me, too.)