bryanhuhta opened 2 months ago
I'm wondering how the workflow changes. If I understand correctly, we will still be querying label values for the highest cardinality labels (the top 7, I believe). That makes me wonder if this approach is helpful in a common case:
I'm not sure if the number of requests matters much (unless we're talking about thousands of them, of course). In our case, I think there are two key factors:
I'm curious if we want to display a cardinality estimate of the labels. The current approach of summing values might give us somewhat trustworthy results if the values are normalized for the top/bottom K (though I'm not entirely sure). However, it's definitely not acceptable for display purposes.
I'd suggest using HyperLogLog++. There's a great Go implementation that should fit our use case perfectly: https://github.com/clarkduvall/hyperloglog. If we can tolerate a small skew of a few percent, we could potentially skip subsequent LabelValues calls entirely. Additionally, I'd consider displaying rough indicators instead of exact distinct counts – something like: 10+, 25+, 50+, 100+, 1k+, etc (with exact values for low cardinality labels, see below).
I'd consider adding a new API method.
A while ago, we discussed a method to get a summary of a dataset's dimensions, and we even drafted an API – here's a slightly modified version:
```protobuf
message DimensionsSummary {
  uint64 cardinality = 1; // number of unique dimension combinations (series).
  uint64 samples = 2;     // total number of samples in the time range.
  repeated DimensionSummary dimensions = 3; // distribution of samples across dimensions.
}

message DimensionSummary {
  string name = 1;
  uint64 cardinality = 2; // number of unique dimension combinations that include the dimension (series).
  uint64 samples = 3;     // total number of samples in the time range that include the dimension (any value).
  uint64 dimension_cardinality = 4; // total number of distinct dimension values.
  repeated DimensionValueSummary values = 5; // some sample values (e.g., top/bottom K).
}

message DimensionValueSummary {
  string name = 1;
  uint64 cardinality = 2; // number of unique dimension combinations that include the value (series).
  uint64 samples = 3;     // total number of samples in the time range that include the value.
}
```
We can simplify it further:
```protobuf
message LabelsSummary {
  repeated LabelSummary labels = 1;
}

message LabelSummary {
  // number of unique label values.
  uint64 label_cardinality_estimate = 1;
  // some sample label values (e.g., top/bottom K, or all values,
  // if a label only has very few values – say, less than 50).
  repeated string values = 2;
}
```
Please note that the API is public, so for internal communications, we should use something that handles HLL++. For example:
```protobuf
message LabelsSummary {
  repeated LabelSummary labels = 1;
}

message LabelSummary {
  // Opaque HyperLogLog representation that counts label values.
  bytes label_cardinality_estimate = 1;
  // If present, the summary includes all values: len(values) is the exact cardinality of the label.
  repeated string values = 2;
}
```
Just to clarify: shouldn't "top K" actually be "bottom K (distinct)"? I assume we don't want to group by labels with the highest cardinality. This is a tricky point because we need to strike a balance – grouping by low cardinality labels doesn’t make much sense either. It seems like we need to identify the "core" of the multi-dimensional dataset (similar to what PCA or kernel methods do). This could be a very interesting research area but is definitely out of scope for the PR.
> I'm not sure if the number of requests matters much
I'd argue the number of requests does matter. It may be easy for our backend to serve these requests, but each additional request the UI needs to make puts extra strain on the client's network and introduces a failure case the UI needs to handle. There is definitely an incentive for the UI to minimize the number of requests it makes. We should respect the fact that we serve clients who may have a slow or unreliable network, and minimizing failure cases on the frontend is always a good thing.
> I'm curious if we want to display a cardinality estimate of the labels
I need to reword the PR body, but these estimates are for ordering the visualizations and for treating extremely high cardinality labels differently in the UI. There are two problems at play here:

1. Displaying a (reasonably) precise cardinality count per label.
2. Handling extremely high cardinality labels (e.g., for a `request_id` label with 50k values, we don't want to go ask for all 50k values and display them in the query builder dropdown).

For problem 1, we do need a precise (or pretty precise) number, which is likely out of scope for this PR (but might incidentally be solved, depending on our approach for problem 2). For problem 2, we just need numbers that allow us to order the label cardinalities w.r.t. each other and to trim labels with cardinalities that are far too high.
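For problem 2, only the relative order matters, so rough estimates are enough. A minimal sketch (hypothetical names, not code from this PR) of ordering labels by estimated cardinality and trimming the far-too-high ones:

```go
package main

import (
	"fmt"
	"sort"
)

// labelEstimate pairs a label name with a rough cardinality estimate.
// (Illustrative type, not from the PR.)
type labelEstimate struct {
	Name     string
	Estimate uint64
}

// selectLabels drops labels whose estimated cardinality is far too high to
// display, then orders the rest by estimate (highest first) and keeps the
// top n. Rough estimates suffice because only the ordering matters.
func selectLabels(labels []labelEstimate, n int, maxCardinality uint64) []labelEstimate {
	kept := make([]labelEstimate, 0, len(labels))
	for _, l := range labels {
		if l.Estimate <= maxCardinality {
			kept = append(kept, l)
		}
	}
	sort.Slice(kept, func(i, j int) bool { return kept[i].Estimate > kept[j].Estimate })
	if len(kept) > n {
		kept = kept[:n]
	}
	return kept
}

func main() {
	labels := []labelEstimate{
		{"request_id", 50000}, // trimmed: far too many values for a dropdown
		{"pod", 120},
		{"region", 4},
	}
	fmt.Println(selectLabels(labels, 2, 10000)) // [{pod 120} {region 4}]
}
```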
> I'd consider adding a new API method.
This suggestion is very compelling. I would like to explore this more!
> Just to clarify: shouldn't "top K" actually be "bottom K (distinct)"?
Yes, you are correct in this observation; bottom K (distinct) is a more appropriate description of this operation.
closes https://github.com/grafana/pyroscope/issues/3226
Overview
This PR is a POC change that allows requesting estimated label value cardinality alongside label names. This will allow the UI to make substantially fewer requests in order to populate the label view (see below).
In order for the UI to render the labels bar, it needs to:

1. Make a `/LabelNames` request.
2. Make a `/LabelValues` request per label.

This is quite expensive in terms of network cost to select the top N labels by cardinality.
Approach
This PR modifies the `/LabelNames` API so a caller can optionally request estimated cardinality. The response then looks like this:
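The original request/response snippets did not survive here. As a purely hypothetical sketch (field names and numbers are illustrative, not the PR's actual schema), the change might look like:

```protobuf
message LabelNamesRequest {
  // Existing request fields elided.
  // When true, the response includes an estimated cardinality per label name.
  bool include_cardinality = 10;
}

message LabelNamesResponse {
  repeated string names = 1;
  // 1:1 with names when cardinality is requested; empty otherwise.
  repeated uint64 estimated_cardinality = 2;
}
```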
Here, `names` and `estimated_cardinality` are in 1:1 correspondence (if cardinality is requested). If cardinality is not requested, `estimated_cardinality` will be empty.

Note that the returned cardinality is an estimate: this PR does not deduplicate merged responses. This is fine, as the UI only needs a way to select the top N labels by cardinality to render the label bar. It can make subsequent `/LabelValues` requests to find exact counts. This should reduce network calls from 20 or more to ~7.

Performance
I ran three 10 minute load tests, each requesting 24 hours of data for `service_name="fire-dev-001/querier"`:

1. `/LabelNames` with no cardinality
2. `/LabelNames` with cardinality
3. `/LabelNames` with no cardinality followed by N `/LabelValues` requests
TODO(bryan): Attach benchmark results
[Benchmark charts: `/LabelNames` with no cardinality, `/LabelNames` with cardinality, `/LabelNames` + `/LabelValues`, and the Explore Profiles view]
Results
As expected, this PR puts more pressure on store gateways fulfilling `/LabelNames` requests with cardinality, as they also need to look up the label values when scanning postings. However, I think this tradeoff is acceptable because it reduces the subsequent network calls the store gateways need to handle: a `/LabelNames` request that is 2x more expensive alleviates the need for ~15 cheaper requests.

Alternatives
This PR provides one possible approach, which is to reuse the `/LabelNames` endpoint. An alternative is to create a brand new endpoint, `/LabelCardinality`. This has the advantage of making performance characteristics much more predictable – e.g., a `/LabelNames` call with cardinality doesn't suddenly become much more expensive. We could also tune the endpoint to do the "top N" calculation server-side and provide the exact cardinality instead of an estimate. A drawback, of course, is yet another API endpoint specifically to power a single component in the UI.
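For comparison, a hypothetical shape for such a dedicated endpoint (message and field names are illustrative only), with the "top N" selection and exact counting done server-side:

```protobuf
message LabelCardinalityRequest {
  // Time range and matchers elided.
  // Return only the N highest-cardinality labels, computed server-side.
  int64 top = 1;
}

message LabelCardinalityResponse {
  repeated LabelCardinality labels = 1;
}

message LabelCardinality {
  string name = 1;
  uint64 cardinality = 2; // exact, since the server does the full computation
}
```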