Open mschurr opened 1 year ago
It's not clear to me that "distinct" would be the correct description of what you have when evaluating a window within a time series. It seems like if you are observing users, this would track "new" users within the window. For example, `userA` is observed before the window, then `userA` is observed again inside the window. In that case, `userA` would not be newly counted in the output from a query over that window. Am I missing something?
HLLs can be thought of as roughly equivalent to hash sets, but with a few limitations. They only support a few operations (insert, union, and len). You cannot test membership, remove items, or iterate over items. Additionally, len returns an estimate rather than an exact value. The trade-off is that the memory usage is constant (irrespective of the number of items placed into the set) and all operations are also constant time. For a timeseries database, this is perfect, because you only need those three operations, and you need them to be fast.
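To make the hash-set analogy concrete, here is a toy stand-in that mimics the three supported operations using an exact Python set. The class name and method names are illustrative only; a real HLL stores a fixed-size register array and `len()` returns an estimate rather than an exact count.

```python
# Toy mock of the HLL interface. Only insert, union, and len are
# supported -- no membership tests, removal, or iteration.
class ToyHLL:
    def __init__(self):
        self._items = set()              # real HLL: fixed-size register array

    def insert(self, item):
        self._items.add(item)            # real HLL: hash item, bump one register

    def union(self, other):
        merged = ToyHLL()
        merged._items = self._items | other._items  # real HLL: element-wise max
        return merged

    def __len__(self):
        return len(self._items)          # real HLL: estimate from registers

# Two aggregation windows with one overlapping IP:
a, b = ToyHLL(), ToyHLL()
a.insert("50.50.50.1"); a.insert("50.50.50.2")
b.insert("50.50.50.2"); b.insert("50.50.50.3")
print(len(a.union(b)))  # 3 -- the overlap is counted only once
```

The only behavioral difference from a real HLL here is exactness: the probabilistic version trades exact counts for constant memory and constant-time operations.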
My proposal is to allow counting the number of distinct things placed into a (hyperloglog) set during an aggregation window. When merging across aggregation windows (or across series), we would take the (hyperloglog) set union.
Let's say each time we scrape, we are collecting the HLLSets containing the data for the last 1 minute, e.g.:
```
# Shows what the scrape endpoint exposes at each time.

@time t1
distinct_ips_by_domain{cdn="cloudflare", domain="google.com"}   HLLSet<{"50.50.50.1"}>
distinct_ips_by_domain{cdn="cloudflare", domain="facebook.com"} HLLSet<{"60.60.60.1"}>
distinct_ips_by_domain{cdn="cloudflare", domain="netflix.com"}  HLLSet<{"70.70.70.1"}>

@time t1 - 1m
distinct_ips_by_domain{cdn="cloudflare", domain="google.com"}   HLLSet<{"50.50.50.2"}>
distinct_ips_by_domain{cdn="cloudflare", domain="facebook.com"} HLLSet<{"60.60.60.2"}>
distinct_ips_by_domain{cdn="cloudflare", domain="netflix.com"}  HLLSet<{"70.70.70.2"}>

@time t1 - 2m
distinct_ips_by_domain{cdn="cloudflare", domain="google.com"}   HLLSet<{"50.50.50.3"}>
distinct_ips_by_domain{cdn="cloudflare", domain="facebook.com"} HLLSet<{"60.60.60.3"}>
distinct_ips_by_domain{cdn="cloudflare", domain="netflix.com"}  HLLSet<{"70.70.70.3"}>
```
So, if I queried `distinct_ips_by_domain{domain="google.com"}[3m] @ t1`, I would get back the following RangeVector:

```
{cdn="cloudflare", domain="google.com"}
  HLLSet<{"50.50.50.1"}> @ t1
  HLLSet<{"50.50.50.2"}> @ t1 - 1m
  HLLSet<{"50.50.50.3"}> @ t1 - 2m
```
I can then calculate the cardinality across time:
```
cardinality(
  union_over_time(
    distinct_ips_by_domain{domain="google.com"}[3m] @ t1
  )
)
==
cardinality(
  union_over_time(
    {cdn="cloudflare", domain="google.com"}
      HLLSet<{"50.50.50.1"}> @ t1
      HLLSet<{"50.50.50.2"}> @ t1 - 1m
      HLLSet<{"50.50.50.3"}> @ t1 - 2m
  )
)
==
cardinality(
  {cdn="cloudflare", domain="google.com"} HLLSet<{"50.50.50.1", "50.50.50.2", "50.50.50.3"}>
)
==
{cdn="cloudflare", domain="google.com"} 3
```
I can also calculate the cardinality across series:
```
cardinality(
  union by (cdn) (
    distinct_ips_by_domain @ t1
  )
)
==
cardinality(
  union by (cdn) (
    distinct_ips_by_domain{cdn="cloudflare", domain="google.com"}   HLLSet<{"50.50.50.1"}>
    distinct_ips_by_domain{cdn="cloudflare", domain="facebook.com"} HLLSet<{"60.60.60.1"}>
    distinct_ips_by_domain{cdn="cloudflare", domain="netflix.com"}  HLLSet<{"70.70.70.1"}>
  )
)
==
cardinality(
  {cdn="cloudflare"} HLLSet<{"50.50.50.1", "60.60.60.1", "70.70.70.1"}>
)
==
{cdn="cloudflare"} 3
```
I could combine the two as well:
```
cardinality(union by (cdn) (union_over_time(distinct_ips_by_domain[3m] @ t1)))
==
{cdn="cloudflare"} 9
```
So, in Grafana, if I wanted to know the number of distinct IPs that accessed each domain in the selected time window, I could simply write `cardinality(union_over_time(distinct_ips_by_domain[$__range]))`. If I wanted it as a timeseries, I could just replace `$__range` with `$__rate_interval`. I can also use `union` to aggregate across labels if needed.
Note that the intermediate HLLSets are not really meaningfully displayable in any way (they are just binary data), which is why I suggest making it an error to write a query that exposes the raw sets. In other words, `cardinality` must always be used in the query to turn the sets back into scalars.
I also suggest that the scrape endpoint exposes partially overlapping data in consecutive scrapes (maybe for up to 3 scrape intervals?) to make data loss less likely when a scrape fails.
Without having looked into this in detail, this is definitely one of those "touch everything throughout the stack" changes. I assume it deserves a discussion during the dev-summit (which is in summer break right now, but we'll have a big and fat in-person summit in September). I have added an item to the agenda. This should not keep anybody from discussing the proposal here on this issue in the meantime.
@mschurr Thank you for the great and thoroughly written-out proposal! This seems clearly useful, including the justification for needing server-side changes (for the ability to aggregate over dimensions and time ranges). As @beorn7 said it's definitely a huge change that would touch everything. @beorn7 just went through the same thing with the new native histogram metric type, all the way from instrumentation and bit-level details in the TSDB to new PromQL functionality, so he will be in a great position to judge how much work this really would be. But generally I'm not opposed if this finds a few interested implementors and owners.
Would base64 make sense, instead of hex encoding, for the text format?
@bernot-dev Potentially, unless there's anything intelligible for debugging that a human could read out of the hex values, but I doubt it. In any case, I think the exact text format encoding is a design decision that can be finalized at the very end. For native histograms, a new protobuf format was introduced for more efficient encoding, and only now are people thinking about also adding native histograms directly to the text format. This new metric type (iff we decide to go for it) could also use the protobuf format first.
The hyperloglog registers array (I think some of the other projects have taken to calling them hyperloglog sketches) isn't useful for human consumption. I guess if you wanted something useful to appear in the text format you could add the cardinality count in a comment (just keep in mind that cardinalities aren't scrapable, because you can't merge set lengths together, but they might be helpful for a human trying to debug by visiting the scrape endpoint in a browser).
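The point that cardinalities aren't scrapable is worth a tiny illustration: once members overlap, set lengths stop being additive, so the server needs the sketches themselves rather than precomputed counts.

```python
# Why scraped cardinalities can't be merged: with overlapping
# members, |A union B| != |A| + |B|. The server therefore needs the
# register arrays (the sets), not their lengths.
a = {"50.50.50.1", "50.50.50.2"}
b = {"50.50.50.2", "50.50.50.3"}
print(len(a) + len(b))  # 4 -- summing lengths double-counts the overlap
print(len(a | b))       # 3 -- the true distinct count
```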
BigQuery docs have some good examples: https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions
Sorry for the long delay, the dev-summit finally got to discussing this.
Part of the reason why this was on the backburner for so long is that it is really hard to come to a firm conclusion here.
The feature looks really neat, but nobody among the (certainly non-representative) attendees of the dev-summit could claim that they always wanted that. (Compare that to native histograms, which is a feature that at least some of us had desired for a long time. It still took very long to implement (first thoughts 2015, design doc 2021, first experimental release 2022, stable release maybe late 2024), because we knew it would require changes throughout the stack, and there were so many other tasks to do that promised a quicker return on similarly desired features and improvements.)
The feature might still be super useful. It might be one of those features where you only realize that you were missing them once you tried them.
However, while it doesn't need to change every piece of the stack (like native histograms), it still touches most of them (instrumentation, exposition format, TSDB, PromQL). Especially the TSDB part is quite critical here. After adding native histograms, we found ourselves with a lot of switch statements sprinkled all over the code to treat the various sample types differently. This approach clearly doesn't scale beyond what we have. Adding yet another sample type would require a pretty deep redesign of the code structure (and to be fair, it would be good to do that even if we don't add a new sample type anytime soon). Along a similar line, I would even argue that PromQL itself needs to become more type-aware in general, and even more so if we want to add yet another metric type.
So there are a lot of "ifs" to this proposal.
The usual next step at this stage would be to write a design doc over in https://github.com/prometheus/proposals . However, that's already quite a work investment, and we would not recommend it until there are developers willing to commit to the project. Otherwise, even an approved design doc might just sit there for years because nobody has the time to implement it.
The good news is that we didn't find any conceptual blockers at the dev-summit. From what we understood, it should fit into the Prometheus data and execution model. We will leave this issue open so that we don't forget about it (and to give people the opportunity to discover it and pick it up).
Proposal
I would like to propose a new metric type: Distinct Count.
A distinct count records the number of unique things placed into a set. However, exact precision is not required for performance monitoring/alerting, so we can use a probabilistic data structure called HyperLogLog.
I think the value such a metric type can provide is fairly obvious, for example tracking the number of unique users/visitors accessing the service during a time window.
Proposed Implementation: HyperLogLog Sets
A hyperloglog looks something like this:
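As a minimal sketch (the parameters here are illustrative, not a proposed standard): with p = 14 precision bits, the state is 2^14 one-byte registers, roughly 16 KiB, no matter how many values are inserted.

```python
import hashlib

P = 14                    # precision bits (illustrative choice)
M = 1 << P                # 16384 registers
registers = bytearray(M)  # constant-sized allocation

def insert(registers: bytearray, item: bytes) -> None:
    # Derive a 64-bit hash; SHA-256 stands in for a fast hash like xxHash.
    h = int.from_bytes(hashlib.sha256(item).digest()[:8], "big")
    idx = h >> (64 - P)                    # top P bits select a register
    w = h & ((1 << (64 - P)) - 1)          # remaining 64-P bits
    rank = (64 - P) - w.bit_length() + 1   # leading zeros + 1
    registers[idx] = max(registers[idx], rank)

for ip in ("50.50.50.1", "50.50.50.2", "50.50.50.3"):
    insert(registers, ip.encode())
print(len(registers))  # 16384 -- size is independent of insert count
```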
Notice that this memory allocation is constant-sized, irrespective of how many values are inserted into the set, while supporting accurate cardinality estimation for up to 2^64 distinct values. Additionally, all operations on the set are constant-time.
HyperLogLogs have been used to great success in tools like BigQuery, Redis, Presto, etc.
For a time series database, the only thing that matters is merge performance, so it’s absolutely critical to use uint8_t to store registers and ensure that the merge operation is vectorized by the compiler, even if this means using CGo.
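The merge itself reduces to an element-wise max over the two register arrays. Pure Python below for illustration; the production version would, as noted above, keep registers as `uint8_t` and rely on compiler-vectorized (SIMD) max.

```python
# Union of two HLLs = per-register maximum of their register arrays.
M = 1 << 14
a = bytearray(M)
b = bytearray(M)
a[0], b[0], b[1] = 3, 5, 2   # a few hand-set registers for the demo

merged = bytearray(max(x, y) for x, y in zip(a, b))
print(merged[0], merged[1])  # 5 2
```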
Proposed Exporter Changes (python example):
Note: the client should keep at least 3 scrape intervals worth of HyperLogLog sets in a rotating circular buffer. The set union of these sets is what would be exposed on the scrape endpoint. This is similar to how summaries handle quantiles, and makes data loss unlikely in the event of a failed scrape (at the cost of some precision).
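A hypothetical client-side sketch of that buffering (the `DistinctCount` class and its methods are invented for illustration, not an existing `prometheus_client` API; exact sets stand in for HLLs):

```python
from collections import deque

class DistinctCount:
    def __init__(self, name: str, window: int = 3):
        self.name = name
        self._ring = deque(maxlen=window)  # one per-interval set, oldest evicted
        self._ring.append(set())           # real impl: HLL register arrays

    def observe(self, value) -> None:
        self._ring[-1].add(value)          # real impl: hll.insert(value)

    def rotate(self) -> None:              # called once per scrape interval
        self._ring.append(set())

    def collect(self) -> set:
        # The scrape endpoint exposes the union of the buffered intervals,
        # so a single failed scrape does not lose data.
        return set().union(*self._ring)    # real impl: register-wise max

c = DistinctCount("distinct_ips_by_domain")
c.observe("50.50.50.1")
c.rotate()
c.observe("50.50.50.2")
print(len(c.collect()))  # 2 -- spans both buffered intervals
```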
Proposed Exposition Format Changes (text example):
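Purely as an illustration (the discussion above treats the exact encoding, hex vs. base64 vs. protobuf, as an open design decision), the text format might look like:

```
# TYPE distinct_ips_by_domain distinctcount
distinct_ips_by_domain{cdn="cloudflare",domain="google.com"} <encoded registers>
```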
Proposed PromQL Changes:
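Based on the examples earlier in the thread, the new PromQL surface would be roughly three constructs (signatures are a sketch, not a finalized design):

```
cardinality(v)              # HLLSet instant vector -> estimated distinct count
union_over_time(v[range])   # merge HLLSet samples across a range vector
union by (labels) (v)       # aggregate HLLSet samples across series
```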
Other Implications:
I don’t think any changes would be required by downstream consumers (like Grafana or the Prometheus UI), assuming that `cardinality()` is always used in the query to convert the values back to scalars. Perhaps we could make it an error to write a PromQL query that exposes the intermediate HLL state in its output?