Open mschurr opened 1 year ago
It's not clear to me that "distinct" would be the correct description of what you have when evaluating a window within a time series. It seems like if you are observing users, this would track "new" users within the window. For example, `userA` is observed before the window, then `userA` is observed again inside the window. In that case, `userA` would not be newly counted in the output from a query over that window. Am I missing something?
HLLs can be thought of as roughly equivalent to hash sets, but with a few limitations. They only support a few operations (insert, union, and len). You cannot test membership, remove items, or iterate over items. Additionally, len returns an estimate rather than an exact value. The trade-off is that the memory usage is constant (irrespective of the number of items placed into the set) and all operations are also constant time. For a timeseries database, this is perfect, because you only need those three operations, and you need them to be fast.
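To make the hash-set analogy concrete, here is a toy stand-in that mimics the three supported operations using an exact Python set. The class name and method names are illustrative only; a real HLL stores a fixed-size register array and `len()` returns an estimate rather than an exact count.

```python
# Toy mock of the HLL interface. Only insert, union, and len are
# supported -- no membership tests, removal, or iteration.
class ToyHLL:
    def __init__(self):
        self._items = set()              # real HLL: fixed-size register array

    def insert(self, item):
        self._items.add(item)            # real HLL: hash item, bump one register

    def union(self, other):
        merged = ToyHLL()
        merged._items = self._items | other._items  # real HLL: element-wise max
        return merged

    def __len__(self):
        return len(self._items)          # real HLL: estimate from registers

# Two aggregation windows with one overlapping IP:
a, b = ToyHLL(), ToyHLL()
a.insert("50.50.50.1"); a.insert("50.50.50.2")
b.insert("50.50.50.2"); b.insert("50.50.50.3")
print(len(a.union(b)))  # 3 -- the overlap is counted only once
```

The only behavioral difference from a real HLL here is exactness: the probabilistic version trades exact counts for constant memory and constant-time operations.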
My proposal is to allow counting the number of distinct things placed into a (hyperloglog) set during an aggregation window. When merging across aggregation windows (or across series), we would take the (hyperloglog) set union.
Let's say each time we scrape, we are collecting the HLLSets containing the data for the last 1 minute, e.g.:
```
# Shows what the scrape endpoint exposes at each time.

@time t1
distinct_ips_by_domain{cdn="cloudflare", domain="google.com"}   HLLSet<{"50.50.50.1"}>
distinct_ips_by_domain{cdn="cloudflare", domain="facebook.com"} HLLSet<{"60.60.60.1"}>
distinct_ips_by_domain{cdn="cloudflare", domain="netflix.com"}  HLLSet<{"70.70.70.1"}>

@time t1 - 1m
distinct_ips_by_domain{cdn="cloudflare", domain="google.com"}   HLLSet<{"50.50.50.2"}>
distinct_ips_by_domain{cdn="cloudflare", domain="facebook.com"} HLLSet<{"60.60.60.2"}>
distinct_ips_by_domain{cdn="cloudflare", domain="netflix.com"}  HLLSet<{"70.70.70.2"}>

@time t1 - 2m
distinct_ips_by_domain{cdn="cloudflare", domain="google.com"}   HLLSet<{"50.50.50.3"}>
distinct_ips_by_domain{cdn="cloudflare", domain="facebook.com"} HLLSet<{"60.60.60.3"}>
distinct_ips_by_domain{cdn="cloudflare", domain="netflix.com"}  HLLSet<{"70.70.70.3"}>
```
So, if I queried `distinct_ips_by_domain{domain="google.com"}[3m] @ t1`, I would get back the following RangeVector:

```
{cdn="cloudflare", domain="google.com"}
  HLLSet<{"50.50.50.1"}> @ t1
  HLLSet<{"50.50.50.2"}> @ t1 - 1m
  HLLSet<{"50.50.50.3"}> @ t1 - 2m
```
I can then calculate the cardinality across time:
```
cardinality(
  union_over_time(
    distinct_ips_by_domain{domain="google.com"}[3m] @ t1
  )
)
==
cardinality(
  union_over_time(
    {cdn="cloudflare", domain="google.com"}
      HLLSet<{"50.50.50.1"}> @ t1
      HLLSet<{"50.50.50.2"}> @ t1 - 1m
      HLLSet<{"50.50.50.3"}> @ t1 - 2m
  )
)
==
cardinality(
  {cdn="cloudflare", domain="google.com"} HLLSet<{"50.50.50.1", "50.50.50.2", "50.50.50.3"}>
)
==
{cdn="cloudflare", domain="google.com"} 3
```
I can also calculate the cardinality across series:
```
cardinality(
  union by (cdn) (
    distinct_ips_by_domain @ t1
  )
)
==
cardinality(
  union by (cdn) (
    distinct_ips_by_domain{cdn="cloudflare", domain="google.com"}   HLLSet<{"50.50.50.1"}>
    distinct_ips_by_domain{cdn="cloudflare", domain="facebook.com"} HLLSet<{"60.60.60.1"}>
    distinct_ips_by_domain{cdn="cloudflare", domain="netflix.com"}  HLLSet<{"70.70.70.1"}>
  )
)
==
cardinality(
  {cdn="cloudflare"} HLLSet<{"50.50.50.1", "60.60.60.1", "70.70.70.1"}>
)
==
{cdn="cloudflare"} 3
```
I could combine the two as well:
```
cardinality(union by (cdn) (union_over_time(distinct_ips_by_domain[3m] @ t1)))
==
{cdn="cloudflare"} 9
```
So, in Grafana, if I wanted to know the number of distinct IPs that accessed each domain in the selected time window, I could simply write `cardinality(union_over_time(distinct_ips_by_domain[$__range]))`. If I wanted it as a timeseries, I could just replace `$__range` with `$__rate_interval`. I can also use `union` to aggregate across labels if needed.
Note that the intermediate HLLSets are not really meaningfully displayable in any way (they are just binary data), which is why I suggest making it an error to write a query that exposes the raw sets. In other words, `cardinality` must always be used in the query to turn the sets back into scalars.
I also suggest that the scrape endpoint exposes partially overlapping data in consecutive scrapes (maybe for up to 3 scrape intervals?) to make data loss less likely when a scrape fails.
Without having looked into this in detail, this is definitely one of those "touch everything throughout the stack" changes. I assume it deserves a discussion during the dev-summit (which is in summer break right now, but we'll have a big and fat in-person summit in September). I have added an item to the agenda. This should not keep anybody from discussing the proposal here on this issue in the meantime.
@mschurr Thank you for the great and thoroughly written-out proposal! This seems clearly useful, including the justification for needing server-side changes (for the ability to aggregate over dimensions and time ranges). As @beorn7 said it's definitely a huge change that would touch everything. @beorn7 just went through the same thing with the new native histogram metric type, all the way from instrumentation and bit-level details in the TSDB to new PromQL functionality, so he will be in a great position to judge how much work this really would be. But generally I'm not opposed if this finds a few interested implementors and owners.
Would base64 make sense, instead of hex encoding, for the text format?
@bernot-dev Potentially, unless there's anything intelligible for debugging that a human could read out of the hex values, but I doubt it. In any case, I think the exact text format encoding is a design decision that can be finalized at the very end. For native histograms, a new protobuf format was introduced for more efficient encoding, and only now are people thinking about also adding native histograms directly to the text format. This new metric type (iff we decide to go for it) could also use the protobuf format first.
The hyperloglog registers array (I think some of the other projects have taken to calling them hyperloglog sketches) isn't useful for human consumption. I guess if you wanted something useful to appear in the text format you could add the cardinality count in a comment (just keep in mind that cardinalities aren't scrapable, because you can't merge set lengths together, but they might be helpful for a human trying to debug by visiting the scrape endpoint in a browser).
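The point that cardinalities aren't scrapable is worth a tiny illustration: once members overlap, set lengths stop being additive, so the server needs the sketches themselves rather than precomputed counts.

```python
# Why scraped cardinalities can't be merged: with overlapping
# members, |A union B| != |A| + |B|. The server therefore needs the
# register arrays (the sets), not their lengths.
a = {"50.50.50.1", "50.50.50.2"}
b = {"50.50.50.2", "50.50.50.3"}
print(len(a) + len(b))  # 4 -- summing lengths double-counts the overlap
print(len(a | b))       # 3 -- the true distinct count
```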
BigQuery docs have some good examples: https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions
Sorry for the long delay, the dev-summit finally got to discussing this.
Part of the reason why this was on the backburner for so long is that it is really hard to come to a firm conclusion here.
The feature looks really neat, but nobody among the (certainly non-representative) attendees of the dev-summit could claim that they always wanted that. (Compare that to native histograms, which is a feature that at least some of us had desired for a long time. It still took very long to implement (first thoughts 2015, design doc 2021, first experimental release 2022, stable release maybe late 2024), because we knew it would require changes throughout the stack, and there were so many other tasks to do that promised a quicker return on similarly desired features and improvements.)
The feature might still be super useful. It might be one of those features where you only realize that you were missing them once you tried them.
However, while it doesn't need to change every piece of the stack (like native histograms), it still touches most of them (instrumentation, exposition format, TSDB, PromQL). Especially the TSDB part is quite critical here. After adding native histograms, we found ourselves with a lot of switch statements sprinkled all over the code to treat the various sample types differently. This approach clearly doesn't scale beyond what we have. Adding yet another sample type would require a pretty deep redesign of the code structure (and to be fair, it would be good to do that even if we don't add a new sample type anytime soon). Along a similar line, I would even argue that PromQL itself needs to become more type-aware in general, and even more so if we want to add yet another metric type.
So there are a lot of "ifs" to this proposal.
The usual next step at this stage would be to write a design doc over in https://github.com/prometheus/proposals . However, that's already quite a work investment, and we would not recommend it until there are developers willing to commit to the project. Otherwise, even an approved design doc might just sit there for years because nobody has the time to implement it.
The good news is that we didn't find any conceptual blockers at the dev-summit. From what we understood, it should fit into the Prometheus data and execution model. We will leave this issue open so that we don't forget about it (and to give people the opportunity to discover it and pick it up).
Proposal
I would like to propose a new metric type: Distinct Count.
A distinct count records the number of unique things placed into a set. However, exact precision is not required for performance monitoring/alerting, so we can use a probabilistic data structure called HyperLogLog.
I think the value such a metric type can provide is fairly obvious, for example tracking the number of unique users/visitors accessing the service during a time window.
Proposed Implementation: HyperLogLog Sets
A hyperloglog looks something like this:
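As a minimal sketch (the parameters here are illustrative, not a proposed standard): with p = 14 precision bits, the state is 2^14 one-byte registers, roughly 16 KiB, no matter how many values are inserted.

```python
import hashlib

P = 14                    # precision bits (illustrative choice)
M = 1 << P                # 16384 registers
registers = bytearray(M)  # constant-sized allocation

def insert(registers: bytearray, item: bytes) -> None:
    # Derive a 64-bit hash; SHA-256 stands in for a fast hash like xxHash.
    h = int.from_bytes(hashlib.sha256(item).digest()[:8], "big")
    idx = h >> (64 - P)                    # top P bits select a register
    w = h & ((1 << (64 - P)) - 1)          # remaining 64-P bits
    rank = (64 - P) - w.bit_length() + 1   # leading zeros + 1
    registers[idx] = max(registers[idx], rank)

for ip in ("50.50.50.1", "50.50.50.2", "50.50.50.3"):
    insert(registers, ip.encode())
print(len(registers))  # 16384 -- size is independent of insert count
```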
Notice that this memory allocation is constant-sized, irrespective of how many values are inserted into the set, while supporting accurate cardinality estimation for up to 2^64 distinct values. Additionally, all operations on the set are constant-time.
HyperLogLogs have been used to great success in tools like BigQuery, Redis, Presto, etc.
For a time series database, the only thing that matters is merge performance, so it’s absolutely critical to use uint8_t to store registers and ensure that the merge operation is vectorized by the compiler, even if this means using CGo.
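The merge itself reduces to an element-wise max over the two register arrays. Pure Python below for illustration; the production version would, as noted above, keep registers as `uint8_t` and rely on compiler-vectorized (SIMD) max.

```python
# Union of two HLLs = per-register maximum of their register arrays.
M = 1 << 14
a = bytearray(M)
b = bytearray(M)
a[0], b[0], b[1] = 3, 5, 2   # a few hand-set registers for the demo

merged = bytearray(max(x, y) for x, y in zip(a, b))
print(merged[0], merged[1])  # 5 2
```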
Proposed Exporter Changes (python example):
Note: the client should keep at least 3 scrape intervals worth of HyperLogLog sets in a rotating circular buffer. The set union of these sets is what would be exposed on the scrape endpoint. This is similar to how summaries handle quantiles, and makes data loss unlikely in the event of a failed scrape (at the cost of some precision).
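A hypothetical client-side sketch of that buffering (the `DistinctCount` class and its methods are invented for illustration, not an existing `prometheus_client` API; exact sets stand in for HLLs):

```python
from collections import deque

class DistinctCount:
    def __init__(self, name: str, window: int = 3):
        self.name = name
        self._ring = deque(maxlen=window)  # one per-interval set, oldest evicted
        self._ring.append(set())           # real impl: HLL register arrays

    def observe(self, value) -> None:
        self._ring[-1].add(value)          # real impl: hll.insert(value)

    def rotate(self) -> None:              # called once per scrape interval
        self._ring.append(set())

    def collect(self) -> set:
        # The scrape endpoint exposes the union of the buffered intervals,
        # so a single failed scrape does not lose data.
        return set().union(*self._ring)    # real impl: register-wise max

c = DistinctCount("distinct_ips_by_domain")
c.observe("50.50.50.1")
c.rotate()
c.observe("50.50.50.2")
print(len(c.collect()))  # 2 -- spans both buffered intervals
```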
Proposed Exposition Format Changes (text example):
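Purely as an illustration (the discussion above treats the exact encoding, hex vs. base64 vs. protobuf, as an open design decision), the text format might look like:

```
# TYPE distinct_ips_by_domain distinctcount
distinct_ips_by_domain{cdn="cloudflare",domain="google.com"} <encoded registers>
```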
Proposed PromQL Changes:
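Based on the examples earlier in the thread, the new PromQL surface would be roughly three constructs (signatures are a sketch, not a finalized design):

```
cardinality(v)              # HLLSet instant vector -> estimated distinct count
union_over_time(v[range])   # merge HLLSet samples across a range vector
union by (labels) (v)       # aggregate HLLSet samples across series
```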
Other Implications:
I don’t think any changes would be required by downstream consumers (like Grafana or the Prometheus UI), assuming that `cardinality()` is always used in the query to convert the values back to scalars. Perhaps we could make it an error to write a PromQL query that exposes the intermediate HLL state in its output?