thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Receiver: improve uniformity of load distribution across replicas #3793

Open · squat opened this issue 3 years ago

squat commented 3 years ago

https://github.com/thanos-io/thanos/issues/3488 describes an issue from a Thanos user that we have also seen in production at Red Hat. At its core, the load distribution for the ingestion of series in a Thanos receive hashring is not guaranteed to be uniform, and in some cases can be very uneven.

Background

The fact that one replica sees consistently higher load than the others is most likely due to some inherent lumpiness in the time series being sent. Thanos decides which replica should ingest data by hashing the name and label-value pairs of a time series and picking the corresponding replica from the ring to handle that hash. It seems that the data being sent simply has more time series that hash to one replica. There is no guarantee that data will be distributed uniformly across replicas; however, with a good hash function, the greater the number of time series and the more random their names and labels, the more closely we should converge towards a uniform distribution.
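
For illustration, here is a minimal sketch of this kind of placement logic, assuming xxhash over the tenant plus sorted label pairs and simple modulo placement; the actual Thanos hashring code differs in its details, and all names here are illustrative:

```go
package main

import (
	"fmt"
	"sort"

	"github.com/cespare/xxhash/v2"
)

type Label struct{ Name, Value string }

// hashSeries produces a deterministic hash for one time series from its
// tenant and its sorted label name/value pairs.
func hashSeries(tenant string, labels []Label) uint64 {
	sort.Slice(labels, func(i, j int) bool { return labels[i].Name < labels[j].Name })
	h := xxhash.New()
	h.WriteString(tenant)
	for _, l := range labels {
		h.WriteString(l.Name)
		h.WriteString("\xff") // separator to avoid ambiguous concatenations
		h.WriteString(l.Value)
		h.WriteString("\xff")
	}
	return h.Sum64()
}

// pickReplica maps the hash onto one of the hashring endpoints. The same
// series always lands on the same replica, which is exactly why lumpy data
// can overload a single node.
func pickReplica(endpoints []string, hash uint64) string {
	return endpoints[hash%uint64(len(endpoints))]
}

func main() {
	endpoints := []string{"receive-0", "receive-1", "receive-2"}
	series := []Label{{"__name__", "http_requests_total"}, {"job", "api"}}
	fmt.Println(pickReplica(endpoints, hashSeries("tenant-a", series)))
}
```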

Example

Unfortunately, there is no bug here. Take, for example, the case where a Prometheus server produces only a single time series and remote-writes a million samples per second. We would rightly expect extremely high load on a single hashring replica. This is essentially an extreme version of the case we see here.

Proposal

I suggest we improve the randomness of replica selection and thus drive load distribution towards uniformity even in the degenerate single-time-series case. One way to achieve this would be by adding a random value to the hash that changes for every request. It would be important that replicas forward this metadata along with the request so that once it is set, the random value is fixed and replicas can agree on who should ultimately ingest a sample. This would help solve load distribution for lumpy data.
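
A rough sketch of what this could look like (purely illustrative, not existing Thanos code; `requestNonce` and `hashWithNonce` are hypothetical names): the first receiver to handle a remote-write request draws a random nonce, mixes it into every series hash, and forwards the nonce as request metadata so all replicas agree on placement. Passing a fixed nonce would restore determinism, e.g. for tests.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"

	"github.com/cespare/xxhash/v2"
)

// requestNonce would be generated once per incoming remote-write request and
// forwarded alongside the payload (e.g. as gRPC/HTTP metadata) so that all
// replicas compute the same placement for that request.
func requestNonce() uint64 { return rand.Uint64() }

// hashWithNonce mixes the per-request nonce into the deterministic series
// hash, so the same series can land on different replicas across requests.
func hashWithNonce(seriesHash, nonce uint64) uint64 {
	var buf [16]byte
	binary.BigEndian.PutUint64(buf[:8], seriesHash)
	binary.BigEndian.PutUint64(buf[8:], nonce)
	return xxhash.Sum64(buf[:])
}

func main() {
	seriesHash := xxhash.Sum64String(`http_requests_total{job="api"}`)
	nonce := requestNonce()
	// Replica index for this particular request in a 3-node hashring.
	fmt.Println(hashWithNonce(seriesHash, nonce) % 3)
}
```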

WDYT @thanos-io/thanos-maintainers

Note: this proposal is adapted from my comment in https://github.com/thanos-io/thanos/issues/3488#issuecomment-778141503

bwplotka commented 3 years ago

I like it, but it would be nice to understand the trade-offs and challenges. One of those would be increased spreading of data across receive nodes: we would distribute data more widely, potentially ending up with hundreds of TSDB instances per receiver in the case of 100 tenants, right?

squat commented 3 years ago

I don't think there is really any negative trade-off here. We already expect every receive node in a hashring for a tenant to have a TSDB open for that tenant. This is because, for any serious load other than degenerate cases like the extreme example posed above (which are not normal), the probability that zero time series get hashed to a given node is effectively zero. So we would not expect the number of TSDBs per receiver to change at all. What would change is the predictability of where a time series ends up. This could make unit testing trickier, but that could be avoided by passing a hard-coded value for the "random" value.

squat commented 3 years ago

One drawback that does occur to me is that when querying for a single time series, queries will need to load data from all receivers in the hashring. Normally, all samples for a time series would be on one single replica, or, with replication-factor=3, three replicas would each contain all samples. Now, all replicas will likely contain some samples from a single time series, though none of the replicas will hold the entire series. This means more work joining the series at query time, but only before data is uploaded and deduplicated in object storage.
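
To make the extra join work concrete, a toy sketch (illustrative names only) of what the querier effectively has to do when each receiver returns only a partial, possibly overlapping, set of samples for the same series:

```go
package main

import (
	"fmt"
	"sort"
)

type Sample struct {
	Timestamp int64
	Value     float64
}

// mergeSeries joins the partial sample sets returned by each receiver into a
// single series, dropping samples that share a timestamp (replication copies).
func mergeSeries(parts ...[]Sample) []Sample {
	var all []Sample
	for _, p := range parts {
		all = append(all, p...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Timestamp < all[j].Timestamp })

	var out []Sample
	for _, s := range all {
		if len(out) > 0 && out[len(out)-1].Timestamp == s.Timestamp {
			continue // duplicate from another replica; keep the first copy
		}
		out = append(out, s)
	}
	return out
}

func main() {
	fromReceive0 := []Sample{{1, 1.0}, {3, 3.0}}
	fromReceive1 := []Sample{{1, 1.0}, {2, 2.0}}
	fmt.Println(mergeSeries(fromReceive0, fromReceive1)) // [{1 1} {2 2} {3 3}]
}
```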

brancz commented 3 years ago

I also don't see the per TSDB overhead changing at all.

We have a rather naive distribution hash right now (which, by the way, is also not safe from zone failure, another aspect I think we should address). I feel like experimenting with a variety of hashes might be a higher win. Would it be possible for you to record the label-sets of your active series and use that dataset as a basis for experimentation before we think about introducing randomness?
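
One possible way to capture such a dataset (a sketch, assuming a Prometheus instance reachable at `http://localhost:9090`; the matcher and output file name are arbitrary) is to dump the active series label-sets via the standard `/api/v1/series` endpoint and replay them through candidate hash functions offline:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

func main() {
	promURL := "http://localhost:9090" // assumption: local Prometheus

	// Match all series; narrow the matcher and add start/end on large servers.
	q := url.Values{}
	q.Set("match[]", `{__name__=~".+"}`)

	resp, err := http.Get(promURL + "/api/v1/series?" + q.Encode())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Save the raw JSON label-sets for offline hashing experiments.
	f, err := os.Create("active_series.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	if _, err := io.Copy(f, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```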

squat commented 3 years ago

By experimenting with hashes do you mean experimenting with different hashing functions? Or with adding more data to the hash? I see two orthogonal issues here:

  1. the hash's distribution; maybe xxhash is fast but not well distributed compared to other hashes?
  2. the underlying lumpiness of the data; how many different time series are we sending? And specifically how many time series with heavy load are we sending?

In the first case, if we assume that xxhash has a decent distribution, then non-uniform load is most likely due to lack of scale. The larger the set size, the less likely it is to see high variability in the distribution of load. Adding extra static data to the hash to try to shift around some non-uniform load will be just as likely to produce bad results as not adding it.

The second case, which is what this issue is concerned with, is about addressing cases where the load caused by a handful of time series is disproportionately greater than the load of all other 10000+ time series. In these cases, even if we have a perfectly uniform distribution of time series, where every replica gets exactly the same number of series, we can end up with a very unequal distribution of load if the few heavy series land on a single machine. This is not unlikely, because the set of high-load series is small, and thus large deviations from the mean are more probable.

One trick we can do is simulate having an infinite number of time series and thus improve the likelihood of an equal distribution of series across replicas. Essentially, instead of rolling the dice once for each series, we roll it every single time we get a request containing that series. I think this is better than finding the perfect hashing function, i.e. optimizing case 1, as it addresses the case of having too few time series to get a good distribution, solves the case of having lots of time series but only a few heavy series, and even solves the extreme case of only sending one single time series to the cluster.
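
A toy simulation of this argument (illustrative only, not Thanos code): one "hot" series carries most of the samples, so static per-series hashing pins its load to one replica, while mixing a fresh per-request nonce into the hash spreads it roughly evenly.

```go
package main

import (
	"fmt"
	"math/rand"

	"github.com/cespare/xxhash/v2"
)

func main() {
	const replicas = 3
	const requests = 100000

	// Ten series: series-0 is "hot" and sends 100 samples per request,
	// the others send 1 sample each.
	seriesHashes := make([]uint64, 10)
	for i := range seriesHashes {
		seriesHashes[i] = xxhash.Sum64String(fmt.Sprintf("series-%d", i))
	}

	staticLoad := make([]int, replicas)
	nonceLoad := make([]int, replicas)
	for r := 0; r < requests; r++ {
		nonce := rand.Uint64() // rolled once per request
		for i, h := range seriesHashes {
			samples := 1
			if i == 0 {
				samples = 100 // the hot series
			}
			staticLoad[h%replicas] += samples        // today: fixed placement
			nonceLoad[(h^nonce)%replicas] += samples // proposal: per-request placement
		}
	}
	fmt.Println("static placement:  ", staticLoad)
	fmt.Println("per-request nonce: ", nonceLoad)
}
```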

brancz commented 3 years ago

Can you explain what a single time series that is 10000x more expensive than others looks like? 10000x more samples per second, or what other characteristics? Even if outliers like that exist, it sounds to me that they should breach some limit and Thanos should simply not accept them (we don't have limiting mechanisms like that today, but I think it would make sense to have them).

I think I am starting to understand what @bwplotka meant though, as rolling the dice every time will essentially cause the series to be active on every single node, so we would end up having the series metadata overhead on every node.

squat commented 3 years ago

yes exactly

> 10000x more samples per second

this is one of the examples I was describing. Thanos has no abuse-limitation mechanisms, but even if it did, this is not necessarily abuse. What if it's legitimate traffic? Essentially, Thanos Receive can in many cases have bad load balancing.

Yes, bwplotka's comment is correct, but I think it's not directly related to this issue: if a hashring handles 100 tenants, then each replica in the hashring will likely have 100 TSDBs, but that is already the case today, independent of this proposal. For any significant load, we will already be in this scenario. There are ways to fix this, e.g. tiered time series routing, but that is an orthogonal problem to this issue. Even if we end up with only three nodes in a hashring of 100 nodes handling the time series for a tenant, there is still the possibility of bad load balancing across those nodes.

What IS a problem is that samples for a single time series will be distributed across all nodes, meaning increased overhead. I also tried to capture that in my earlier comment: https://github.com/thanos-io/thanos/issues/3793#issuecomment-778615327

brancz commented 3 years ago

Actually, I was thinking about the 10000x-more-samples-per-second case, and since chunks get mmapped to disk as soon as they are full at 120 samples, this shouldn't actually have the effect described. Do you have a dataset that can be shared that reproduces the effect you are aiming to solve? I think if we had that, we could be much more systematic about solving it than hypothesizing about problems and solutions.

squat commented 3 years ago

An example of a user report can be found in #3488, as I mentioned at the top of the issue. Note that this is by no means just a question of TSDB load but importantly also a question of distribution of load in the HTTP and request processing layers in Thanos receive.

brancz commented 3 years ago

I'm not actually seeing data validating the theory, other than memory usage. Your theory could be right, but it could just as well be a memory leak or something entirely different. The number of active series and samples per second on each receiver would be helpful to support the theory plus profiles of both processes. It could also be 10s of MB of text in label-values (which has happened at SoundCloud in the past for example).

brancz commented 3 years ago

@heartTorres would it be possible for you to supply us with this additional data?

stale[bot] commented 3 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

kakkoyun commented 3 years ago

Still valid.

bwplotka commented 3 years ago

Connected to https://github.com/thanos-io/thanos/pull/3390/files (finding better sharding)

stale[bot] commented 2 years ago

Closing for now as promised, let us know if you need this to be reopened! πŸ€—

yeya24 commented 2 years ago

Still valid.
