grafana / carbon-relay-ng

Fast carbon relay+aggregator with admin interfaces for making changes online - production ready

Plans on RR (round robin) implementation? #474

Open akamensky opened 3 years ago

akamensky commented 3 years ago

Round Robin has been on the README for who knows how long (according to git blame, since 2014, commit 28dc6e03fd18f2e5ac0145d557b6df569946c291). However, since then it has remained "not implemented".

Graphite (and go-carbon/carbonapi) works amazingly well with data written completely in RR fashion (as in, the same metric's points being written in RR fashion to multiple destination hosts), so it would be great if carbon-relay-ng actually had RR functionality to spread metrics across many hosts.

In our use case it is something we are looking to do right now, as we have reached the SSD write-throughput capacity of a single carbon server, and managing metric assignments (using the matching mechanism) is going to be one hell of a manual job on thousands of already existing metrics.

Dieterbe commented 3 years ago

and managing metric assignments (using the matching mechanism) is going to be one hell of a manual job on thousands of already existing metrics.

this is why consistentHashing exists. consistent assignment without any manual configuration.

TBH i never really understood why anyone would want "true" round robin. my understanding of that is that it effectively routes metrics randomly, so all endpoints would see all metric names but the values for each metric would be sprayed around randomly.

IIRC that's why i left it in the readme, to see if anyone would ask about it. Very few people have asked for it. (you and one person in https://github.com/grafana/carbon-relay-ng/issues/23#issuecomment-357794864 to which i replied it's implemented, and i don't remember why i said that. maybe i was referring to consistent hashing, because i think sometimes those terms are used interchangeably)
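The consistent-hashing assignment described above can be sketched as a minimal hash ring. This is an illustration only, not carbon-relay-ng's actual implementation (which follows carbon's own hashing scheme); the backend names and replica count are made up:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring: each backend owns several
// points on the ring, and a metric routes to the first point at or
// after the hash of its full name.
type ring struct {
	points  []uint32
	backend map[uint32]string
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(backends []string, replicas int) *ring {
	r := &ring{backend: make(map[uint32]string)}
	for _, b := range backends {
		for i := 0; i < replicas; i++ {
			p := hash32(fmt.Sprintf("%s-%d", b, i))
			r.points = append(r.points, p)
			r.backend[p] = b
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick always returns the same backend for the same metric name,
// with no per-metric configuration.
func (r *ring) pick(metric string) string {
	h := hash32(metric)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.backend[r.points[i]]
}

func main() {
	r := newRing([]string{"carbon-a:2003", "carbon-b:2003", "carbon-c:2003"}, 100)
	fmt.Println(r.pick("servers.web01.cpu.user"))
}
```

Because assignment depends only on the hash, adding or removing a backend remaps only the metrics whose ring segment changed, rather than reshuffling everything.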

akamensky commented 3 years ago

I'd rather consider true RR here, as this would mean the IO of go-carbon is spread more evenly. As I understand it (this wasn't explained in the docs), consistent hashing means a specific prefix would always be sent to the same destination, which leaves room for IO imbalance.

In our case, as we have reached the SSD throughput limit, disk IO is the bottleneck, hence the concern about actually addressing IO scaling.

It does indeed mean the same metric would be "sprayed" across multiple hosts. I have done some testing on a very junky setup (using nginx as a dumb UDP LB) and I've seen no problems with that. Use of carbonapi allows easy querying of multiple hosts containing the same metric and merging the results into one response. I've also seen no performance issues there; arguably performance would be even better, as the read bottleneck in most cases is also IO.
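The "true" RR behavior discussed in this thread could be sketched as a simple next-in-turn picker, ignoring the metric name entirely. A minimal sketch (the destination names are placeholders, and this is not an existing carbon-relay-ng route type):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rr is a "true" round-robin picker: each incoming point goes to the
// next destination in turn, regardless of metric name, so every
// backend sees every metric name but only a slice of its values.
type rr struct {
	dests []string
	next  uint64
}

// pick returns the next destination, cycling through dests.
// atomic.AddUint64 keeps it safe under concurrent senders.
func (r *rr) pick() string {
	n := atomic.AddUint64(&r.next, 1) - 1
	return r.dests[n%uint64(len(r.dests))]
}

func main() {
	r := &rr{dests: []string{"go-carbon-1:2003", "go-carbon-2:2003"}}
	for i := 0; i < 4; i++ {
		fmt.Println(r.pick())
	}
	// prints go-carbon-1:2003 and go-carbon-2:2003 alternately
}
```

This is what makes reads depend on a merging layer like carbonapi: no single backend holds a complete series, so the slices must be recombined at query time.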

Dieterbe commented 3 years ago

consistent hashing means a specific prefix would always be sent to the same destination, which leaves room for IO imbalance.

it's not by prefix, but by hash of the full metric name. it generally results in a very even spread across all backends. (though there is 1 edge case where it doesn't, there is a bug about that - I think maybe that was #335 ).

If anyone opens a PR with RR support, i would review (and hopefully merge) it.