uber-archive / statsrelay

A consistent-hashing relay for statsd and carbon metrics

Statsrelay fault tolerance #64

Open fabeschan opened 8 years ago

fabeschan commented 8 years ago

Hey guys,

I was under the impression that if a statsd host goes down, statsrelay would divert metrics that would otherwise be routed to the dead host over to the surviving hosts, taking advantage of the consistent hash ring. But it seems like this doesn't actually happen. Is this intended, or a bug?
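
For context on what that expectation looks like, here is a minimal, hypothetical Python sketch of a consistent hash ring (not statsrelay's actual C implementation; host names are placeholders): each host gets many virtual points on a ring, metric names hash onto the ring, and removing a dead host would reroute only that host's keys to the remaining hosts.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent hash ring: keys map to the next virtual point clockwise."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = {}            # ring point -> node
        self.sorted_points = []   # sorted ring points for bisection
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}:{i}")
            self.ring[point] = node
            bisect.insort(self.sorted_points, point)

    def remove(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}:{i}")
            del self.ring[point]
            self.sorted_points.remove(point)

    def get(self, key):
        # Wrap around to the first point if the key hashes past the last one.
        idx = bisect.bisect(self.sorted_points, self._hash(key)) % len(self.sorted_points)
        return self.ring[self.sorted_points[idx]]

ring = HashRing(["statsd-1", "statsd-2", "statsd-3"])
owner = ring.get("api.requests")   # e.g. "statsd-2"
ring.remove(owner)                 # simulate that host going down
print(ring.get("api.requests"))    # the key is now routed to a surviving host
```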

mtrienis commented 8 years ago

+1

JeremyGrosser commented 8 years ago

This is intended... If statsrelay diverted metrics to a different statsd instance, then you'd potentially have two statsd instances writing the same key,timestamp tuple to graphite with different values, neither of which would include all the data for that key. Statsrelay's use case is really more focused on performance, where a single statsd/statsite process cannot keep up with the volume of metrics you're sending.
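
To make that concern concrete, here is a small hypothetical Python illustration (not statsrelay code): if routing for a key flips to another statsd instance partway through a flush interval, each instance emits a partial aggregate for the same key and timestamp, and neither value matches the true total.

```python
from collections import Counter

# 10 increments of the same counter arriving within one flush interval.
increments = [("api.requests", 1)] * 10

# Normal case: all increments for the key reach the one instance that owns it.
single = Counter()
for key, value in increments:
    single[key] += value

# Failover case: routing flips mid-interval, so the increments are split
# between two statsd instances that both flush the same key and timestamp.
instance_a, instance_b = Counter(), Counter()
for i, (key, value) in enumerate(increments):
    (instance_a if i < 4 else instance_b)[key] += value

print(single["api.requests"])      # 10 -- the true count
print(instance_a["api.requests"])  # 4  -- partial aggregate
print(instance_b["api.requests"])  # 6  -- partial aggregate for the same key,timestamp
```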

fabeschan commented 8 years ago

Thank you for the explanation @JeremyGrosser... If this is the intention, what would you recommend doing in the case of failed nodes?

JeremyGrosser commented 8 years ago

You might want to take a look at the Lyft fork (https://github.com/lyft/statsrelay), it supports sending metrics to multiple backends simultaneously... This way you could run two sets of carbon servers for redundancy.
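
As a rough illustration of that dual-write idea (a conceptual Python sketch, not the Lyft fork's actual code; the hostnames and port are placeholders), every incoming metric line is copied to each configured backend set over UDP, so either stack alone holds a complete copy of the data.

```python
import socket

# Two independent statsd/carbon stacks; each receives every metric.
BACKEND_SETS = [("statsd-a.internal", 8125), ("statsd-b.internal", 8125)]

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def relay(metric_line: bytes) -> None:
    """Mirror the same metric line to every configured backend set."""
    for host, port in BACKEND_SETS:
        sock.sendto(metric_line, (host, port))

relay(b"api.requests:1|c")
```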

InfluxDB is worth a look too... It would replace your carbon servers for persistence and has its own replication/sharding implementation. I had quite a few issues last time I tried it, but I've heard it's gotten more stable since then.