grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

how to copy a fraction of metrics to a test cluster? #331

Closed mwtzzz-zz closed 6 years ago

mwtzzz-zz commented 6 years ago

We have a production cluster of carbon-c-relay hosts that forward metrics, using the fnv1a_ch hash, to a dozen backend hosts running carbon-cache. I'd like to spin up a small test cluster of two or three backend hosts, each of which receives the same share of production metrics as a single production box. In other words, I want each test host to receive 1/12th of the metrics, just like each prod host does. Is this possible? How would I configure the relay to do this?

mwtzzz-zz commented 6 years ago

The only way I can think of is to define a test cluster of 12 hosts in relay.conf but actually spin up only two or three of them. The relay will see the rest as offline and not send metrics to them. Does this seem like a valid approach?
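A minimal sketch of what such a relay.conf could look like, assuming carbon-c-relay's usual `cluster` and `match` syntax. All host names and ports here are hypothetical placeholders, and the existing production cluster is assumed to be named `production` and defined elsewhere in the same file. Since the consistent hash is computed over the listed members, each test host should receive roughly 1/12th of the traffic, though not necessarily the same metric names as any particular production host.

```
# Hypothetical test cluster: 12 members are declared, but only
# test01 and test02 are real machines. absent03..absent12 are
# placeholders that are never started.
cluster test
    fnv1a_ch
        test01:2003
        test02:2003
        absent03:2003
        absent04:2003
        absent05:2003
        absent06:2003
        absent07:2003
        absent08:2003
        absent09:2003
        absent10:2003
        absent11:2003
        absent12:2003
    ;

# No "stop" here, so matching continues with the next rule.
match *
    send to production
    ;

# Second copy of every metric goes to the test cluster.
match *
    send to test
    stop
    ;
```

Because the first rule has no `stop`, every metric is also evaluated by the second rule, so both clusters get a copy.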

azhiltsov commented 6 years ago

This approach should work. For the absent nodes you can use a discard socket, or other instances of carbon-c-relay, so that metrics are discarded immediately instead of being held in the relay's in-memory output buffers.

grobian commented 6 years ago

What I used to do is copy the metrics from the production cluster hosts. This may sound a bit odd, and it can only be done if you have sufficient outbound network capacity. The carbon-c-relay on the production cluster hosts can be configured to send not just to its backend carbon-cache.py but also to one or more other hosts. I used this extensively to capacity-test new configurations, by adding more cluster hosts that forward their metrics to a single box, and also to replicate a destination host as a preparatory step before migrating it.

deniszh commented 6 years ago

I think @mwtzzz is asking for a solution to the opposite problem: not "many clusters to a single host" but "fraction of a big cluster to a single host".

mwtzzz-zz commented 6 years ago

Yes, I want "fraction of a big cluster to a single host", so that this host gets the same workload that a regular production host gets. I want to benchmark disk performance on this host using different disk types and filesystems.

@azhiltsov What do you mean by "discard socket for absent nodes"? Can you give me an example?

grobian commented 6 years ago

producer -> carbon-c-relay (relay) -> carbon-c-relay + carbon-cache.py (production host)

You may or may not have carbon-c-relay running on the production host. I assume you do, because carbon-cache.py doesn't parallelise much (at least it didn't). In that scenario, changing one or more of those production-host relays (as many as you want) to also forward to a test host achieves the effect you're after.
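As a rough illustration of that scenario, the relay running on one production backend host could be configured along these lines. This is only a sketch: the cluster names, the test host name, and the ports are hypothetical, and a real backend relay would likely spread load over several local carbon-cache.py instances rather than a single one.

```
# Hypothetical: the local carbon-cache.py instance on this
# production backend host.
cluster local_cache
    forward
        127.0.0.1:2103
    ;

# Hypothetical: the test host that should see the same traffic
# as this production host.
cluster test_copy
    forward
        testhost01:2003
    ;

# Deliver to the local cache as before; no "stop", so the next
# rule is evaluated as well.
match *
    send to local_cache
    ;

# Additionally copy everything this host receives to the test host.
match *
    send to test_copy
    stop
    ;
```

The test host then receives exactly what this one production host receives, which is the "identical traffic" property the middle relay layer cannot easily provide.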

I think there were ideas and plans at some point to allow "mirroring" destination hosts from clusters, but none of this got implemented because it was difficult to express.

Think of something like "send 20 percent to test-cluster" in a match rule. That would make sense for ramping up traffic, but it doesn't meet your "identical traffic" requirement.

mwtzzz-zz commented 6 years ago

It seems the relay only provides the option to send metrics to a host and port, but no option to send to a "/dev/null" where the metric can be dropped.

So, maybe I can configure different ports on my test backend host and configure the service on those ports to do nothing other than blackhole everything it receives. If I do this, then I'm assuming my production relay layer memory output buffers won't get backed up, and my new backend host won't get overloaded.

It would look like this:

Production Relay cluster    ----->  Production backend cluster (13 hosts)
                             | -->  Test backend cluster (2 hosts) 

Where the test hosts receive a copy of everything but most of the metrics are blackholed.
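One way the blackholing service on those extra ports could be built is with additional carbon-c-relay instances, as suggested earlier in the thread. The sketch below assumes the relay's built-in `blackhole` target; the idea of running one such instance per port to be absorbed (started with `-f` for the config file and `-p` for the listen port) is an assumption about the setup, not something confirmed in this thread.

```
# Discard-only relay config: accept every metric and drop it
# immediately, so the sending relay's output buffers never back up.
match *
    send to blackhole
    stop
    ;
```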

mwtzzz-zz commented 6 years ago

@grobian Ah... you're saying configure the backend relay (not the middle layer relay) to send to a couple test hosts.... This makes sense, why didn't I think of that? Just configure two of the backend hosts to copy everything they get. Yes, this should achieve what I want.

grobian commented 6 years ago

Yes, if you have that option, I'd do it, as you'd likely have to modify your cluster definitions in order to allow getting exact replicas anyway.

mwtzzz-zz commented 6 years ago

Great, thanks everyone for your help. I'll post results here when I have them.