grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Any way to have multiple threads per carbon_ch destination without duplicating service? #443

Closed percygrunwald closed 7 months ago

percygrunwald commented 2 years ago

From what I can see, each destination in a carbon_ch output block gets a single thread. We have a configuration like this:

cluster storage_machines
  carbon_ch
  1.2.3.4:2203
  1.2.3.5:2203
  ;
match *
  send to storage_machines
  stop
  ;

And the thread for the first output is maxed out:

image

This results in queues rising until the limit is reached, then metrics start to get dropped.

Is there a way that we could get multiple threads for a single destination without changing the consistent hash?

I could do something like this:

cluster storage_machines
  carbon_ch
  1.2.3.4:2203=a
  1.2.3.4:2204=b
  1.2.3.5:2203=a
  1.2.3.5:2204=b
  ;

And have the reverse proxy on each storage host "merge" the results, but given that there aren't really 2 instances on the storage machine, the consistent hash view of tools like carbonate or buckytools will not be consistent with the view of the relay.
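To illustrate why splitting each host into instances a and b changes where metrics land, here is a minimal sketch of a consistent hash ring. It is not the carbon_ch algorithm itself: the key format, the replica count, and the 16-bit ring size are assumptions chosen for illustration, and build_ring and lookup are hypothetical helper names. The point it shows is only that the instance name feeds the hash, so a tool that sees one instance per host builds a different ring than a relay that sees two.

```python
import bisect
import hashlib


def _position(key):
    """Map a string to a 16-bit ring position (ring size is an assumption)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:2], "big")


def build_ring(nodes, replicas=100):
    """Return a sorted list of (position, (host, instance)) entries.

    Both host and instance go into the hashed key, so changing the
    instance list moves the node's positions on the ring.
    """
    ring = []
    for host, instance in nodes:
        for i in range(replicas):
            ring.append((_position(f"{host}:{instance}:{i}"), (host, instance)))
    ring.sort()
    return ring


def lookup(ring, metric):
    """Pick the first ring entry at or after the metric's position, wrapping around."""
    idx = bisect.bisect_left(ring, (_position(metric),))
    return ring[idx % len(ring)][1]


metrics = [f"servers.host{i}.cpu.user" for i in range(1000)]
single = build_ring([("1.2.3.4", ""), ("1.2.3.5", "")])
split = build_ring([("1.2.3.4", "a"), ("1.2.3.4", "b"),
                    ("1.2.3.5", "a"), ("1.2.3.5", "b")])
moved = sum(lookup(single, m)[0] != lookup(split, m)[0] for m in metrics)
print(f"{moved} of {len(metrics)} metrics map to a different host")
```

Under these assumptions, a tool that computes the ring from the single-instance view would look for some metrics on the wrong host, which is the inconsistency described above.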

Another option would be to spin up additional carbon-c-relay services and load balance output between them:

cluster storage_writers
  any_of
  127.0.0.1:2004
  127.0.0.1:2005
  127.0.0.1:2006
  127.0.0.1:2007
  ;
match *
  send to storage_writers
  stop
  ;

Where each entry in storage_writers is another instance of carbon-c-relay with the original configuration shown at the top. This seems like a really long walk to get 4 threads writing to 1.2.3.4:2203. I guess another way would be to have multiple instances of carbon-c-relay and reverse proxy all incoming metrics to them. This eliminates one instance of carbon-c-relay, but we're still multiplying instances of carbon-c-relay just to get multiple threads per destination. It would be nice to be able to specify the number of workers per destination, like we can specify the number of dispatchers with -w.

Thank you for any suggestions.

grobian commented 2 years ago

Hmmm, by design I never included the option to use parallel delivery. I need to understand what is causing your load: whether it is due to computing the hash, or whether it is related to locking on the main input queue (in which case threading won't help).

There is no way to do it right now, but since the code already has provisions to share an output queue, a global option like you mention may not be too difficult.
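As a rough picture of what "several workers sharing one output queue" could look like, here is a small sketch. It is not carbon-c-relay code and does not reflect how the relay is implemented: run_destination and the send callback are hypothetical names, and Python threads only illustrate the structure, not the CPU parallelism a C implementation would get.

```python
import queue
import threading


def run_destination(metrics, workers, send):
    """Drain one destination's queue with `workers` threads.

    Returns the metrics that were handed to `send`, in completion order.
    """
    q = queue.Queue()
    for m in metrics:
        q.put(m)

    sent = []
    sent_lock = threading.Lock()

    def worker():
        while True:
            try:
                m = q.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            send(m)
            with sent_lock:
                sent.append(m)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent


delivered = run_destination([f"metric.{i} 1 0" for i in range(100)],
                            workers=4, send=lambda line: None)
print(len(delivered))
```

Two caveats apply to this sketch: with more than one worker the delivery order is no longer guaranteed, and every worker still takes items from the same queue, so if the bottleneck is contention on that queue rather than the per-metric work, adding workers would not help.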

If I'd create an experimental patch would you be able to test if it has the desired effect?

percygrunwald commented 2 years ago

Hi @grobian, thank you for your reply.

> If I'd create an experimental patch would you be able to test if it has the desired effect?

Absolutely, I would happily do that.

I realize I also left out details about the release version and platform. I will confirm these and provide some more metrics tomorrow.

percygrunwald commented 2 years ago

Current env:

carbon-c-relay version: 3.4 (we should test with a newer release, I didn't realize we were on such an old release)
OS: Debian Jessie (kernel version 4.19)
CPU: Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz

As a baseline, looking at our current data, it looks like we are using around 2.6-3.3 us of wall time per metric for these carbon_ch outputs.

destinations.*.wallTime_us / destinations.*.sent
image

Based on the maximum value, we should be able to send around 300k metrics/CPU second, which is consistent with what we observed last week.
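The 300k figure follows directly from the per-metric wall time. A quick check of the arithmetic, assuming a single sending thread that is fully busy (the function name is just for illustration):

```python
def max_metrics_per_cpu_second(walltime_us_per_metric):
    """Upper bound on metrics sent per second by one fully busy thread."""
    return 1_000_000 / walltime_us_per_metric


for us in (2.6, 3.3):
    rate = max_metrics_per_cpu_second(us)
    print(f"{us} us/metric -> {rate:,.0f} metrics per CPU second")
# 2.6 us/metric -> 384,615 metrics per CPU second
# 3.3 us/metric -> 303,030 metrics per CPU second
```

The worst case of 3.3 us per metric gives roughly 303k metrics per CPU second, which matches the "around 300k" estimate.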

image

I will try to test with a newer release tomorrow and report back if there is any change to the performance.

grobian commented 2 years ago

OK, I'll wait for that. If I have some cycles before then, I'll see if I can prepare something anyway.

grobian commented 7 months ago

any news here? :)

percygrunwald commented 7 months ago

@grobian sorry, we didn't check with a newer version, and I can't remember why. We have since added a number of additional storage servers, which decreased the number of metrics per back end to a level that isn't a concern any more. We are trying to deprecate graphite, so I don't know if there will be an upgrade in the future to compare with. Thank you for your responses, and sorry we couldn't give you any additional data. Happy if you want to close this issue.

grobian commented 7 months ago

Thanks for coming back on this. I guess graphite is on the way out in more places.