grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Aggregator is too costly #173

Closed liyichao closed 6 months ago

liyichao commented 8 years ago

It seems that one aggregate rule starts an aggregator thread. Should this be done more efficiently?

grobian commented 8 years ago

what is the cost of one thread?

liyichao commented 8 years ago

We suspect it causes too many context switches. On our server there is only the statsd server and carbon-c-relay, but dstat shows:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 13   5  73   1   0   7| 166B  160k|   0     0 |   0     0 |  19k  273k
  8   5  85   1   0   2|   0   264k|  83M 3288k|   0     0 | 127k  241k
  6   5  87   0   0   2|   0    96k|  80M 4573k|   0     0 | 129k  248k

The csw value is much higher than on our other servers. The context switching is not causing any problems at the moment; it just seems unexpected that one rule starts a whole thread.

How about this thread model:

  1. (frontend) one thread to accept TCP metrics (epoll + nonblocking); it reads in chunks of, say, 64 KB, parses them, and puts all the metrics into a queue A for rule processing
  2. (frontend) one thread for UDP, which just receives packets in a loop
  3. (worker) one thread for all the match, rewrite and aggregate logic; it takes metrics from the frontend's output queue (queue A), runs all the rules and sends the results to the backend queues (see the sketch of the queue A hand-off after this list)
  4. (backend) one thread for each cluster that has metrics routed to it (epoll + nonblocking to multiplex the backend servers)
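
To make the hand-off concrete, here is a minimal sketch of queue A as a mutex-protected ring buffer between the frontend and worker threads. It is only an illustration of the proposed model; metricq_t, metricq_push and metricq_pop are made-up names, not carbon-c-relay internals.

```c
#include <pthread.h>
#include <stdlib.h>

#define QSIZE 4096

/* queue A: parsed metric lines handed from the frontend to the worker */
typedef struct {
	char *slots[QSIZE];
	size_t head, tail, len;
	pthread_mutex_t lock;
	pthread_cond_t nonempty;
	pthread_cond_t nonfull;
} metricq_t;

static metricq_t queue_a = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.nonempty = PTHREAD_COND_INITIALIZER,
	.nonfull = PTHREAD_COND_INITIALIZER,
};

/* frontend side: block when the worker cannot keep up */
static void metricq_push(metricq_t *q, char *metric)
{
	pthread_mutex_lock(&q->lock);
	while (q->len == QSIZE)
		pthread_cond_wait(&q->nonfull, &q->lock);
	q->slots[q->tail] = metric;
	q->tail = (q->tail + 1) % QSIZE;
	q->len++;
	pthread_cond_signal(&q->nonempty);
	pthread_mutex_unlock(&q->lock);
}

/* worker side: block until the frontend has produced something */
static char *metricq_pop(metricq_t *q)
{
	char *metric;

	pthread_mutex_lock(&q->lock);
	while (q->len == 0)
		pthread_cond_wait(&q->nonempty, &q->lock);
	metric = q->slots[q->head];
	q->head = (q->head + 1) % QSIZE;
	q->len--;
	pthread_cond_signal(&q->nonfull);
	pthread_mutex_unlock(&q->lock);
	return metric;
}

/* thread 3 in the list above: match, rewrite, aggregate, then forward;
 * started e.g. with pthread_create(&tid, NULL, worker, &queue_a) */
static void *worker(void *arg)
{
	metricq_t *qa = arg;

	for (;;) {
		char *metric = metricq_pop(qa);
		/* ... apply match/rewrite/aggregate rules, push to backend queues ... */
		free(metric);
	}
	return NULL;
}
```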

Currently, thread 3 may become a bottleneck if there are too many rules, for example the aggregations in https://github.com/grobian/carbon-c-relay/issues/168. We could solve that like this:

The aggregator would also cache the metric name -> aggregation rule mapping, so each metric name only needs to be matched once.
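
A rough sketch of such a cache, assuming a simple hash table from metric name to the rule it matched; rule_t, match_rules() and lookup_rule() are placeholder names for illustration, not the relay's actual code. In a threaded relay the table would need to be per-worker or protected by a lock.

```c
#include <stdlib.h>
#include <string.h>

typedef struct rule rule_t;               /* stands in for a parsed rule */

/* placeholder for the existing, expensive regex walk over all rules */
static rule_t *match_rules(const char *metric) { (void)metric; return NULL; }

#define CACHE_SLOTS 65536

typedef struct {
	char *name;     /* metric name, owned by the cache */
	rule_t *rule;   /* cached match result, may be NULL (no rule matched) */
} cache_entry_t;

static cache_entry_t cache[CACHE_SLOTS];

static size_t hash_name(const char *s)
{
	size_t h = 5381;                      /* djb2 */
	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h % CACHE_SLOTS;
}

/* return the rule for a metric name, running the regexes only on a miss */
static rule_t *lookup_rule(const char *name)
{
	cache_entry_t *e = &cache[hash_name(name)];

	if (e->name != NULL && strcmp(e->name, name) == 0)
		return e->rule;                   /* hit: no regex work at all */

	free(e->name);                        /* miss or collision: recompute */
	e->name = strdup(name);
	e->rule = match_rules(name);
	return e->rule;
}
```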

grobian commented 8 years ago

Does csw include thread switches? A thread is not a process.

Your thread model is close to how the relay is currently implemented with -w1. The aggregator thread is necessary to "expire" the metrics; since expiry is unrelated to the input, it runs as a separate thread.

Sharing the aggregator work is very hard, because the load from aggregations doesn't come from having multiple aggregation rules, but from the thousands or more expansions (computes) of a single aggregation rule.

liyichao commented 8 years ago

If the regex matching result is cached, the aggregator computation itself is lightweight; all that remains is +-*/, so what is the load of thousands of +-*/ operations?
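
For scale, the per-sample arithmetic referred to here is roughly an update of a running bucket, along these lines (the struct layout is illustrative only, not the relay's actual one):

```c
/* once a metric has been routed to its aggregation bucket, adding a
 * sample is a handful of comparisons and additions; a division only
 * happens at expiry time (e.g. for the average) */
typedef struct {
	double sum;
	double min, max;
	unsigned long count;
} agg_bucket_t;

static void bucket_add(agg_bucket_t *b, double value)
{
	if (b->count == 0 || value < b->min)
		b->min = value;
	if (b->count == 0 || value > b->max)
		b->max = value;
	b->sum += value;
	b->count++;
}

/* at expiry: average = b->sum / b->count (when count > 0) */
```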

liyichao commented 8 years ago

Having every rule start a thread just seems scary. When there are many rules, the communication and synchronization will become the bottleneck, because every metric has to be sent to all aggregators and the aggregators' results have to be sent back to the frontend.

Currently the bottleneck may be the regex matching; I guess the reason for having many aggregators is to spread the matching work across many CPUs. But that is not needed: if we cache the target bucket of every metric name, the work is only done the first time a new metric name arrives. The problem may simply be that carbon-c-relay matches the regexes every time; besides that, I cannot think of any operation costly enough to need more than one CPU.

The CPU time should only go to input protocol parsing, which cannot be reduced anyway. All regex matching results can be cached, including the results of rewrite and match rules.

grobian commented 8 years ago

I think there is a misunderstanding. The relay has a static number of threads: main + workers + servers + submission-server + aggregator. So no matter how many aggregations you have, there is only one expiry/aggregator thread. In my opinion that thread should go and the workers should do the job, but for that, expiry must be doable in parallel. Due to the way aggregations currently work, and the perfectly logical imbalance that happens there, the workers very quickly suffer from lock contention when they deal with aggregations (because the most popular case is one aggregation rule resulting in 10K+ individual aggregations).
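
To illustrate the kind of parallel expiry hinted at here (purely a sketch of the idea, not how carbon-c-relay is implemented), the 10K+ buckets produced by one rule could be striped over several locks, so workers adding samples and expiry passes only contend when they touch the same stripe:

```c
#include <pthread.h>

#define NSTRIPES 64

typedef struct bucket bucket_t;   /* one of the 10K+ computed aggregations */

static pthread_mutex_t stripe_lock[NSTRIPES];
static bucket_t *stripe_buckets[NSTRIPES];   /* bucket lists, one per stripe */

static void stripes_init(void)
{
	int i;
	for (i = 0; i < NSTRIPES; i++)
		pthread_mutex_init(&stripe_lock[i], NULL);
}

static unsigned int stripe_of(const char *bucket_key)
{
	unsigned int h = 2166136261u;             /* FNV-1a */
	while (*bucket_key)
		h = (h ^ (unsigned char)*bucket_key++) * 16777619u;
	return h % NSTRIPES;
}

/* worker side: only the stripe owning this bucket is locked */
static void bucket_update(const char *bucket_key, double value)
{
	unsigned int s = stripe_of(bucket_key);

	pthread_mutex_lock(&stripe_lock[s]);
	/* ... find/create the bucket in stripe_buckets[s], add value ... */
	(void)value;
	pthread_mutex_unlock(&stripe_lock[s]);
}

/* expiry side: stripes can be walked independently, possibly by several
 * workers, instead of serialising on one global aggregator lock */
static void expire_stripe(unsigned int s, long long now)
{
	pthread_mutex_lock(&stripe_lock[s]);
	/* ... emit and reset the expired buckets in stripe_buckets[s] ... */
	(void)now;
	pthread_mutex_unlock(&stripe_lock[s]);
}
```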