grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

How to properly choose queue and batch #338

Closed berghauz closed 6 years ago

berghauz commented 6 years ago

Hi, guys! I'm facing a problem: with metrics uploaded at a steady rate (500-750k/sec; the storage backend is graphouse and it is not the bottleneck, since it handles this load fine without the relay), the queue overflows sooner or later no matter which queue size and batch size I choose. So the question is: how do I configure the relay properly to sustain such a constant load without dropping points? In case it matters, the metrics source is a single process running 4 threads, and carbon-c-relay runs with all defaults except 10 workers. The config is minimal:

cluster clickhouse
    forward
    10.7.8.133:2003 proto tcp
    ;

statistics
    submit every 10 seconds
    prefix with apps.carbon.relays.kyc
    send to clickhouse
    ;

match *
    send to clickhouse
    stop
    ;
azhiltsov commented 6 years ago

Try to split the metrics from the source over multiple connections to the relay.
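
For illustration, a minimal sketch of what that could look like on the producer side: a small C program that opens several TCP connections to the relay and spreads plain-text metric lines over them round-robin. This is not code from the repository; the relay address, port, and connection count are placeholders, and error handling is kept to a minimum.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NCONN 4   /* number of parallel connections to the relay (assumption) */

    /* open one TCP connection to the relay */
    static int connect_relay(const char *host, int port)
    {
        struct sockaddr_in sa;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(port);
        inet_pton(AF_INET, host, &sa.sin_addr);
        if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            if (fd >= 0)
                close(fd);
            return -1;
        }
        return fd;
    }

    int main(void)
    {
        int conns[NCONN];
        char line[1024];
        unsigned int i, n = 0;

        for (i = 0; i < NCONN; i++) {
            /* 10.7.8.1:2003 is a placeholder relay address */
            conns[i] = connect_relay("10.7.8.1", 2003);
            if (conns[i] < 0) {
                perror("connect");
                return 1;
            }
        }

        /* read "name value timestamp\n" lines from stdin and spread them
         * round-robin over the open connections; partial writes are ignored
         * to keep the sketch short */
        while (fgets(line, sizeof(line), stdin) != NULL)
            write(conns[n++ % NCONN], line, strlen(line));

        for (i = 0; i < NCONN; i++)
            close(conns[i]);
        return 0;
    }

Whether this helps depends on where the single connection is actually the bottleneck; as it turns out later in this thread, the limiting factor here was the destination rather than the relay.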

Farfaday commented 6 years ago

This reminds me of https://github.com/grobian/carbon-c-relay/issues/270, but you are using TCP...

grobian commented 6 years ago

what is the queuesize of the relay? what is the number of stalls? what is the cpu usage?

berghauz commented 6 years ago

what is the queuesize of the relay?

No matter what the size is, it starts overflowing as soon as the queue fills up; the only difference is how long that takes. If the queue is small (the default, for example) it starts overflowing sooner; if it is large (a couple of million) it starts overflowing later.

what is the number of stalls?

Tried with 4 and 0, no difference. Also, the metric producer does not seem to respect this option.

what is the cpu usage?

-q 2500000 -b 250000 -w10 -L 4

(screenshots of CPU usage and relay statistics)

(screenshot)

By the way: the gaps on the graphs mark metrics dropped by the relay itself.

grobian commented 6 years ago

Right, so it seems as if the relay cannot send the metrics fast enough. Since your batch is pretty big, I'd say that it's purely a write efficiency problem. I'd like to try something in that regard; would you be able to test code from a branch or a patch?

berghauz commented 6 years ago

I'd like to try something in that regard; would you be able to test code from a branch or a patch?

Sure! Thank you.

grobian commented 6 years ago

dunno if this works, but here's the patch https://gist.github.com/grobian/3b85f74f038415cfb90bfcf178a20b54

berghauz commented 6 years ago

Sadly, no change.

(screenshot)

Perhaps I shouldn't try to force this ("pull an owl over the globe", as we say) and should instead push this bunch of metrics over the short path without the relay, since it is historical data gathered into plain files from a black box over the years, and keep the relay only for production metric gathering, which has a much, much lower per-second rate and should fit even the default queue settings.

grobian commented 6 years ago

That indeed looks exactly the same, except there are no gaps in the self-metrics.

One last thing I meant to try is to use fwrite_unlocked() instead of fwrite() in sockwrite, and similarly fflush_unlocked() in sockflush. That should remove most of the contention on outbound IO.
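
As a rough illustration of that idea (not the actual relay code; the function names below merely mirror the ones mentioned above): when a stream is only ever written by one thread, or is already serialized by the caller, the per-call locking done by fwrite()/fflush() can be skipped with the GNU _unlocked variants.

    /* sketch only: sockwrite/sockflush here are stand-ins for the relay's
     * internal helpers, not the real implementation */
    #define _GNU_SOURCE          /* fwrite_unlocked/fflush_unlocked are GNU extensions */
    #include <stdio.h>

    static size_t sockwrite(FILE *strm, const char *buf, size_t len)
    {
        /* same semantics as fwrite(), but without taking the stream lock
         * on every call */
        return fwrite_unlocked(buf, 1, len, strm);
    }

    static int sockflush(FILE *strm)
    {
        /* flush the buffered batch without the per-call lock */
        return fflush_unlocked(strm);
    }

If more than one thread can reach the same stream, the usual pattern is to take the lock once per batch with flockfile()/funlockfile() around a run of _unlocked calls, instead of paying for a lock on every single write.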

You say the producer is something that just reads from a file, and writes to the relay (or clickhouse)? What is the ingest speed of clickhouse in that scenario? I'm wondering if it's a matter of the relay not stalling enough in this case.

berghauz commented 6 years ago

It is exactly the same, except I turned on connected points in Grafana :) In the original it looks like this:

(screenshot)

I think I found the bottleneck, and I'm sad to admit I misled you earlier by claiming that graphouse can handle around 750k points/sec in my setup. I re-benchmarked the setup (10 threads) and it clearly shows 303880 points/s (the graphs only confirm this). By contrast, the carbon-c-relay (4 threads) ingest speed was, as you can see on the graphs, 500k points/sec over a single socket (awesome!). Anyway, just to answer your question, the full chain is: self-written Go producer -> graphouse -> clickhouse.

@azhiltsov's suggestion was right. Thanks, guys, and sorry for my inattention!

grobian commented 6 years ago

So, the relay doesn't stall graphouse enough? Or should it allow you to use parallel connections to graphouse to increase offload?

azhiltsov commented 6 years ago

I think @berghauz is experiencing one of these, @grobian: #315 #216

grobian commented 6 years ago

CPU appears reasonable, but multiple destinations could help here.

berghauz commented 6 years ago

Well, I brought up another two graphouse nodes, three in total, and used this:

cluster clickhouse
    any_of
    10.7.8.133:2003 proto tcp
    10.7.8.134:2003 proto tcp
    10.7.8.135:2003 proto tcp
    ;

The result: no drops.

(screenshot)

And CPU usage grew a bit on the carbon-c-relay node.

(screenshot)

grobian commented 6 years ago

that looks much happier :)