Try to split the metrics from the source over multiple connections to the relay.
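(Not code from the thread, just an illustrative sketch of that suggestion: a producer can open several TCP connections to the relay and spread plaintext metric lines across them round-robin. The host name, port, and the stdin source below are placeholders; error handling and partial writes are ignored for brevity.)

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

#define NCONN 4  /* number of parallel connections to the relay */

/* open one TCP connection to host:port, or return -1 on failure */
static int dial(const char *host, const char *port)
{
    struct addrinfo hints = {0}, *res;
    int fd = -1;

    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

int main(void)
{
    int fds[NCONN];
    char line[8192];
    size_t i, n = 0;

    for (i = 0; i < NCONN; i++)
        fds[i] = dial("relay.example", "2003");  /* placeholder address */

    /* read "metric value timestamp\n" lines from stdin, fan them out */
    while (fgets(line, sizeof(line), stdin) != NULL) {
        int fd = fds[n++ % NCONN];
        if (fd >= 0)
            write(fd, line, strlen(line));
    }

    for (i = 0; i < NCONN; i++)
        if (fds[i] >= 0)
            close(fds[i]);
    return 0;
}
```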
This reminds me of https://github.com/grobian/carbon-c-relay/issues/270, but you are using TCP...
What is the queue size of the relay? What is the number of stalls? What is the CPU usage?
> What is the queue size of the relay?
No matter what the size is, it starts to overflow as soon as the queue gets overloaded; the only difference is the queue size itself: if it is small (the default, for example) it starts overflowing sooner, if it is large (a couple of million) it starts overflowing later.
> What is the number of stalls?
Tried with 4 and 0, no difference. Also, the metric producer does not seem to respect this option.
> What is the CPU usage?
-q 2500000 -b 250000 -w10 -L 4
btw: the gaps in the graphs show metrics dropped by the relay itself
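(For readers following along, a rough decoding of those flags as I understand them; hedged, so check `relay -h` on your version for the exact meanings:)

```
-q 2500000   # per-destination queue size: metrics buffered before drops start
-b 250000    # number of metrics written to a destination in one batch
-w 10        # number of worker/dispatcher threads
-L 4         # maximum number of stalls (back-pressure pauses on the sending
             # client) before the relay drops metrics instead of stalling
```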
Right, so it seems as if the relay cannot send the metrics fast enough. Since your batch is pretty big, I'd say it's purely a write-efficiency problem. I'd like to try something here; would you be able to test the code from a branch or a patch?
> I'd like to try something here; would you be able to test the code from a branch or a patch?
Sure! Thank you.
dunno if this works, but here's the patch https://gist.github.com/grobian/3b85f74f038415cfb90bfcf178a20b54
Sad, but no changes.
Perhaps I shouldn't try to "pull an owl over the globe" and should instead push this bunch of metrics by a short path without the relay, because it's historical data gathered into plain files from a black box over years, and keep the relay only for production metric gathering, which has a much, much lower per-second rate and should fit even the default queue settings.
That indeed looks exactly the same, except there are no gaps in the self-metrics.
One last thing I meant to try is to use fwrite_unlocked() instead of fwrite() in sockwrite, and similarly fflush_unlocked() in sockflush. That should remove most contention on outbound I/O.
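(A minimal sketch of that idea, assuming the stream is only ever written by one thread; the relay's real sockwrite/sockflush are more involved than this. fwrite_unlocked() and fflush_unlocked() are the glibc unlocked-stdio variants, which skip taking the per-stream lock on every call.)

```c
/* Requires _DEFAULT_SOURCE on glibc for the _unlocked declarations. */
#define _DEFAULT_SOURCE
#include <stdio.h>

/* hypothetical single-writer send path, not the relay's actual code:
 * since no other thread touches `sock`, the stream lock is pure overhead */
static int send_batch(FILE *sock, const char *buf, size_t len)
{
    if (fwrite_unlocked(buf, 1, len, sock) != len)
        return -1;                 /* short write: treat as failure */
    return fflush_unlocked(sock);  /* 0 on success, EOF on error */
}
```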
You say the producer is something that just reads from a file, and writes to the relay (or clickhouse)? What is the ingest speed of clickhouse in that scenario? I'm wondering if it's a matter of the relay not stalling enough in this case.
It is exactly the same, except I turned on connected points in Grafana :) In the original it looks like
I think I found the bottleneck and, sad to admit, I misled you earlier by claiming that graphouse can handle around 750 kpoints/sec in my setup. I re-benchmarked the setup (10 threads) and it clearly shows 303880 points/s (the graphs only confirm this). By contrast, carbon-c-relay's ingest speed (4 threads) was, as you can see on the graphs, 500 kpoints/sec on a single socket (awesome!). Anyway, just to answer your question, the full chain is: self-written Go producer -> graphouse -> clickhouse.
@azhiltsov was right in their suggestion. Thanks guys and sorry for my inattention!
So, the relay doesn't stall graphouse enough? Or should it allow you to use parallel connections to graphouse to increase offload?
I think @berghauz is experiencing one of those? @grobian: #315 #216
CPU appears reasonable, but multiple destinations could help here.
Well, I have brought up another two graphouse nodes, three in total, and use this:
```
cluster clickhouse
    any_of
        10.7.8.133:2003 proto tcp
        10.7.8.134:2003 proto tcp
        10.7.8.135:2003 proto tcp
    ;
```
Result is - no drops:
And CPU usage grows a bit on the carbon-c-relay node.
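(For completeness, and as an assumption about the rest of the config rather than something shown in the thread: as I understand the any_of cluster type, metrics are spread over the available members and rerouted to the remaining ones when a member becomes unavailable. A cluster definition like the one above only receives metrics once a match rule routes them to it, for example:)

```
match *
    send to clickhouse
    stop
    ;
```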
that looks much happier :)
Hi, guys! I am facing a problem: when metrics are uploaded at a constant rate (500-750k/sec; the storage backend is graphouse and it is not the bottleneck, because it successfully keeps up with this load without the relay), whatever queue size and batch size are chosen, the queue overflows sooner or later. So the question is: how do I properly configure the relay to sustain such a constant load without dropping points? If it matters: the metrics source is a single process running with 4 threads, and carbon-c-relay is running with all defaults except 10 workers. The config is minimal:
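(The original config is not reproduced above. Purely as an illustration, and not the author's actual file: a "minimal" relay config that forwards everything to a single graphouse endpoint might look roughly like this, with a placeholder address.)

```
# relay.conf -- hypothetical minimal config, placeholder address
cluster graphouse
    forward
        10.0.0.1:2003 proto tcp
    ;
match *
    send to graphouse
    stop
    ;
```

It would then be started with something along the lines of `relay -f relay.conf -w 10` (the binary may be named `relay` or `carbon-c-relay` depending on how it was packaged; this is an assumption, not taken from the thread).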