Understanding Slow Connections

grafana / carbon-relay-ng

Fast carbon relay+aggregator with admin interfaces for making changes online - production ready

Other

468 stars 151 forks source link

Understanding Slow Connections #104

Open maxwax opened 8 years ago

maxwax commented 8 years ago

I'm building a Graphite cluster to ingest > 700,000 metrics per minute. It's working fairly well, but right now I'm seeing dropped metrics via 'action_is_drop.reason_is_slow_conn' stats.

What in carbon-relay-ng's perspective is slow? And how can I investigate further?

replicator -> sharder (node a, localhost) -> carbon-cache -> whisper
           -> sharder (node b, ethernet) -> carbon-cache -> whisper

In my architecture, on Node A, a 'replicator' carbon-relay-ng process receives from 500+ hosts and does nothing but forward to another carbon-relay-ng on Node A (via localhost) and an identical carbon-relay-ng on node B (via gigabit ethernet). Secondary carbon-relay-ng processes, 'sharders', then distribute the metrics to 16 carbon-cache processes.

I see 50k metrics dropped on the replicator-to-sharder localhost a-to-a connection and 100k metrics dropped on the replicator-to-sharder ethernet a-to-b connection.

I don't suspect network performance to be an issue, especially considering NodeA to NodeA is on the localhost device.

If I turn off replication from Node A to Node B, I still see issues with the metrics being relayed between the replicator carbon-relay-ng and the sharder carbon-relay-ng on Node A.

Thanks!

Dieterbe commented 8 years ago

see https://github.com/graphite-ng/carbon-relay-ng/blob/master/destination.go#L229-L239 and https://github.com/graphite-ng/carbon-relay-ng/blob/master/conn.go#L172-L237

i.e. when the connection can't take in new bufs on its In channel because the In channel is full and it's busy flushing(or shutting down).

currently the in buffer size is hardcoded to 30k. see https://github.com/graphite-ng/carbon-relay-ng/blob/master/conn.go#L15 if your connection can flush faster than what it takes to receive 30k values, you should be good. you may want to play with the flush period and the buffer size, but i would advise against "just setting the buffer very large" because that only masks/delays issues. are you using the included grafana dashboard? that should help seeing a lot of the perf metrics.

maxwax commented 8 years ago

Just a followup:

I did get the carbon-relay-ng dashboard working and that was helpful, so thanks for reminding me to ensure that resource is in place. I'd like to know more about the dynamics between the 'conn flush size', 'conn metrics in buffer' and 'conn flush durations' metrics shown on the included dash, so I'm hoping to have more time to read the code and experiment in the future.

After trying a variety of other things like switching one node from RAID6 to RAID10, I slowly started walking the 'conn_in_buffer' value from its default of 30k to 60k, then 90k, then 120k, and finally to 200k. Each time I raised it the number of 'dropped metrics due to slow connection' decreased. At 120k, the number dropped was down from 100k to less than twenty. I raised it up to 200k thinking this would be safe as well as give us room to grow or handle bursts of incoming metrics.

I've now got two nodes processing a very large number of metrics: We're averaging about 892k per minute. Recently I shutdown the carbon-caches and left the carbon-relay-ng's up and running, spooling metrics to disk. When I restarted the carbon-caches, for a brief minute I saw 7 million metrics hitting the caches via 16 caches doing about 500k per process.

This has made me very impressed with the combination of carbon-relay-ng and carbon-caches, and I really appreciate your work and support!

amallem commented 8 years ago

Hi, I am trying to setup a graphite architecture using the carbon-relay-ng as my relay. There is an haproxy which sends load to 4 relays and each relay hash the metrics to 5 graphite instances(1 ng-relay + 2 carbon-caches) running. The load on the haproxy is 2 mil metrics per minute and I see that my top level relays are dropping metrics with the reason "reason_is_slow_conn". I also bumped up the hard-coded buffer size from 30k to 300k and I still see it happening. Is there any other configuration I am missing which will reduce the drops? The current drops shown by the graph are nearly 1 Mil sometimes. Thanks.

My logs show the following statement : conn.Read returned EOF -> conn is closed. closing conn explicitly Please let me know if it helps.

Dieterbe commented 8 years ago

sorry, not much i can do to help right now (super busy with work). see my explanation above. perhaps something's wrong with the code. maybe somebody else can have a closer look

maxwax commented 8 years ago

@amallem Can you clarify your comments with more details?

What exactly is your 'top level relays?
What kind of hardware are your running on?
Describe how software is matched with hardware -- all one one box doing all this, or separated on many boxes.. if so how?

My production system has been working very well for many months regularly taking in a lot of metrics. Right now it looks the front-end carbon-relay-ng ("replicator") process in my nodes are taking in about about 1.79 mil metrics on one node in a very bursty fashion and about 800k metrics on the second node in a much more sustained fashion.

The replicators simply duplicate the metric stream to a second set of carbon-relay-ng processes on each node and there the sharder reports about 2 mil incoming metrics. At first, it sounds like it should be more given the replicator numbers, but the burstyness of carbon-relay-ng can be a little challenging to figure out.

Anyway, with sixteen carbon-cache daemons processing between 90k and 130k on each node, the sustained rate of incoming metrics to the caches is about 2mil per minute.

So thats 2 mil metrics incoming on just two nodes. Keep that in mind as you work your situation and know this software is very good and capable of handling a lot.

My carbon-relay-ngs are set to 200k and that seems to be a sweet spot for us.

One thing I might have changed and not written about: I've set the GOLANG MAXPROCS value to 'max_procs = 8' in the carbon-relay-ng config files. If I'm right this lets each carbon-relay-ng process (one 'replicator' and one 'sharder' on each physical node) use up to 8 OS threads. You may want to include this parameter in your experimentation to see if that gives you better or worse results.

My hardware is Oracle/Sun X4-2 nodes with 2x 8 real core/16 logical Intel Xeon processors, 256G of memory. One node has a fast FusionIO 1.2TB PCIe SSD card, and the other is using a 24 SAS HD drive RAID 10 array with an LSI RAID controller.

The complexity of our setups is why I'd like to know more about your setup details and your hardware.

I'm also on vacation right now, so please be aware any help I can provide you may delayed.

Good luck!

amallem commented 8 years ago

Thanks for the quick response guys....here are your requested details.

## Architecture I have 1 haproxy talking to 4 top level relays in a round-robin fashion. Each machine here runs one carbon-relay-ng. Now each of these relays perform a consistent hash across 4 other graphite instances. Each graphite instance is one ng-relay sending metrics to 2 underlying carbon caches. SO essentially its haproxy at layer 1, 4 top level relays at layer 2 and 5 graphite instances(with ng-relay) at layer 3. All the machines are VM's with 4VCPU's and 16GB RAM. I do not have the exact configuration of the underlying hypervisor.

## Software Config

Top Level Relays : I have set buffer limit to 300k per destination. max_procs is 2. spooling enabled. consistent hashing enabled. metrics validation disabled.

Graphite instances (ng-Relay + 2 carbon caches) : ng-Relay is configured same as the top level relays with consistent hashing across the 2 carbon-caches . Each carbon-cache has MAX_CREATES_MINUTE = 5000, UPDATES_PER_SEC = 700.

I am sending 2 mil metrics per minute to the haproxy at the rate of 200k different metrics every 5 secs.

## Problems Faced

As soon as I start sending metrics I see the buffer pool per destination immediately filling(300k) in some of the top level relays and metrics starts dropping with "reason_is_slow_conn". Even when the buffer pool cools down later to less than 50k the metrics are continuously dropped.
After some time i observe that one of the destination for one of the top level relays is always having the buffer pool full at 300k.

I did think of adding more carbon caches in my graphite instances but since the drops are happening at the top level relays I dont see a point as to how they would help. I shall try your suggestion of increasing the max_procs value to 3 (I only have 4 VCPU to experiment with) and see if that improves/worsens anything. Also can you tell me if you have overrided the default flush interval in your setup? Please feel free to suggest any variations in my config or where I am going wrong. I have spent a lot of time with this and I dont have any more ideas to tune the config. Thanks for your help in advance.

maxwax commented 8 years ago

Sorry for the delay, I'm home vacation. (Fantastic, but exhausting with so much activity). I hope you've had some luck while I was away, but in case you haven't..

Some comments and questions to check my reading and also trigger some thoughts with you.

* \ I don't think I've changed the flush interval. Dieter's comments to me suggested that increasing the buffer size to crazy high numbers aren't a good way of approaching problems. I think I found that, for me, > 200k buffer sizes was not better and perhaps worse than 200k.**

* \ The problem you're seeing is between the top-level (layer 2) relays and the bottom-level relays (layer 3) relays right?

The layer 3 relays aren't reporting problems pushing any metrics they receive to the carbon-cache daemons?
You're using carbon-relay-ng everywhere, not just at layer 3, right?

* \ The HAProxy front-end is doing round-robin to the top-level carbon-relay-ngs and there are four top-level (layer 1) relays.. Does this means that each top level relay is handling 1/4 of the metrics?**

It would probably be good to look at the self-reported stats for carbon-relay-ng top-level instances and see what your load across VMs is in reality.
If HAProxy is doing round robin based on socket connections, you might end up with some metric publishers on one socket pushing through more metrics than a publisher on another client's socket?
The key here, for me, is to ensure that four top-level relays are handling 1/4 of the load.. or, find out how much more they are handling in worst case situations.

* * I think you should try setting max_procs to 4 on the top-level (layer 1) VMs. If those relay VMs' primary and only job is relaying metrics, don't be afraid to give it all the CPU resources you've got.*

Four vCPUS may not be enough for your workload so recording observations with max_procs at 2, 3, and 4 would be useful.
Can you, at least temporarily, get access to a larger VM and experiment with scaling up? See how well 6 and 8 CPUs would handle your load?

* \ Are all these VMs on the same host hardware, or multiple hardware instances?**

In my situation, my top-level relay's are sending to other relays

* \ Does your hypervisor environment enforce any rate limiting on the virtual ethernet interfaces? Can you verify the speed between top-level relays and level 3 relays with something like iperf?**

https://linuxaria.com/article/tool-command-line-bandwidth-linux

Looking forward to hearing from you.

maxwax commented 8 years ago

Just for reference, my carbon-cache settings. Only showing the parameters you mentioned above:

MAX_UPDATES_PER_SECOND = 800 MAX_CREATES_PER_MINUTE = inf

plus this might be worth trying?

USE_FLOW_CONTROL = True

and some others

WHISPER_AUTOFLUSH = False LOG_UPDATES = False

amallem commented 8 years ago

Thank you Max those were great insights. I did some diggings on the same lines you have suggested while u were on vacation and I think too that the problem might be in the way the load balancer is working. My layer 2 and layer 3 relays essentially face a similar load and the layer 3 works fine while the layer 2 is always screaming. Theoretically each of the layer 2 relay receive a load of 500k per minute while the layer 3 relay receives 400k per minute. So this made me start looking into the load balancer distribution. I dont think the VM's being on the same hypervisor is the problem atleast right now as the drops are happening at the layer 2 relays and not near the carbon caches. Having said that I agree, this is one thing to always keep a tab on. To have more control and visibility over the data flowing and simulate a close to production environment I have decided to deploy this architecture on the pre-production environment and perform a test run with a sample load. The script which currently simulates the load for me on the test environment does it in a very bursty fashion (200k per thread with 10 threads at an interval of 5 secs) and I strongly believe this is where the problem is happening as the load balancer must be sending the whole burst to a single relay instead of spreading the burst across all the relays. Thank you for all your suggestions and I shall soon update you with the results I achieve on my new setup.