jjneely / statsrelay

A Golang consistent hashing proxy for statsd
MIT License

Question: statsrelay dropping packets? #25

Open cjagus opened 4 years ago

cjagus commented 4 years ago

I have started testing statsrelay in our environment by using the statsd repeater. One thing I noticed is a difference between the metrics received in Graphite and the metrics seen by the statsd proxy.

[graph image]
statsd repeater -> statsrelay -> statsd -> graphite [currently using one statsrelay and one statsd]

And the difference is huge when we add more statsd backends. The same graphs work fine when I replace statsrelay with statsd [statsd repeater -> statsd -> graphite]. Any thoughts on this @jjneely @szibis?

jjneely commented 4 years ago

What are you measuring here, exactly? Make sure you are counting received metrics and not received packets. Some implementations don't make that distinction, and statsrelay tries to pack UDP packets as much as it can.
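
For example, statsd's line protocol lets a single UDP datagram carry several newline-separated metrics (the names below are made up, just to illustrate):

api.requests.2xx:1|c
api.latency:42|ms
api.inflight:17|g

So if you count datagrams on both sides, you may well see fewer packets leaving statsrelay than arriving, even when no metrics were lost.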

There are, frankly, a lot of ways we could be leaking UDP packets. Remember, UDP doesn't guarantee delivery, and the StatsD design aims to collect a statistically significant sample of the data points rather than accounting for each and every metric end to end.

One of the reasons I wrote this was because the Node implementation of Etsy's StatsD is really quite prone to dropping packets. You might want to look at running an implementation that's, uhh, more robust, like Statsite.

https://github.com/statsite/statsite

Ok, let's figure out where you are dropping packets. Look at /proc/net/udp or /proc/net/udp6 on each of the machines in your setup. You'll see a row for each open UDP port the kernel has set up and is listening on. One column, drops, counts the packets the kernel has dropped because the application (statsd/statsrelay/etc.) wasn't able to read from its receive buffer fast enough to keep up with incoming traffic. That will most likely identify where the leaking is in your stack. Fixing it is then a matter of tuning wherever the leak is.
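
For a quick check, something along these lines will sum that column (on typical kernels drops is the last field in /proc/net/udp; adjust as needed for your system):

# rough sketch: total kernel-side UDP drops across all sockets
awk 'NR > 1 { total += $NF } END { print total }' /proc/net/udp

You can also grep for the listening port in hex (9125 is 23A5) if you only care about one socket.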

cjagus commented 4 years ago

Thanks for your response. In our environment, we have statsd installed on all machines; they aggregate locally and send to the Graphite cluster. Most of the applications are autoscaling, so we don't need per-instance metrics. Currently I'm forwarding a single application's metrics [the app consists of 10-20 EC2 machines] using the statsd repeater, so the throughput is not that high [10k-30k per minute].

So if you check this graph for API 2xx [application metrics]:

[graph image]
statsd repeater -> statsrelay -> statsd -> graphite [currently using one statsrelay and one statsd]
There is a difference in the metrics received via statsrelay versus forwarding directly to graphite.

And if I stop statsrelay and forward directly [statsd repeater -> statsd -> graphite], I don't see this drift in the graph. I also don't see any drops in /proc/net/udp or /proc/net/udp6 [had increased the sysctls before]. @jjneely

jjneely commented 4 years ago

I'd agree that the traffic you have here should be low enough to work even in un-tuned environments.
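
(For reference, when tuning does become necessary, the usual knobs are the kernel's UDP receive buffer sysctls, something along these lines; the values here are only an illustration:)

net.core.rmem_max = 26214400
net.core.rmem_default = 26214400
net.core.netdev_max_backlog = 2000

But at 10k-30k metrics per minute you shouldn't need any of that.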

What expressions are you graphing in the Grafana graphs?

How are you running Statsrelay? What's the script, arguments, options, etc that you are giving Statsrelay?

cjagus commented 4 years ago

Graphite expressions are pretty basic

E.g.: alias(sumSeries(app.webapp.*.timers.apigateway.people__client.store.count), 'graphite')

statsrelay startup script [/etc/init/statsrelay.conf]:

description "Statsrelay"
start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [!12345]

limit nofile 1048576 1048576
oom score -1
respawn
respawn limit 10 5

exec /opt/statsd/packages/statsrelay --port 7125 --bind 10.1.10.92 --prefix statsrelay-proxy --sendproto="TCP" \
     127.0.0.1:9125

[tried with the default UDP as well]

statsd config:

{
  address: "0.0.0.0",
  mgmt_address: "0.0.0.0",
  mgmt_port: "9126",
  dumpMessages: false,
  flushInterval: 60000,
  graphitePort: 2003,
  graphiteHost: "graphite",
  port: "9125",
  server: './servers/tcp',
  backends: [ "./backends/graphite" ],
  prefixStats: "statsd_0",
  deleteCounters: true,
  deleteGauges: true,
  deleteIdleStats: true,
  percentThreshold: [90, 99],
  graphite: {
    legacyNamespace: false,
    globalPrefix: "app.statsd.statsd-1"
  }
}

@jjneely

jjneely commented 4 years ago

exec /opt/statsd/packages/statsrelay --port 7125 --bind 10.1.10.92 --prefix statsrelay-proxy --sendproto="TCP" \
     127.0.0.1:9125

StatsD binds to 0.0.0.0, but you are binding statsrelay to a specific IP address. I'm wondering if you are perhaps missing packets from a local version of the application here?
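
If that's a possibility, an easy test is to bind statsrelay to all interfaces as well, e.g. something like (rest of the init script unchanged):

exec /opt/statsd/packages/statsrelay --port 7125 --bind 0.0.0.0 --prefix statsrelay-proxy --sendproto="TCP" \
     127.0.0.1:9125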

alias(sumSeries(app.webapp.*.timers.apigateway.people__client.store.count), 'graphite')

What would be helpful is to look at the metrics reported by StatsD and Statsrelay themselves and see whether the daemons are seeing the same number of metrics. That will give us a better idea of where the leaking is happening. StatsRelay emits a counter, statsrelay.statsProcessed, which reports how many statsd metrics/samples it is receiving.

Likewise, StatsD has a similar counter that it generates internally and emits, counting the number of metrics it has seen. (And I forget what the metric name is, it's been so long since I've used Etsy's StatsD.)

These counters over time would be what I would compare to fully understand where the leak is.
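
Once both daemons are reporting into Graphite, graphing the two counters side by side would look something like the following (the exact paths depend on your prefixes, so treat these as placeholders):

alias(sumSeries(<relay-prefix>.statsProcessed.count), 'relay')
alias(sumSeries(<statsd-prefix>.metrics_received.count), 'statsd')

diffSeries() or asPercent() over those two series will then show the loss rate directly.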

cjagus commented 4 years ago

Attaching: [graph image]

statsProcessed.count vs statsd.metrics_received.count

Also made changes to statsrelay to bind on 0.0.0.0.

jjneely commented 4 years ago

CJ,

Those numbers suggest that you are dropping 0.4% of packets, which is a LOT better than the previous numbers suggesting around a 10% drop. My usual goal in a very high-throughput StatsD setup was to keep UDP and metric drops below 1%.
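
(For clarity, that's roughly (statsProcessed − metrics_received) / statsProcessed × 100 over the window in your graph.)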

Have you tried running StatsRelay in verbose mode to see if it is dropping statsd metrics that do not parse correctly?