vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector drops large # of UDP packets in statsd source without warning #15583

Open derekhuizhang opened 1 year ago

derekhuizhang commented 1 year ago

Problem

For testing, I'm running Vector in the sandbox namespace as a Deployment using the Helm chart's stateless-aggregator mode; the config is shown below.

I also run a firehose Service (kubectl apply -f <file_name> with the following manifest saved as a file):

apiVersion: v1
kind: Service
metadata:
  name: firehose
  namespace: sandbox
spec:
  selector:
    app.kubernetes.io/instance: vector-agent
    app.kubernetes.io/name: vector
  type: ClusterIP
  ports:
  - protocol: UDP
    port: 8126
    targetPort: 8126

Then I created a new pod (kubectl run firehose -it --image=debian -n sandbox), copied the statsd-firehose binary onto it, and ran it to generate sample metrics: https://github.com/derekhuizhang/statsd-firehose

./statsd-firehose -countcount 10 -distcount 10 -gaugecount 10 -statsd firehose:8126 (this sends 10 counts, distributions, and gauges every second to the service)

Then I exec'd into the Vector pod and found that if we increase the counts/dists/gauges past a certain level, we get a very high # of UDP packet drops:

/ # cat /proc/1/net/udp
   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
 7523: 00000000:1FBE 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 214256796 2 0000000000000000 305796

If we do ./statsd-firehose -countcount 10 -distcount 10 -gaugecount 10 -statsd firehose:8126, the # of drops stays at 0.
If we do ./statsd-firehose -countcount 20000 -distcount 10 -gaugecount 10 -statsd firehose:8126, the # of drops increases by ~100 every second.
If we do ./statsd-firehose -countcount 100000 -distcount 10 -gaugecount 10 -statsd firehose:8126, the # of drops increases rapidly every second.
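For reference, a quick way to watch that counter from inside the pod is to sum the last column of /proc/1/net/udp (the drops field) once per second, e.g.:

while true; do awk 'NR > 1 { drops += $NF } END { print drops }' /proc/1/net/udp; sleep 1; done

netstat -u -s should show the same trend via the UDP receive buffer / packet receive error counters.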

The high # of drops indicates that lots of metrics are being dropped. We ran this test because we noticed a large # of metrics being dropped in our production environments, so we're pretty sure these metrics just aren't being processed by Vector. We've tried this with different sinks and transforms, so this isn't a sink/transform issue.

There is plenty of memory and CPU available to the pod, so we don't think there's any backpressure happening.

These dropped UDP packets are not surfaced in Vector's internal metrics, so there's no way to tell this is happening except by exec'ing into the Vector pod.

We also noticed that if we run nc -ulp 8126 > /dev/null 2>&1 instead of Vector and run ./statsd-firehose -countcount 100000 -distcount 10 -gaugecount 10 -statsd firehose:8126, we don't see the massive # of UDP packet drops in netstat -u -s (which we do see when running with Vector), so we don't think it's an inherent OS limitation, but we're happy to be proven wrong.

Is there anything that we can do to stop Vector from dropping these UDP packets?

Configuration

api:
  address: 0.0.0.0:8686
  enabled: true
  playground: false
data_dir: /var/lib/vector
expire_metrics_secs: 120
sinks:
  blackhole:
    type: blackhole
    inputs:
    - statsd_metrics
sources:
  statsd_metrics:
    address: 0.0.0.0:8126
    mode: udp
    type: statsd

Version

0.26.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 1 year ago

Hi @derekhuizhang!

Thanks for this report.

This seems likely to be because Vector cannot process the incoming UDP packets fast enough. Unfortunately Vector also can't know that since the packets are dropped by the OS.

We have a couple of issues to improve performance of UDP sources:

Also relevant:

I think this issue is covered by the others, so I'll close it as a duplicate, but feel free to subscribe to those issues and add additional details there!

There was some discussion around this in Discord.

jszwedko commented 1 year ago

Actually, I see you weren't able to observe dropped packets. I'll re-open this for investigation.

derekhuizhang commented 1 year ago

Are there any available workarounds?

Why can't Vector know how many packets were dropped by the OS? cat /proc/1/net/udp shows me how many packets were dropped.

jszwedko commented 1 year ago

I revise my previous statement 😅 Apologies, I've been scattered this morning. Reading your issue again, it sounds like you are observing packet drops at high volume; I misread the bit about using nc. This seems to me to be covered by the issues I linked above to improve Vector's performance around UDP sources. Vector will never be as fast as just using nc to receive and drop packets, but it can definitely be better.

The simplest workaround is to horizontally scale Vector, either by running additional instances or by running multiple statsd sources in a single instance, on different ports.
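For example, a single instance with two statsd sources on different ports could look something like this (an untested sketch based on your config above; the second port is just a placeholder):

sources:
  statsd_metrics_a:
    type: statsd
    mode: udp
    address: 0.0.0.0:8126
  statsd_metrics_b:
    type: statsd
    mode: udp
    address: 0.0.0.0:8127
sinks:
  blackhole:
    type: blackhole
    inputs:
    - statsd_metrics_a
    - statsd_metrics_b

Each source listens on its own UDP socket, so the senders would need to spread their traffic across the ports.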

Why can't Vector know how many packets were dropped by the OS? cat /proc/1/net/udp shows me how many packets were dropped.

That's a fair point. I meant the source couldn't know as part of normal processing since the packets are dropped before Vector has seen them. Vector could certainly read it from /proc though. I created https://github.com/vectordotdev/vector/issues/15586 to track that.

derekhuizhang commented 1 year ago

  1. Is there a way to add additional horizontal Vector pods easily using the helm chart, either in stateless-aggregator or aggregator modes? I see the HAProxy configs but I've never worked with them before
  2. Do you think increasing the # of statsd sources via different UDP endpoints would increase throughput, vs having them all through the single port in one source like in my config?
  3. The main issue is that we don't know when the throughput will exceed the permissible level for one UDP port. I think a better idea would be to have another service in front of Vector that can take in all the UDP traffic and convert it to a different format like TCP, which has higher throughput into the statsd source. A message queue like RabbitMQ comes to mind. Do you think this makes sense? Is there a better source besides statsd that can easily handle higher throughput?

jszwedko commented 1 year ago

  1. Is there a way to add additional horizontal Vector pods easily using the helm chart, either in stateless-aggregator or aggregator modes? I see the HAProxy configs but I've never worked with them before

You can scale up the number of replicas. You will want to put a UDP load balancer in front. The helm chart has an HAProxy image included, but we generally recommend bringing your own load balancer (e.g. if you are in AWS, using an NLB). The HAProxy configuration is provided as a sort of "quick start".
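As a rough sketch (I'm going from memory here, so double-check the chart's values file for the exact keys), scaling the stateless-aggregator would look something like:

role: Stateless-Aggregator
replicas: 3

with a UDP-capable load balancer (e.g. an AWS NLB) in front of the resulting pods on port 8126.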

  2. Do you think increasing the # of statsd sources via different UDP endpoints would increase throughput, vs having them all through the single port in one source like in my config?

Right, you could have multiple statsd sources defined on different ports. The sender will need to balance across them though.

  3. The main issue is that we don't know when the throughput will exceed the permissible level for one UDP port. I think a better idea would be to have another service in front of Vector that can take in all the UDP traffic and convert it to a different format like TCP, which has higher throughput into the statsd source. A message queue like RabbitMQ comes to mind. Do you think this makes sense? Is there a better source besides statsd that can easily handle higher throughput?

Running a sidecar like that is definitely an option. Vector's TCP handling is generally more performant, as it can balance across multiple incoming TCP connections. I would recommend having the sidecar establish multiple TCP connections to Vector (maybe scaling them up or down automatically depending on the queue of messages to be sent?).
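Below is a minimal, untested sketch in Go of what such a UDP-to-TCP relay could look like. It assumes Vector's statsd source is switched to mode: tcp (the 127.0.0.1:8125 address is just a placeholder) and uses a single TCP connection for brevity:

// udp2tcp.go: relay statsd datagrams received over UDP to Vector over TCP.
package main

import (
    "log"
    "net"
)

func main() {
    // Listen on the port the statsd clients already send to.
    udpAddr, err := net.ResolveUDPAddr("udp", ":8126")
    if err != nil {
        log.Fatal(err)
    }
    udpConn, err := net.ListenUDP("udp", udpAddr)
    if err != nil {
        log.Fatal(err)
    }
    // Ask the kernel for a larger socket receive buffer to absorb bursts.
    _ = udpConn.SetReadBuffer(8 << 20)

    // A single TCP connection to Vector's statsd source running in tcp mode.
    tcpConn, err := net.Dial("tcp", "127.0.0.1:8125")
    if err != nil {
        log.Fatal(err)
    }

    buf := make([]byte, 65535)
    for {
        n, _, err := udpConn.ReadFromUDP(buf)
        if err != nil {
            log.Println("udp read:", err)
            continue
        }
        if n == 0 {
            continue
        }
        msg := buf[:n]
        // statsd over TCP is newline-delimited, so terminate each datagram.
        if msg[n-1] != '\n' {
            msg = append(msg[:n:n], '\n')
        }
        if _, err := tcpConn.Write(msg); err != nil {
            log.Fatal("tcp write: ", err)
        }
    }
}

A more production-ready version would pool several TCP connections and reconnect on write errors, which is where the scaling mentioned above comes in.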

Using a completely different medium, e.g. RabbitMQ or NATS, is definitely another option.

derekhuizhang commented 1 year ago

Interesting. I'm curious whether any other sources have been load tested at very high throughput (100k+ metrics/sec).

jszwedko commented 1 year ago

Interesting. I'm curious whether any other sources have been load tested at very high throughput (100k+ metrics/sec).

We have performance tests covering some cases. You can see them here: https://github.com/vectordotdev/vector/tree/master/regression/cases. We measure bytes rather than events, but you should be able to extrapolate. We don't have any UDP-based ones just yet (https://github.com/vectordotdev/vector/issues/12215).