derekhuizhang opened this issue 1 year ago
Hi @derekhuizhang!
Thanks for this report.
This seems likely to be because Vector cannot process the incoming UDP packets fast enough. Unfortunately Vector also can't know that since the packets are dropped by the OS.
We have a couple of issues to improve performance of UDP sources:
Also relevant:
I think this issue is covered by the others, so I'll close it as a duplicate, but feel free to subscribe and add additional details to those issues!
There was some discussion around this in Discord.
Actually, I see you weren't able to observe dropped packets. I'll re-open this for investigation.
Are there any available workarounds?
Why can't Vector know how many packets were dropped by the OS? cat /proc/1/net/udp shows me how many packets were dropped.
I revise my previous statement 😅 Apologies, I've been scattered this morning. Reading your issue again, it sounds like you are observing packet drops at high volume; I misread the bit about using nc. This seems to me to be covered by the issues I linked above to improve Vector's performance around UDP sources. Vector will never be as fast as just using nc to receive and drop packets, but it can definitely be better.
The simplest workaround is to horizontally scale Vector, either by running additional instances or by running multiple statsd sources in a single instance, on different ports.
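As a rough sketch of the second option, something like this in the helm chart's customConfig (YAML; the ports and the console sink are placeholders, swap in your real pipeline):

```yaml
sources:
  statsd_a:
    type: statsd
    mode: udp
    address: 0.0.0.0:8126
  statsd_b:
    type: statsd
    mode: udp
    address: 0.0.0.0:8127   # second port; the sender needs to spread load across both

sinks:
  # placeholder sink so the sketch is self-contained
  debug_out:
    type: console
    inputs: [statsd_a, statsd_b]
    encoding:
      codec: json
```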
Why can't Vector know how many packets were dropped by the OS? cat /proc/1/net/udp shows me how many packets were dropped.
That's a fair point. I meant the source couldn't know as part of normal processing, since the packets are dropped before Vector has seen them. Vector could certainly read it from /proc though. I created https://github.com/vectordotdev/vector/issues/15586 to track that.
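In the meantime you can pull the counter out yourself. On mainline Linux kernels the last column of /proc/net/udp is the per-socket drop count, so something like this works (a rough sketch assuming that column layout; IPv6 sockets live in /proc/net/udp6):

```sh
# Sum the per-socket "drops" column (last field) across all UDP sockets
# visible in PID 1's network namespace, skipping the header row.
awk 'NR > 1 { drops += $NF } END { print drops }' /proc/1/net/udp
```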
- Is there a way to add additional horizontal Vector pods easily using the helm chart, either in stateless-aggregator or aggregator modes? I see the HAProxy configs but I've never worked with them before
You can scale up the number of replicas. You will want to put a UDP load balancer in front. The helm chart has an HAProxy image included, but we generally recommend bringing your own load balancer (e.g. if you are in AWS, using an NLB). The HAProxy configuration is provided as a sort of "quick start".
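As a sketch of the values involved (key names as I recall them from the chart; double-check against the values.yaml shipped with your chart version):

```yaml
# Aggregator values.yaml -- illustrative only
role: Aggregator
replicas: 3        # horizontal scale-out

haproxy:
  enabled: true    # bundled HAProxy "quick start" load balancer;
                   # in production prefer your own LB (e.g. an AWS NLB)
```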
- Do you think increasing the # of statsd sources via different UDP endpoints would increase throughput, vs having them all through the single port in one source like in my config?
Right, you could have multiple statsd sources defined on different ports. The sender will need to balance across them, though.
- The main issue is that we don't know when the throughput will exceed the permissible level for one UDP port. I think a better idea would be to have another service in front of Vector that can take in all the UDP traffic and convert it to a different format like TCP, which has higher throughput into the statsd source. A message queue like RabbitMQ comes to mind. Do you think this makes sense? Is there a better source besides statsd that can easily handle higher throughput?
Running such a sidecar is definitely an option. Vector's TCP handling is generally more performant, as it can balance across multiple incoming TCP connections. I would recommend that the sidecar establish multiple TCP connections to Vector (maybe scaling that number up or down automatically depending on the queue of messages to be sent?).
Using a completely different medium, e.g. RabbitMQ or NATS, is definitely another option.
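If you do go the sidecar route, the Vector side is just the statsd source in TCP mode, e.g. (sketch only; the port is a placeholder, and this drops into your existing config next to your sinks):

```yaml
sources:
  statsd_tcp:
    type: statsd
    mode: tcp                # Vector balances across multiple incoming TCP connections
    address: 0.0.0.0:8125
```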
Interesting. I'm curious whether there are any other sources that have been load tested at very high throughput (100k+ metrics/sec).
We have performance tests covering some cases. You can see them here: https://github.com/vectordotdev/vector/tree/master/regression/cases. We measure bytes rather than events, but you should be able to extrapolate. We don't have any UDP-based ones just yet (https://github.com/vectordotdev/vector/issues/12215).
Problem
For testing, I'm running Vector in the sandbox namespace with the stateless-aggregator mode helm chart as a Deployment, config shown below.

I also run a firehose service (kubectl apply -f <file_name> with this config saved as a file). Then I created a new pod (kubectl run firehose -it --image=debian -n sandbox), and cp'ed and ran statsd-firehose on it to generate sample metrics: https://github.com/derekhuizhang/statsd-firehose

./statsd-firehose -countcount 10 -distcount 10 -gaugecount 10 -statsd firehose:8126

(this sends 10 counts, dists, and gauges every second to the service)

Then I exec'd into the Vector pod and found that if we increase the counts/dists/gauges past a certain level, we get a very high # of UDP packet drops.
If we do ./statsd-firehose -countcount 10 -distcount 10 -gaugecount 10 -statsd firehose:8126, for instance, the # of drops stays at 0.
If we do ./statsd-firehose -countcount 20000 -distcount 10 -gaugecount 10 -statsd firehose:8126, for instance, the # of drops increases by ~100 every second.
If we do ./statsd-firehose -countcount 100000 -distcount 10 -gaugecount 10 -statsd firehose:8126, the # of drops increases rapidly every second.

The high # of drops indicates lots of metrics are being dropped. We ran this test because we noticed a large # of metrics were being dropped in our prod envs, so we're pretty sure these metrics just aren't being processed by Vector. We've tried this with different sinks and transforms, so this isn't a sink/transform issue.
There is plenty of memory and CPU available to the pod, so we don't think there's any backpressure happening.
These dropped UDP packets are not surfaced in Vector's internal metrics, so there's no way to tell this is happening except by exec-ing into the Vector pod.
We also noticed that if we run nc -ulp 8126 > /dev/null 2>&1 and then run ./statsd-firehose -countcount 100000 -distcount 10 -gaugecount 10 -statsd firehose:8126, we don't see the massive # of UDP packet drops in netstat -u -s (which we do see when running with Vector), so we don't think it's an inherent OS limitation, but happy to be proven wrong.

Is there anything that we can do to stop Vector from dropping these UDP packets?
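For reference, the exact checks we run (assuming net-tools is available in the containers; the same drop counters can also be read from /proc/net/udp):

```sh
# Inside the Vector pod: watch the kernel's UDP counters while the firehose runs.
netstat -u -s

# Baseline without Vector: swallow the same traffic with nc on the same port.
nc -ulp 8126 > /dev/null 2>&1

# Inside the firehose pod: generate the load against the service.
./statsd-firehose -countcount 100000 -distcount 10 -gaugecount 10 -statsd firehose:8126
```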
Configuration
Version
0.26.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response