vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Improve Vector's throughput for UDP protocols #8518

Open jszwedko opened 3 years ago

jszwedko commented 3 years ago

Currently, the syslog source processes UDP packets serially when its mode is set to udp:

https://github.com/timberio/vector/blob/410fac0b7fbb0c7361056105ba362f7ee6c112ca/src/sources/syslog.rs#L422-L444

It seems like we should be able to improve throughput by introducing concurrency here, given that processing each packet involves some decoding and parsing work.

This is also true for the socket source:

https://github.com/vectordotdev/vector/blob/801ee2178b8e30d2695a131185e09b11c7623ffd/src/sources/socket/udp.rs#L92

Perhaps one worker per "connection" similar to the tcp mode where we could use a source ip/port pair to partition them; or just by having a fixed (configurable?) number of workers processing packets.
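To make the two options concrete, here is a minimal sketch using only the Rust standard library. It is not Vector's code: `worker_for`, `spawn_workers`, `dispatch`, and the `parse` callback are hypothetical names, and plain threads with channels stand in for whatever async tasks the real sources would use. A fixed number of workers is spawned, and each datagram is routed to a worker by hashing its source ip/port pair, so packets from one sender are always handled by the same worker.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::SocketAddr;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical helper: map a datagram's source address to a worker index,
/// so packets from the same ip/port pair always land on the same worker.
fn worker_for(source: &SocketAddr, workers: usize) -> usize {
    assert!(workers > 0, "need at least one worker");
    let mut hasher = DefaultHasher::new();
    source.hash(&mut hasher);
    (hasher.finish() % workers as u64) as usize
}

/// Hypothetical fixed-size pool: each worker owns the receiving end of a
/// channel and runs `parse` (a stand-in for decoding and parsing) on every
/// payload it receives. Each join handle yields the number of payloads handled.
fn spawn_workers(
    workers: usize,
    parse: fn(&[u8]) -> usize,
) -> (Vec<Sender<Vec<u8>>>, Vec<JoinHandle<usize>>) {
    let mut senders = Vec::with_capacity(workers);
    let mut handles = Vec::with_capacity(workers);
    for _ in 0..workers {
        let (tx, rx) = channel::<Vec<u8>>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            let mut handled = 0usize;
            // The loop ends once every sender for this channel is dropped.
            for payload in rx {
                let _ = parse(&payload);
                handled += 1;
            }
            handled
        }));
    }
    (senders, handles)
}

/// Route each (source, payload) pair to a worker chosen by source address,
/// then shut the pool down and return the per-worker packet counts.
fn dispatch(
    packets: Vec<(SocketAddr, Vec<u8>)>,
    workers: usize,
    parse: fn(&[u8]) -> usize,
) -> Vec<usize> {
    let (senders, handles) = spawn_workers(workers, parse);
    for (source, payload) in packets {
        senders[worker_for(&source, workers)]
            .send(payload)
            .expect("worker exited early");
    }
    drop(senders); // closes the channels so the workers can finish
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker panicked"))
        .collect()
}
```

In a real source the receive loop would call something like `recv_from` on the UDP socket and pass the returned address and bytes to the routing step; the worker count passed to `dispatch` is where a configurable setting would plug in.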

hhromic commented 2 years ago

> Perhaps one worker per "connection" similar to the tcp mode where we could use a source ip/port pair to partition them; or just by having a fixed (configurable?) number of workers processing packets.

The idea of using the (source ip/port) tuple to mark a "connection" or "session" is indeed common in many applications. However, in my experience at my company, where we deal with a large number of observability data feeds, many of those systems send data from a single source ip/port that rarely rotates. So, parallelising by this tuple won't do much in these cases, unfortunately.

Mentioning that for your consideration on the design 👍
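The concern above can be illustrated with a small, self-contained sketch. The function names here (`load_by_source`, `load_round_robin`) are hypothetical and only count how many packets each worker would receive; they are not taken from Vector. When every packet comes from the same source ip/port, hashing by source puts the entire load on one worker, while a round-robin assignment spreads it evenly.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::SocketAddr;

/// Count packets per worker when the worker is chosen by hashing the
/// source address of each packet.
fn load_by_source(sources: &[SocketAddr], workers: usize) -> Vec<usize> {
    assert!(workers > 0, "need at least one worker");
    let mut load = vec![0usize; workers];
    for source in sources {
        let mut hasher = DefaultHasher::new();
        source.hash(&mut hasher);
        load[(hasher.finish() % workers as u64) as usize] += 1;
    }
    load
}

/// Count packets per worker when packets are handed out in turn,
/// ignoring where they came from.
fn load_round_robin(packets: usize, workers: usize) -> Vec<usize> {
    assert!(workers > 0, "need at least one worker");
    let mut load = vec![0usize; workers];
    for i in 0..packets {
        load[i % workers] += 1;
    }
    load
}
```

For example, 1000 packets from a single sender with 4 workers give one worker all 1000 packets under `load_by_source`, versus 250 each under `load_round_robin`. The trade-off is that round-robin no longer keeps a sender's packets on one worker, so any per-sender processing order is lost; UDP itself does not guarantee ordering, but whether that matters for these sources is a design question this sketch does not answer.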

davidpellcb commented 1 month ago

We're interested in potentially replacing the dogstatsd portion of the Datadog Agent with a Vector agent dedicated to receiving statsd application metrics, due to performance issues we've had with dogstatsd under load. It would be nice to see if this provided better performance once optimized!