vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Add configuration for limiting `reduce` transform #9498

Open jszwedko opened 3 years ago

jszwedko commented 3 years ago

A user in Slack noted that they would like to limit the reduce transform further by providing absolute limits to cap a set of messages that do not end up matching the multiline conditions. In their case, they saw Vector try to flush a 2 GB event during shutdown.

Ideas:

- An option to cap the number of events merged into a single reduced event (e.g. a `max_events` setting).
- An option to cap the total encoded byte size of a reduced event (e.g. an `expire_after_max_bytes` setting).

Either, or both, of these would help limit the reduce transform.

Another idea, following #3939, is to route these events to an "errors" output stream rather than through the transform's standard output stream.
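A sketch of what such limits might look like in a Vector config. Note that the limit options shown here (`max_events`, `expire_after_max_bytes`) are the hypothetical names discussed in this issue, not implemented settings; the transform and input names are placeholders:

```toml
# Hypothetical configuration sketch -- the two limit options below
# are proposals from this issue, not existing settings.
[transforms.merge_lines]
type = "reduce"
inputs = ["app_logs"]
merge_strategies.message = "concat_newline"

# Flush the accumulated event once either absolute limit is reached,
# so an unmatched multiline stream cannot grow without bound:
max_events = 1000                  # cap on the number of merged events
expire_after_max_bytes = 1048576   # cap on the encoded size (1 MiB)
```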

matt-simons commented 2 years ago

Could expire_after_max_bytes be implemented by a VRL function for size?

i.e., a more efficient version of `ends_when = "length(encode_json(.)) > 1000000"`
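For reference, here is what that VRL-based condition looks like in a reduce transform today (transform and input names are placeholders). A caveat worth noting: `ends_when` is evaluated against each incoming event, so this measures the size of a single event rather than the accumulated state:

```toml
[transforms.merge_lines]
type = "reduce"
inputs = ["app_logs"]
# End the current reduction when an incoming event, serialized as
# JSON, exceeds ~1 MB. This re-encodes the event on every check,
# which is the inefficiency the comment above refers to.
ends_when = 'length(encode_json(.)) > 1000000'
```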

thomasbarton commented 2 years ago

Just an observer of this ticket.

Would the VRL code be run inside the reduce function? If so that would probably work.

matt-simons commented 2 years ago

Yeah, that's what I was thinking

kevinburkesegment commented 1 year ago

In the absence of this feature, any suggestions on how to implement it ourselves? `ends_when = "length(encode_json(.)) > threshold"` only checks the size of the most recent event, not the accumulated batch, and both the `expire_after_ms` and `flush_period_ms` timers appear to reset whenever a new event arrives.

Is there even a way to see how many events have been aggregated so far in an ends_when check?

jches commented 1 year ago

A workaround I used for a while was to add a remap transform on the output of the reduce transform that split up any batches over a maximum size. That pipeline was feeding into a Kafka sink, which rejects messages that are too large, so the absence of this feature led to some data loss.

Depending on how your reduce is merging fields, a splitting remap transform may or may not be practical. Ours was appending event data into a single array field, so splitting it was straightforward. Still, the unbounded time/expiration-based grouping is less than ideal for this use case. We ended up running a custom build of Vector with this patch applied: https://github.com/vectordotdev/vector/pull/14817
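A rough sketch of that splitting workaround. It assumes the reduce output appends lines into a `.lines` array (the field name is hypothetical; adjust to your own merge strategy) and relies on remap's behavior of emitting one event per element when the program's result is an array:

```toml
[transforms.split_oversized]
type = "remap"
inputs = ["merge_lines"]
source = '''
# Sketch only: `.lines` is a hypothetical array field populated by
# an upstream reduce transform. If the merged batch is too large,
# split it in half; assigning an array to the root path causes
# remap to emit each element as a separate event.
max = 500
lines = array!(.lines)
if length(lines) > max {
  half = to_int(length(lines) / 2)
  first = .
  second = .
  first.lines = slice!(lines, 0, half)
  second.lines = slice!(lines, half)
  . = [first, second]
}
'''
```

This only splits one level deep (two halves); a batch more than twice the limit would need the splitting repeated, which is part of why a native `max_events`/`max_bytes` limit on the reduce transform itself is preferable.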