vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Change default merge_strategies in reduce transform #18239

Open ilinas opened 1 year ago

ilinas commented 1 year ago

Use Cases

Our applications log in the following format:

{
  "message": "Accepted connection from user bob on server charlie",
  "messageTemplate": "Accepted connection from user {userId} on server {serverId}",
  "userId": "bob",
  "serverId": "charlie"
}
{
  "message": "Accepted connection from user alice on server echo",
  "messageTemplate": "Accepted connection from user {userId} on server {serverId}",
  "userId": "alice",
  "serverId": "echo"
}

There are a lot of these messages, and most of the time they are not very useful. We would therefore like to reduce them, grouped by messageTemplate, as follows:

{
  "messageTemplate": "Accepted connection from user {userId} on server {serverId}",
  "message": [
    "Accepted connection from user alice on server echo",
    "Accepted connection from user bob on server charlie"
  ],
  "userId": [ "alice", "bob" ],
  "serverId": [ "charlie", "echo" ]
}

This works for message because it is a known property, and merge_strategies can be specified for known properties. However, userId and serverId are parameters of one specific message template. Other messages will have different parameters, and their names can be anything. By default, only the first value of such a field is preserved.
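
For illustration, a minimal sketch of what can be configured with explicit field names is shown below. The transform name message_reduce is taken from the proposal further down; the input name app_logs is an assumption made only for this example. Each template parameter has to be listed by name, which is exactly what does not scale when parameters differ from message to message.

```toml
# Sketch only: "app_logs" is a hypothetical input name.
[transforms.message_reduce]
type = "reduce"
inputs = ["app_logs"]
group_by = ["messageTemplate"]

[transforms.message_reduce.merge_strategies]
message = "flat_unique"
# Template parameters must be listed one by one, which requires
# knowing their names in advance.
userId = "flat_unique"
serverId = "flat_unique"
```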

Attempted Solutions

There does not seem to be a way to change the default merge strategy for fields that are not listed in merge_strategies.

Wildcards in field names also seem to be unsupported, so merge_strategies cannot be defined for these fields even if their names followed a common format such as var_userId and var_serverId (that is, var_*).

Proposal

We would like to be able to control the default merge strategy for fields that are not listed in merge_strategies, for example per value type:

[transforms.message_reduce.merge_strategies]
message = "flat_unique"

[transforms.message_reduce.default_merge_strategies]
string = "flat_unique"
numeric = "sum"

Being able to use wildcards for property names would also be useful.
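
As a rough illustration of the wildcard idea, the fragment below uses the var_* naming pattern mentioned under Attempted Solutions. This syntax is hypothetical: according to this issue it is not supported in the reported version (0.30.0), and the exact form a real implementation would take is an open question.

```toml
# Hypothetical syntax, shown only to illustrate the request.
# Wildcard keys are NOT a supported Vector feature per this issue.
[transforms.message_reduce.merge_strategies]
message = "flat_unique"
"var_*" = "flat_unique"
```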

References

No response

Version

0.30.0

Shakahs commented 8 months ago

I have a similar use case: my log stream creates multiple partial entries per second, and I need to roll them up into a single object emitted at an interval:

  readsb_merged:
    type: reduce
    inputs:
      - readsb_raw
    group_by:
      - hex
    flush_period_ms: 1000
    merge_strategies:
      "*": "retain"