vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

sources: add customisable `source_label` field to emitted events #11544

Open hhromic opened 2 years ago

hhromic commented 2 years ago

Use Cases

We need to identify which configured source an event comes from, for labelling and/or routing. In our main use case, we have two socket sources configured on different listening ports and we need to add a label field indicating which port each event came from. Some other use cases involve more than two sources.

In other use cases, we need to route events depending on which source they come from.

Attempted Solutions

At the moment, Vector does not provide any field that identifies the source an event comes from. It only provides the source-related source_type field, which is not useful here because it names the type of the source, not the configured source instance. For example:

{"host":"172.17.0.1","message":"hello world","source_type":"socket","timestamp":"2022-02-23T22:39:46.334288400Z"}

However, we can add remap transforms wired to each source to label the events and then merge. For example:

sources:
  socket-8080:
    type: socket
    mode: tcp
    address: 0.0.0.0:8080
  socket-8081:
    type: socket
    mode: tcp
    address: 0.0.0.0:8081

transforms:
  label-8080:
    type: remap
    inputs:
      - socket-8080
    source: '.label = "FROM-8080"'
  label-8081:
    type: remap
    inputs:
      - socket-8081
    source: '.label = "FROM-8081"'

sinks:
  console:
    type: console
    inputs:
      - label-*
    encoding: json

Which produces the following output when sending sample data to each port:

{"host":"172.17.0.1","label":"FROM-8080","message":"hello world","source_type":"socket","timestamp":"2022-02-23T22:47:03.000811600Z"}
{"host":"172.17.0.1","label":"FROM-8081","message":"hello world","source_type":"socket","timestamp":"2022-02-23T22:47:03.935660200Z"}

While this approach works, it has two drawbacks:

  1. It needs a rather dummy transform for each source, i.e. a lot of boilerplate.
  2. It degrades performance. In our tests we noticed a drop from 80K EPS to 76K EPS just by doing this labelling.

Proposal

Instead of the approach described above, we think that the simplest solution would be to add a new source_label field to the event, similar to the existing source_type field, containing a configurable label of the source that emitted the event. For example:

{"host":"172.17.0.1","message":"hello world","source_label":"FROM-8080","source_type":"socket","timestamp":"2022-02-23T22:47:03.000811600Z"}
{"host":"172.17.0.1","message":"hello world","source_label":"FROM-8081","source_type":"socket","timestamp":"2022-02-23T22:47:03.935660200Z"}

See https://github.com/vectordotdev/vector/issues/11544#issuecomment-1050709550 for an implementation idea.

This would make it trivial to identify the source in remap transforms, alongside any other remapping that is needed.
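
As a rough sketch of the routing use case under this proposal: assuming sources emitted the proposed source_label field (hypothetical, it does not exist in Vector today), a single route transform could split the merged stream. The component names, route names and label values below are illustrative only.

```yaml
transforms:
  by_source:
    type: route
    inputs:
      - socket-*
    route:
      # Each condition is a VRL expression evaluated per event.
      # `.source_label` is the field proposed in this issue (hypothetical).
      from_8080: '.source_label == "FROM-8080"'
      from_8081: '.source_label == "FROM-8081"'

sinks:
  console_8080:
    type: console
    inputs:
      # Only events matching the `from_8080` condition reach this sink.
      - by_source.from_8080
    encoding: json
```

The same route transform works today if the label is added by the per-source transforms shown earlier, by matching on `.label` instead.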

Version

vector 0.20.0 (x86_64-unknown-linux-gnu 2a706a3 2022-02-11)

jszwedko commented 2 years ago

👍 that is a neat idea. I think, until we have a better situation for event metadata, that we'd want to introduce another schema field for this. We'd likely want to also make it opt-in so users aren't surprised by the new field when sending to sinks that have strict schemas.

jszwedko commented 2 years ago

As a separate issue, that we are working on aspects of, there really shouldn't have been such a performance regression adding the remap transform there.

hhromic commented 2 years ago

> 👍 that is a neat idea. I think, until we have a better situation for event metadata, that we'd want to introduce another schema field for this. We'd likely want to also make it opt-in so users aren't surprised by the new field when sending to sinks that have strict schemas.

That sounds like a good approach!

> As a separate issue, that we are working on aspects of, there really shouldn't have been such a performance regression adding the remap transform there.

This is how I tested this. First, I created some random data for testing:

cat /dev/urandom | tr -dc 'a-z A-Z' | tr ' ' '\n' | head -n 5000000 > data.test

Then, the baseline configuration is this:

api:
  enabled: true

sources:
  logs:
    type: stdin

sinks:
  blackhole:
    type: blackhole
    inputs:
      - logs

Which I run like this with the testing data:

while true; do cat data.test; done | docker run --name vector --rm -i -v $PWD/vector.yaml:/vector.yaml:ro timberio/vector:0.20.0-distroless-libc -c /vector.yaml

On my computer, vector top reports an average of 88-90K EPS for the blackhole sink.

Now, the labelling configuration using remap is this:

api:
  enabled: true

sources:
  logs:
    type: stdin

transforms:
  label:
    type: remap
    inputs:
      - logs
    source: '.label = "SOME"'

sinks:
  blackhole:
    type: blackhole
    inputs:
      - label

On the same computer, vector top reports an average of 82-83K EPS for the blackhole sink with this configuration.

jszwedko commented 2 years ago

Ah, yes, sorry, I should have made it clearer in my comment that there is definitely known overhead in adding a remap transform; it is just larger than we would like for it to be. We are making improvements there though. For example rewriting VRL's interpreter as a VM.

That particular case could likely also be aided by removing one clone that happens due to VRL being an expression-based language so that .label = "SOME" itself returns a cloned "SOME" that in this case is being discarded. @JeanMertz do you know if we have an issue tracking that? I can't find anything.

Thanks for the perf reproduction case though 😄

hhromic commented 2 years ago

> Ah, yes, sorry, I should have made it clearer in my comment that there is definitely known overhead in adding a remap transform; it is just larger than we would like for it to be. We are making improvements there though. For example rewriting VRL's interpreter as a VM.

Ah yes, I have been looking forward to seeing that released!

Btw, I just remembered that there is an add_fields non-VRL transform that can be used as well:

sources:
  socket-8080:
    type: socket
    mode: tcp
    address: 0.0.0.0:8080
  socket-8081:
    type: socket
    mode: tcp
    address: 0.0.0.0:8081

transforms:
  label-8080:
    type: add_fields
    inputs:
      - socket-8080
    fields:
      source_id: socket-8080
  label-8081:
    type: add_fields
    inputs:
      - socket-8081
    fields:
      source_id: socket-8081

sinks:
  console:
    type: console
    inputs:
      - label-*
    encoding: json

While the boilerplate is pretty much the same, at least performance doesn't degrade as much as with remap :)

I just tested this using the configuration in my last comment and got 86-88K EPS using add_fields. I do know that those transforms are being deprecated and hence are no longer documented.

But for now I guess this is good enough for us until the feature requested here gets implemented (if that is decided).

hhromic commented 2 years ago

After developing our use-case more, I actually think that a better feature request would be the ability to add a customisable "label" to each event emitted by a source. So instead of a source_id field, a configurable source_label one for example. It could be provided at the source configuration like this:

sources:
  udp-socket:
    type: socket
    mode: udp
    address: 0.0.0.0:1514
    label: DATACENTER-1
  tcp-socket:
    type: socket
    mode: tcp
    address: 0.0.0.0:8080
    label: DATACENTER-2

If the label config is not provided (which could be the default), then no source_label field is added, making the feature opt-in.

The reason for this customisability is that we don't want to name our sources after the labels; the labels are instead placed inside the event for the next processing component.

jszwedko commented 2 years ago

👍 I wonder if we could just generalize this further and let you attach a VRL program to sources. It'd really just be syntax sugar over chaining a remap transform afterwards though. Granted, as you noted, we still need to improve the performance of remap over native code (it'll never be quite as fast, but can be closer).

hhromic commented 2 years ago

Hehe I wouldn't "overdo it" either :) As you say, it would be syntax sugar and I don't think simple things like labelling the source warrant a remap interpreter. I would keep it simple, as sources should be as performant as possible imho :)

smitsr72 commented 2 years ago

The labeling solution would help my use case too. Another option would be to always include the metadata that is already collected when an event is dropped by abort or assert, if we could opt in to having that metadata always put in the message schema.

    "metadata": {
        "dropped": {
            "component_id": "transform_name",
            "component_kind": "transform",
            "component_type": "remap",
            "message": "function call error for \"assert\" at (29:86): custom assert message",
            "reason": "error"
        }
    }

eocron commented 1 month ago

Any progress on this? I'm currently trying to add a common metric, log_last_timestamp{source="$source$"}, so that I can alert on failed synchronizations between file events and destination system events (Elastic).

jszwedko commented 1 month ago

Nothing yet. The workaround is pretty straightforward though: add a remap transform after each source you want to label.

eocron commented 1 month ago

I have a * mask. I grab all sources, so remap will not work for me, because I don't know them beforehand.