vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

configuration: ability to mark a component as `final`/`terminal` in the topology #10115

Open hhromic opened 3 years ago

hhromic commented 3 years ago

Current Vector Version

vector 0.18.0 (x86_64-unknown-linux-gnu c77e085 2021-11-18)

Use-cases

With the newly introduced ability to route dropped events in Vector 🎉, a very common use-case for it (in our opinion) is to better instrument our Vector pipelines.

For example, consider the following pipeline configuration where dropped messages are centrally logged by a logger transform, thus keeping the real-work transforms cleaner:

sources:
  stdin:
    type: stdin
    decoding:
      codec: bytes

transforms:
  parse-syslog:
    type: remap
    inputs:
      - stdin
    drop_on_error: true
    reroute_dropped: true
    source: |-
      . = parse_syslog!(.message)
  parse-key-value:
    type: remap
    inputs:
      - parse-syslog
    drop_on_error: true
    reroute_dropped: true
    source: |-
      . = parse_key_value!(.message)

  dropped-logger:
    type: remap
    inputs:
      - parse-syslog.dropped
      - parse-key-value.dropped
    drop_on_error: true
    source: |-
      host = string!(.host)
      message = string!(.message)
      component_id = string!(.metadata.dropped.component_id)
      log("component '" + component_id + "' dropped message from <" + host + ">: " + message, level: "warn")

sinks:
  console:
    type: console
    inputs:
      - parse-key-value
    encoding: json

However, the above topology has one small caveat: the dropped-logger transform does not feed anything downstream, because it exists only to call the log() function and nothing else. This leads to Vector emitting the following warning:

2021-11-19T13:52:23.397844Z  WARN vector::config::loading: Transform "dropped-logger" has no consumers

Attempted Solutions

The only attempted solution for now is to simply ignore the warning from Vector. However, we believe that Vector can be smarter here and only emit that warning when a component truly needs a consumer down the line. See the proposal section below.

Proposal

To help Vector determine if a component should have consumers or not, we propose the addition of a new boolean configuration key named is_final or is_terminal (to give some examples). The purpose of this configuration is to mark a component either as final/terminal or not in the topology. For example:

  dropped-logger:
    type: remap
    inputs:
      - parse-syslog.dropped
      - parse-key-value.dropped
    drop_on_error: true
    is_final: true
    is_terminal: true  # alternative
    source: |-
      host = string!(.host)
      message = string!(.message)
      component_id = string!(.metadata.dropped.component_id)
      log("component '" + component_id + "' dropped message from <" + host + ">: " + message, level: "warn")

In this way, if a component is marked final/terminal, then Vector can safely ignore the fact that it doesn't have any consumers and avoid emitting the warning.

The above proposal would be quite general-purpose functionality for Vector. However, another solution for the particular use-case described is to introduce an internal_logs sink to which dropped messages could be routed. This sink could have all the typical configuration options, such as which fields to include in the log.

spencergilbert commented 3 years ago

@hhromic side-noting that this could also be done by having a console sink as your "logger" 😄
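
A minimal sketch of that workaround, assuming the pipeline from the issue description: the dropped streams feed a console sink directly instead of the dropped-logger transform. The sink name dropped-console and the choice of stderr are illustrative assumptions, not something stated in the thread.

```yaml
sinks:
  # Hypothetical sink name; takes the place of the dropped-logger transform.
  dropped-console:
    type: console
    inputs:
      - parse-syslog.dropped
      - parse-key-value.dropped
    # Assumption: write to stderr so dropped events stay apart from the main stdout output.
    target: stderr
    encoding: json
```

Because every dropped stream now ends in a sink, no transform is left without a consumer. The events are printed as raw JSON rather than going through Vector's own logger.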

hhromic commented 3 years ago

@spencergilbert I did think of that workaround but as you surely will agree, it is quite ugly and hacky :) You lose all of the niceties of the built-in logging framework such as timestamping, coloring, level-filtering and rate limiting.

I am aware that my proposal/enhancement request might be a bit too "over the top". After all, it is just to eliminate a warning. But I also believe that Vector would be more correct and the DAG more expressive (in terms of what the pipeline architect wanted to convey) with such marker feature.

For example, if a transform is marked as final/terminal but is still used as an input for another transform/sink, Vector could also warn that this may not be what the programmer intended.
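
To illustrate, here is a sketch of a configuration that such a check could flag. Note that is_final is only the key proposed in this issue and is not an existing Vector option, and the sink name oops is made up for the example.

```yaml
transforms:
  dropped-logger:
    type: remap
    inputs:
      - parse-syslog.dropped
    # Hypothetical key from this proposal; Vector does not accept it today.
    is_final: true
    source: |-
      log(.message, level: "warn")

sinks:
  # Hypothetical sink that consumes a component marked as final.
  oops:
    type: console
    inputs:
      - dropped-logger  # the proposed check could warn about this input
    encoding: json
```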

spencergilbert commented 3 years ago

> @spencergilbert I did think of that workaround but as you surely will agree, it is quite ugly and hacky :) You lose all of the niceties of the built-in logging framework such as timestamping, coloring, level-filtering and rate limiting.

Agreed! Definitely worth consideration

jszwedko commented 3 years ago

Yeah, this is an interesting use-case.

As another work-around for now, you can send to a blackhole sink which would alleviate the warning on start-up.
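
A minimal sketch of this workaround, assuming the dropped-logger transform from the issue description stays in place. The sink name dropped-blackhole is a made-up example.

```yaml
sinks:
  # Hypothetical sink name; it only gives dropped-logger a consumer.
  dropped-blackhole:
    type: blackhole
    inputs:
      - dropped-logger
```

The blackhole sink discards everything it receives, so the log() calls in dropped-logger still run and the "has no consumers" warning should no longer appear at start-up.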

hhromic commented 3 years ago

Ah yes, I didn't think of using a blackhole sink. I think I still prefer a one-time warning on start-up over permanently spinning up more resources just to silence it hehe.

I kept toying with the idea of an internal_logs sink in my head; it could work like this:

sources:
  stdin:
    type: stdin
    decoding:
      codec: bytes

transforms:
  parse-syslog:
    type: remap
    inputs:
      - stdin
    drop_on_error: true
    reroute_dropped: true
    source: |-
      . = parse_syslog!(.message)
  parse-key-value:
    type: remap
    inputs:
      - parse-syslog
    drop_on_error: true
    reroute_dropped: true
    source: |-
      . = parse_key_value!(.message)

sinks:
  console:
    type: console
    inputs:
      - parse-key-value
    encoding: json
  dropped-logger:
    type: internal_logs
    inputs:
      - parse-syslog.dropped
      - parse-key-value.dropped
    encoding:
      only_fields:
        - host
        - message
        - metadata.dropped
    level: warn

In this case we don't need the dropped-logger transform at all and the dropped streams can go directly to the sink. The level configuration could even be set dynamically from a field as well.

The only problem I can think to be careful about is a potential feedback-loop that could be caused if a user wires a path from the internal_logs source into this sink.

Hope you like these ideas and of course, there is no rush! I understand there are other priorities for the project 👍

jszwedko commented 2 years ago

Thanks for the thoughts! That is definitely a neat idea.

🤔 You can create a similar feedback loop right by using log() in a remap transform that is fed from internal_logs.
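
A rough sketch of the kind of topology meant here, with made-up component names (vector-logs, relog): each internal log event passes through a remap transform that calls log(), which in turn produces a new internal log event.

```yaml
sources:
  # Hypothetical source name; collects Vector's own log events.
  vector-logs:
    type: internal_logs

transforms:
  # Hypothetical transform name; logging here feeds the source above again.
  relog:
    type: remap
    inputs:
      - vector-logs
    source: |-
      log(.message, level: "warn")
```

This is shown only to make the loop concrete, not as something to deploy.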

spencergilbert commented 2 years ago

> Thanks for the thoughts! That is definitely a neat idea.
>
> 🤔 You can create a similar feedback loop right by using log() in a remap transform that is fed from internal_logs.

🙊 don't give away our secrets!