redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

Configuration section for variables #1366

Open ppavlov39 opened 2 years ago

ppavlov39 commented 2 years ago

Hello! We use Benthos in streams mode to process several data streams. Is there a way to set a variable (or something like that) that we can use to configure input parameters in a stream? We can set values in message metadata, but those can't be used to configure an input before the first message arrives from that input.

For example, in the mongodb input most of the config is common between streams, but the collection name must be specific to each stream.

mihaitodor commented 2 years ago

Would environment variables help you? The config supports interpolation from env vars for fields.
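
For reference, a minimal sketch of that interpolation, assuming the mongodb input fields discussed above (the variable names are made up, and `${VAR:default}` supplies a fallback when the variable is unset):

```yaml
input:
  mongodb:
    url: ${MONGO_URL}
    database: ${MONGO_DB:analytics}  # falls back to "analytics" when unset
    collection: ${MONGO_COLLECTION}
```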

ppavlov39 commented 2 years ago

> Would environment variables help you? The config supports interpolation from env vars for fields.

Thanks for the answer.

Unfortunately, no. We already use environment variables to set some parameters. Right now we need to choose which input Benthos should use, and environment variables handle that perfectly. But if we have multiple streams, we must prepare two input component configs for each stream, and if there are differences in the output section we also need two configurations for each case, because Benthos initializes inputs and outputs before it processes any mappings or variables. Such a config becomes very confusing.

If we could do some variable handling, driven by environment variables, before the input and output components are initialized, that would be great.
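
To illustrate the duplication being described, here is a hypothetical pair of streams-mode configs that are identical except for one field:

```yaml
# streams/stream_a.yaml
input:
  mongodb:
    url: ${MONGO_URL}
    database: ${MONGO_DB}
    collection: stream_a_events
---
# streams/stream_b.yaml — identical except for the collection name
input:
  mongodb:
    url: ${MONGO_URL}
    database: ${MONGO_DB}
    collection: stream_b_events
```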

mihaitodor commented 2 years ago

Oops, I somehow missed your reply. I wonder if yaml anchors and aliases would help you.
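
A sketch of what that could look like, assuming the config loader resolves standard YAML anchors and merge keys (the `common` key and collection name are made up, and a strict linter may complain about the unknown top-level key):

```yaml
common: &mongo_common
  url: ${MONGO_URL}
  database: ${MONGO_DB}

input:
  mongodb:
    <<: *mongo_common
    collection: stream_a_events
```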

Otherwise, one hack that comes to mind is to use dynamic inputs and outputs in your streams and then have Benthos craft configs for them in a manager stream and post them to the appropriate worker stream via the embedded REST API using the http_server output. That might be a bit convoluted, so not sure you want to go down that path. Alternatively, if you're only interested in a few fields from certain inputs, we could enhance them to support interpolation and then you could use the bloblang env function to produce their value.
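
A rough sketch of that hack, assuming the documented dynamic input endpoints (the worker URL, port, and stream id here are hypothetical):

```yaml
# Worker stream: starts empty; input configs are injected at runtime
# via the embedded REST API (e.g. POST /inputs/{input_id}).
input:
  dynamic:
    prefix: ""
---
# Manager stream: crafts a per-stream input config as the message body
# and posts it to the worker's API.
output:
  http_client:
    url: http://worker:4195/inputs/stream_a
    verb: POST
```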

disintegrator commented 2 years ago

What do you think of this idea for config variables? This isn't based on any personal needs; just musings…

```yaml
variables:
  version: |
    root = env("GIT_SHA").or("unknown")
  topic: |
    root = "%s_myconfig".format(env("BASE_TOPIC_NAME"))

input:
  gcp_pubsub:
    topic: ${! var("topic") }
  processors:
    - mapping: |
        meta pipeline_version = var("version")
```

The benefit is that you can write full Bloblang mappings that yield a string, rather than noisy interpolated strings. You would still use interpolated strings to reference variables, and you could also refer to them in other mappings/mutations.

It would also be possible to define variables in resource files so they're shareable between multiple configs.

mannharleen commented 2 years ago

Can't one use the cache as a global variable context?

disintegrator commented 2 years ago

@mannharleen not if you want those variables to configure inputs or outputs. Technically, messages can carry config over to outputs (in metadata, for example), but that's messy if the config is unrelated to the message in any way, i.e. you're sideloading config onto messages.

Also, getting config from caches would require messy branch processors.
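
For context, the pattern being called messy looks roughly like this (the cache resource name and key are hypothetical):

```yaml
pipeline:
  processors:
    - branch:
        request_map: root = ""  # ignore the message body for the lookup
        processors:
          - cache:
              resource: config_cache
              operator: get
              key: topic_name
        result_map: meta topic = content().string()

cache_resources:
  - label: config_cache
    memory: {}
```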

mannharleen commented 2 years ago

That's true. However, I was alluding to having a cache function available in Bloblang, which would then make it viable to configure inputs or outputs without messy branching.

I'm all for having a global variable context; I just jumped straight into an alternative-solution mode.

ppavlov39 commented 2 years ago

Sorry for not replying for so long, and thanks for the answers.

> Oops, I somehow missed your reply. I wonder if yaml anchors and aliases would help you.
>
> Otherwise, one hack that comes to mind is to use dynamic inputs and outputs in your streams and then have Benthos craft configs for them in a manager stream and post them to the appropriate worker stream via the embedded REST API using the http_server output. That might be a bit convoluted, so not sure you want to go down that path. Alternatively, if you're only interested in a few fields from certain inputs, we could enhance them to support interpolation and then you could use the bloblang env function to produce their value.

I thought about YAML anchors, but they can't solve the main problem: the config would be shorter but still too confusing. Dynamic configuration isn't applicable in my situation, because the service runs in Kubernetes and should be controlled only via configs and environment variables.

> What do you think of this idea for config variables? This isn't based on any personal needs; just musings…
>
> ```yaml
> variables:
>   version: |
>     root = env("GIT_SHA").or("unknown")
>   topic: |
>     root = "%s_myconfig".format(env("BASE_TOPIC_NAME"))
>
> input:
>   gcp_pubsub:
>     topic: ${! var("topic") }
>   processors:
>     - mapping: |
>         meta pipeline_version = var("version")
> ```
>
> The benefit is that you can write full Bloblang mappings that yield a string, rather than noisy interpolated strings. You would still use interpolated strings to reference variables, and you could also refer to them in other mappings/mutations.
>
> It would also be possible to define variables in resource files so they're shareable between multiple configs.

I think it's a great idea to implement variable parsing as the first step of config processing. That would solve the problem, I think.

darcyg commented 2 years ago

When will global variables be supported? The current configuration is too cumbersome. I am not a data-processing user; I am a home-automation user who mostly uses MQTT, HTTP REST APIs, and Redis. I really like some of the Benthos designs, and the input and output plugins are almost perfect, but the configuration is too painful: very inflexible, with lots of repetitive typing.

I recommend two structured templating libraries for YAML that can be used to normalize YAML templates: https://github.com/mandelsoft/spiff and https://github.com/vmware-tanzu/carvel-ytt. Personally, I think that if Benthos processed its configs with the spiff library, it would be better than the current practice.

I think the YAML config could be structured with spiff. And if output in other text formats is needed, YAML-driven templating tools could be used, such as https://github.com/subchen/frep or https://github.com/mmalcek/bafi.
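
To make the suggestion concrete, a minimal ytt sketch (file names and values are made up; `#!` is a ytt comment, and the template is rendered with `ytt -f config.yaml -f values.yaml`):

```yaml
#! values.yaml — per-stream values
#@data/values
---
mongo_url: mongodb://localhost:27017
collection: stream_a_events
```

```yaml
#! config.yaml — the templated Benthos config
#@ load("@ytt:data", "data")
input:
  mongodb:
    url: #@ data.values.mongo_url
    collection: #@ data.values.collection
```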

disintegrator commented 2 years ago

@darcyg we haven’t prioritised this issue yet. If you want something more convenient to work with than YAML, then consider using CUE, which Benthos supports. With CUE, you’ll have type-safe configs and the ability to reuse values to cut down your config.

https://www.benthos.dev/docs/configuration/using_cue
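
As a minimal sketch of the reuse that page describes (the values and file name are made up; the YAML config is produced with `cue export config.cue --out yaml`):

```cue
// config.cue — shared values declared once, reused per stream.
_mongo: {
	url:      "mongodb://localhost:27017"
	database: "analytics"
}

input: mongodb: _mongo & {
	collection: "stream_a_events"
}
```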