redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

Auto Retry Toggle #2770

Open JuchangGit opened 2 months ago

JuchangGit commented 2 months ago

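Could you add a toggle, or some other way to control automatic retries? Currently, Bento keeps retrying indefinitely whenever an input or output error occurs, which is not appropriate for some scenarios. For example, when the input is a database and the upstream changes the table schema, Bento will keep retrying, and the upstream may mistake this for a malicious attack.

Could you add a setting that limits the number of retries, or a switch that turns automatic retries on or off, to control Bento's default behavior?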

mihaitodor commented 2 months ago

Via Google Translate:

Can you add an automatic retry switch or control method? Because the current working method of Bento is that it will keep retrying when input and output errors occur, which is not suitable for some scenarios. For example: when the input is a database, if the upstream changes the table structure, then Bento will keep retrying, and the upstream may think this is a malicious attack

Can you add a switch to control the number of retries or whether to retry automatically to control the default behavior of bento?

Hey @JuchangGit 👋 Thank you for reaching out!

it will keep retrying when input and output errors occur

Not sure what you mean by input error. Connect will keep trying to connect to an input until it succeeds or until the process is terminated. That is by design. It's up to the users to leverage either the /ready HTTP endpoint or metrics (or, worst case, logs) and take the appropriate action when this situation occurs.
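For reference, a minimal sketch of the top-level config sections involved. `/ready` serves a 200 only while both the input and the output are connected and a 503 otherwise, so an external monitor can poll it. The values below are the usual defaults written out explicitly; treat them as assumptions to adapt to your deployment.

```yaml
http:
  enabled: true
  address: 0.0.0.0:4195 # default address; `GET /ready` is served here
metrics:
  prometheus: {} # should expose connection metrics at `/metrics` on the same server
```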

For outputs, once Connect has established a connection, writes to the output may still return errors. In such cases, you have various meta outputs such as `drop_on`, `fallback`, `reject_errored` or `retry` to control what happens to the current message. Additionally, for example with `fallback`, you could have something like this:

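```yaml
output:
  switch:
    cases:
      - check: metadata("status") != "BROKEN" # Skip the real output while it's marked as busted
        output:
          fallback:
            - your_actual_output: {} # Placeholder: replace with your real output
            - drop: {} # Feel free to replace this with a dead letter queue output (e.g. `kafka_franz`)
              processors:
                # Mark the above output as busted in an in-memory cache. The TTL (60s here, adjust as needed)
                # lets the key expire so `your_actual_output` is attempted again after this period lapses.
                - cache:
                    resource: status_cache
                    operator: set
                    key: status
                    value: BROKEN
                    ttl: 60s
      - output: # This is the catch-all output, used when `metadata("status") == "BROKEN"`
          drop: {} # Feel free to replace this with a dead letter queue output (e.g. `kafka_franz`)

  processors:
    # Fetch the `status` key from the in-memory cache and set it in a metadata field called `status`
    - branch:
        processors:
          - cache: { resource: status_cache, operator: get, key: status }
        result_map: meta status = content().string()
    - catch: [] # A missing key surfaces as an error; clear it so the message still reaches the real output

cache_resources:
  - label: status_cache
    memory: {}
```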
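The simpler meta outputs can also be used on their own, without a cache. A minimal sketch of `drop_on`, which acks and drops a message when the child output fails instead of retrying it forever, so only use it where losing those messages is acceptable. The `http_client` child and its URL are hypothetical placeholders; `back_pressure` covers the case where the output blocks, for example while waiting for a lost connection to be re-established.

```yaml
output:
  drop_on:
    error: true        # drop (ack) a message when the child output returns an error
    back_pressure: 30s # drop messages if the child output blocks for longer than 30s
    output:
      http_client:     # placeholder child output
        url: http://localhost:8080/ingest # hypothetical endpoint
        verb: POST
```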

Can you add a switch to control the number of retries or whether to retry automatically to control the default behavior of bento?

You have full flexibility, as described above. Another approach is to use the `retry` output I mentioned and configure its exponential backoff (https://docs.redpanda.com/redpanda-connect/components/outputs/retry/#backoff) together with `max_retries`. You can make it a child of a `fallback` output so that, once `max_retries` is exhausted, the messages are redirected to a dead letter queue (or just dropped).
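A minimal sketch of that arrangement, assuming an `http_client` output with a hypothetical URL as the real destination. `max_retries` and `backoff` are fields of the `retry` output; the specific values are illustrative.

```yaml
output:
  fallback:
    - retry:
        max_retries: 5 # give up after 5 failed retries (0 means no limit)
        backoff:
          initial_interval: 1s # first wait between attempts
          max_interval: 30s    # upper bound for the exponentially growing wait
        output:
          http_client: # placeholder: replace with your real output
            url: http://localhost:8080/ingest # hypothetical endpoint
            verb: POST
    # Reached only once `retry` gives up; swap `drop` for a dead letter queue output (e.g. `kafka_franz`)
    - drop: {}
```

Some outputs, `http_client` included, may also have their own internal retry settings, so the total number of attempts can be higher than `max_retries` alone suggests.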

JuchangGit commented 2 months ago

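Could you add a configuration option for both input and output, a maximum retry count `max_retry_num`, where the default value of -1 means retrying forever (the same as the current behavior), so that users can control the number of retries? The configuration would look like this: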

```yaml
input:
  max_retry_num: 2
  stdin:
    scanner:
      lines: {}
    auto_replay_nacks: true
buffer:
  none: {}
pipeline:
  threads: -1
  processors: []
output:
  max_retry_num: 3
  stdout:
    codec: lines
```
mihaitodor commented 2 months ago

Unfortunately, no, that's not currently possible, as I mentioned above:

Connect will keep trying to connect to an input until it succeeds or until the process is terminated. That is by design. It's up to the users to leverage either the /ready HTTP endpoint or metrics (or, worst case, logs) and take the appropriate action when this situation occurs.

You can, however, use Streams Mode to have a separate watchdog stream which uses the generate input combined with the http processor to query the /ready HTTP endpoint and take the appropriate action when this indicates that the input isn't connected. You can even have this http processor retry several times to be sure that the connectivity issue isn't transient.
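A minimal sketch of such a watchdog stream, assuming the default HTTP server address `localhost:4195` and a hypothetical file name. The `log` step is only a placeholder for whatever action fits your setup; the polling interval and retry values are illustrative.

```yaml
# watchdog.yaml: hypothetical file, loaded in Streams Mode next to the real stream configs
input:
  generate:
    interval: 30s # how often to check readiness
    mapping: root = ""
pipeline:
  processors:
    - http:
        url: http://localhost:4195/ready # assumes the default HTTP server address
        verb: GET
        retries: 3 # retry a few times so a transient blip isn't treated as an outage
        retry_period: 10s
    - switch:
        - check: errored() # the request still failed after all retries
          processors:
            - log: # placeholder action
                level: ERROR
                message: 'Readiness check failed: ${! error() }'
output:
  drop: {}
```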