knative-extensions / eventing-kafka-broker

Alternate Kafka Broker implementation.
Apache License 2.0
167 stars 111 forks source link

Do delivery specs in broker configs apply to event processing errors as well as delivery failures? #3746

Open gilesrapkinb opened 4 months ago

gilesrapkinb commented 4 months ago

Describe the bug

This is more of a question than a bug report.

I've recently set up Knative eventing with Kafka brokers; one main broker, with another Kafka broker configured as a dead letter sink. Also a delivery spec with retries, exponential backoff etc. configured.

I was trying to test the dead letter sink setup with a test function that always returns a 500 in response to any events. The retry behaviour is not quite what I expected. Events seem to get retried forever instead of the retry count specified in the delivery spec.

I am able to get events to get sent to the dead letter sink under other scenarios, e.g. setting the test function to timeout, or respond in a way that doesn't conform to the delivery contract (like responding with a 200 without a cloudevent in the response body).

Is a function responding with a 500 error not considered a delivery failure? If not, what is the expected behaviour in this scenario? Is there another way to configure retry/backoff/dead letter behaviour for event processing failures?

Expected behavior

I initially expected the delivery spec to also apply to events that weren't processed successfully by a function that received the event (responded with a 500).

To Reproduce

  1. Create 2 kafka based brokers, and configure a delivery policy on the main broker.
    apiVersion: eventing.knative.dev/v1
    kind: Broker
    metadata:
      annotations:
        eventing.knative.dev/broker.class: Kafka
      labels:
        app: knative-broker
      name: knative-broker
      namespace: default
    spec:
      config:
        apiVersion: v1
        kind: ConfigMap
        name: knative-broker
        namespace: default
      delivery:
        backoffDelay: PT0.5S
        backoffPolicy: exponential
        deadLetterSink:
          ref:
            apiVersion: eventing.knative.dev/v1
            kind: Broker
            name: knative-broker-deadletter
            namespace: default
        retry: 12
    ---
    apiVersion: eventing.knative.dev/v1
    kind: Broker
    metadata:
      annotations:
        eventing.knative.dev/broker.class: Kafka
      labels:
        app: knative-broker
      name: knative-broker-deadletter
      namespace: default
    spec:
      config:
        apiVersion: v1
        kind: ConfigMap
        name: knative-broker
        namespace: default
      delivery:
        backoffDelay: PT0.2S
        backoffPolicy: exponential
        retry: 10
  2. Create a function that always returns a 500 in response to an event, deploy, with a trigger
  3. Send a test event to the broker, see it get repeatedly retried and not sent to the dead letter sink

    Knative release version

1.13.1

Additional context

N/A

pierDipi commented 4 months ago

Thanks for reporting @gilesrapkinb, this is not expected, would you mind also sharing the step 2 and 3, so that we could just run them and debug the issue?

gilesrapkinb commented 4 months ago

For step 2 you could use something like nicholasjackson/fake-service.

Deploy the function:

kn service create test500 \
--image nicholasjackson/fake-service:v0.26.2 \
--env 'LISTEN_ADDR=0.0.0.0:8080' \
--env 'ERROR_RATE=1'

Set up a trigger:

kn trigger create test500 \
--sink test500  \
--broker YOUR_BROKER

Then for step 3 push a test event to your broker.

In the kafka-broker-dispatcher logs you'll see repeated attempts to deliver the message (more than the retry setting in the broker delivery config above), with errors logged for the response from the function that look something like this:

{
  "@timestamp": "2024-03-12T19:40:36.969Z",
  "@version": "1",
  "message": "Received a failure status code that is not 2xx. failed to send event to subscriber context={topics=[knative-broker-dev-knative-broker], consumerGroup='knative-trigger-dev-test500', reference=uuid: \"f4666f13-7438-4a31-bb05-e248383d3466\"\nnamespace: \"dev\"\nname: \"test500\"\n} target=http://test500.dev.svc.cluster.local statusCode=500",
  "logger_name": "dev.knative.eventing.kafka.broker.dispatcher.impl.http.WebClientCloudEventSender",
  "thread_name": "vert.x-worker-thread-3",
  "level": "ERROR",
  "level_value": 40000,
  "context": {},
  "target": "http://test500.dev.svc.cluster.local",
  "statusCode": 500
}
github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

gilesrapkinb commented 2 weeks ago

/remove-lifecycle stale

gilesrapkinb commented 1 week ago

@pierDipi has anybody been able to reproduce this yet?