redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

Benthos doesn't resume consuming after redpanda runs out of space on disk #919

Open ZakMiller opened 3 years ago

ZakMiller commented 3 years ago

We're using Benthos as a Redpanda consumer, and we recently hit a situation where Redpanda ran out of space on disk, causing the (k8s) pod to become unhealthy. It stayed in that state for a while, I think a few days.

We resolved the issue by resizing the PVC and restarting the Redpanda pod, which fixed the broker, but we noticed that Benthos wasn't consuming anymore. We got it going again by restarting the Benthos pod, but that shouldn't be necessary, right?

I took a look at the docs but didn't see anything describing the expected behavior in a situation like this.

I opened an issue because @mihaitodor suggested it.

Any help would be appreciated!

ZakMiller commented 3 years ago

I originally mentioned that the logs might point to the problem.

We were seeing this:

{"@timestamp":"2021-10-18T19:30:20Z","@service":"benthos","component":"benthos.input","level":"DEBUG","message":"Starting consumer group"}

rather than this:

{"@timestamp":"2021-10-22T19:06:25Z","@service":"benthos","component":"benthos.input","level":"DEBUG","message":"Starting consumer group"}
{"@timestamp":"2021-10-22T19:06:25Z","@service":"benthos","component":"benthos.input","level":"DEBUG","message":"Consuming messages from topic 'cluster_logs' partition '0'"}

I now think that's a red herring. We have a few Benthos instances acting as Redpanda consumers on different topics, and I'm seeing different logging behavior between them (with the same log levels). When I deleted the Redpanda pod and it restarted, I saw log messages without the "Consuming messages from topic..." line, yet Benthos was still consuming messages. So that may be a separate bug, but I don't think it's connected to this one.

mihaitodor commented 3 years ago

Thanks for the writeup @ZakMiller! Just to clarify, have you tried to reproduce it and now you're seeing different behaviour? Also, would you mind sharing a simplified Benthos config as well as the config you're using for the RedPanda docker container, so I can try to reproduce this locally using the same setup you have?
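
For the Benthos side, something as stripped down as the sketch below (a bare kafka input pointed at your broker with a stdout output) would be enough for me to work with — the address, topic and consumer group here are just placeholders to adjust to your setup:

input:
  kafka:
    addresses:
      - redpanda.redpanda.svc.cluster.local:9092 # placeholder broker address
    topics: [ events ]                           # placeholder topic
    consumer_group: benthos_repro                # placeholder consumer group
output:
  stdout:
    codec: lines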

ZakMiller commented 3 years ago

I tried to reproduce it by simply shutting Redpanda off for a period of time (rather than the original scenario, where it was down for several days because the disk filled up). Here's the Redpanda Cluster resource we're using, followed by the Benthos Deployment and ConfigMap:

apiVersion: redpanda.vectorized.io/v1alpha1
kind: Cluster
metadata:
  name: redpanda
  namespace: redpanda
spec:
  image: <redpanda image>
  version: "latest"
  replicas: 1
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: '2'
      memory: 4Gi
  configuration:
    rpcServer:
      port: 33145
    kafkaApi:
      - port: 9092
    pandaproxyApi:
      - port: 8082
    adminApi:
      - port: 9644
    autoCreateTopics: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: benthos-stream-events
  namespace: dekn-app
  labels:
    app: benthos-stream-events
spec:
  replicas: 1
  selector:
    matchLabels:
      app: benthos-stream-events
  template:
    metadata:
      labels:
        app: benthos-stream-events
    spec:
      containers:
        - name: benthos-stream-events
          image: <benthos image>
          volumeMounts:
            - name: benthos-stream-events-conf
              mountPath: /benthos.yaml
              subPath: benthos.yaml
              readOnly: true
      volumes:
        - name: benthos-stream-events-conf
          configMap:
            name: benthos-stream-events-config
            items:
              - key: benthos.yaml
                path: benthos.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: benthos-stream-events-config
#https://www.benthos.dev/docs/components/outputs/http_client
data:
  benthos.yaml: |-
    logger:
      level: ALL
    input:
      kafka:
        addresses:
          - redpanda.redpanda.svc.cluster.local:9092
        topics: [ events ]
        consumer_group: benthos_stream_http_events
    output:
      try:
        - http_client:
            url: <event-processor-url>
            verb: POST
            retries: 3
            oauth2:
              enabled: true
              client_key: "${KEYCLOAK_CLIENT_ID}"
              client_secret: "${KEYCLOAK_CLIENT_SECRET}"
              token_url: <token-url>
            headers:
              Content-Type: application/json
            rate_limit: ""
            timeout: 5s
            max_in_flight: 1
            retry_period: 1s
        - kafka:
            addresses:
              - redpanda.redpanda.svc.cluster.local:9092
            topic: async_events_dead
        - stdout:
            codec: lines
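
For what it's worth, a workaround I'm considering is pointing a liveness probe at Benthos's /ready endpoint (which, as far as I understand, returns a non-200 status while the input or output is disconnected, assuming the built-in HTTP server is left on its default port 4195) so Kubernetes restarts the pod automatically. A rough sketch of the extra bits in the container spec:

containers:
  - name: benthos-stream-events
    # ... image and volumeMounts as above ...
    livenessProbe:
      httpGet:
        path: /ready   # reports whether the input and output are connected
        port: 4195     # default Benthos HTTP server port
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 10

That would only paper over the reconnect issue though (and during a long broker outage the pod would just restart in a loop), so I'd still like to understand why the consumer never comes back on its own.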

I appreciate the help, @mihaitodor!