redpanda-data/redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

`Assert failure: (../../../src/v/raft/vote_stm.cc:278) '_ptr->_confirmed_term == _ptr->_term'` when a broker restarts after a failure to write to disk #7407

Open chris-kimberley opened 1 year ago

chris-kimberley commented 1 year ago

Version & Environment

Redpanda version: rpk version reports latest (rev c8d4be2). This image was built by @travisdowns with the version name v28_df58cced6e47

OS:

uname -a: Linux redpanda-0 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux

/etc/os-release:

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

The issue occurred on a broker running in a self-hosted Kubernetes cluster. We're using a locally built Helm chart based on the standard Redpanda Helm chart. Each pod has a dedicated persistent 2 TB SSD formatted with XFS.

kubectl version:

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:16:20Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.11", GitCommit:"5824e3251d294d324320db85bf63a53eb0767af2", GitTreeState:"clean", BuildDate:"2022-06-16T05:33:55Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

StatefulSet manifest (redacted):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redpanda
  namespace: "redacted"
  labels:
    helm.sh/chart: redacted
    app.kubernetes.io/name: redacted
    app.kubernetes.io/instance: "redacted"
    app.kubernetes.io/managed-by: "Tiller"
    app.kubernetes.io/component: redacted
    env: prod

spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: redacted
      app.kubernetes.io/instance: "redacted"
  serviceName: redpanda
  replicas: 32
  updateStrategy:
    type: OnDelete

  podManagementPolicy: "Parallel"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: redacted
        app.kubernetes.io/instance: "redacted"
        app.kubernetes.io/component: redacted
        env: prod

    spec:
      securityContext:
        fsGroup: 101

      # TODO:
      # * Figure out what to do about node_id / seeds here - the operator will fix this separately
      # * Once that's done, this initContainer can be removed
      initContainers:
        - name: redpanda-configurator
          image: our-local-docker-hub/vectorized/redpanda:v28_df58cced6e47
          command: ["/bin/sh", "-c"]
          env:
            - name: SERVICE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          args:
            - >
              CONFIG=/etc/redpanda/redpanda.yaml;
              NODE_ID=${SERVICE_NAME##*-};
              cp /tmp/base-config/redpanda.yaml "$CONFIG";
              echo 1048576 > /proc/sys/fs/aio-max-nr;
              rpk --config "$CONFIG" config set redpanda.node_id $NODE_ID;
              if [ "$NODE_ID" = "0" ]; then
                rpk --config "$CONFIG" config set redpanda.seed_servers '[]' --format yaml;
              fi;
          volumeMounts:
            - name: redpanda
              mountPath: /tmp/base-config 
            - name: config
              mountPath: /etc/redpanda
          resources:
            limits:
              cpu: 16
              memory: 32Gi
            requests:
              cpu: 16
              memory: 32Gi

      containers:
        - name: redpanda
          image: our-local-docker-hub/vectorized/redpanda:v28_df58cced6e47
          env:
            - name: SERVICE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          args:
            - >
              redpanda
              start
              --smp=32
              --memory=250G
              --reserve-memory=0M
              --advertise-kafka-addr=$(POD_IP):9092
              --kafka-addr=$(POD_IP):9092
              --rpc-addr=$(POD_IP):33145
              --advertise-rpc-addr=$(POD_IP):33145
              --default-log-level=error
              --blocked-reactor-notify-ms=200
              --abort-on-seastar-bad-alloc
              --logger-log-level=seastar_memory=trace
              --max-networking-io-control-blocks=30000
          ports:
            - containerPort: 9644
              name: admin
            - containerPort: 9092
              name: kafka
            - containerPort: 33145
              name: rpc
          volumeMounts:
            - name: datadir
              mountPath: /var/lib/redpanda/data
            - name: config
              mountPath: /etc/redpanda
          resources:
            limits:
              memory: 256Gi
            requests:
              cpu: 32
              memory: 256Gi

      volumes:
        - name: datadir
          persistentVolumeClaim:
            claimName: datadir
        - name: redpanda
          configMap:
            name: redpanda
        - name: config
          emptyDir: {}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: redacted
                    app.kubernetes.io/instance: "redacted"
      priorityClassName: solidio-localdisk
  volumeClaimTemplates:
    - metadata:
        name: datadir
        labels:
          app.kubernetes.io/name: redacted
          app.kubernetes.io/instance: "redacted"
          app.kubernetes.io/component: redacted
          env: prod

      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "static-full-disk-xfs"
        resources:
          requests:
            storage: "300Gi"
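
For readers unfamiliar with the shell idiom in the init container above: the node_id is derived from the StatefulSet pod ordinal via POSIX parameter expansion. A standalone sketch (the pod name here is a hypothetical example):

# The Downward API injects the pod name, e.g. "redpanda-7" for ordinal 7 (hypothetical).
SERVICE_NAME=redpanda-7
# "${SERVICE_NAME##*-}" strips the longest prefix ending in "-", leaving just the ordinal.
NODE_ID=${SERVICE_NAME##*-}
echo "$NODE_ID"   # prints: 7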

We're using the Python confluent-kafka library (librdkafka-based).

What went wrong?

After running stably for weeks, we encountered this assert three times (with different values for fallocation_offset and committed_offset) on one broker, followed immediately by that broker crashing:

Assert failure: (../../../src/v/storage/segment_appender.cc:507) 'false' Could not dma_write: std::__1::system_error (error system:5, Input/output error) {no_of_chunks:64, closed:0, fallocation_offset:33554432, committed_offset:11748024, bytes_flush_pending:0}

The root cause of this write failure is not known. We suspect it was caused by an issue on the host system, not within the Redpanda container.
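
For what it's worth, a host-side EIO would normally show up in the host kernel log; a sketch of how one might check (the device name and date are assumptions, not something we captured):

# Scan the host kernel log for block-layer or XFS errors around the crash window.
dmesg -T | grep -Ei 'xfs|i/o error|blk_update_request'
# Or, with journald on the host (device /dev/sdb is a placeholder):
journalctl -k --since "2022-11-20" | grep -Ei 'sdb|xfs'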

Kubernetes restarted the pod using the same PV, but the broker was unable to recover and began crashlooping. Each crash was caused by the following assert/callstack:

ERROR 2022-11-20 03:33:13,754782 [shard 6 seq 1] assert - Assert failure: (../../../src/v/raft/vote_stm.cc:278) '_ptr->_confirmed_term == _ptr->_term' successfully replicated configuration should update _confirmed_term=-9223372036854775808 to be equal to _term=43
ERROR 2022-11-20 03:33:13,754867 [shard 6 seq 2] assert - Backtrace below:
0x4d0b484 0x1ea0e59 0x4abc1df 0x4abfeb7 0x4b033b5 0x4a5d19f /opt/redpanda/lib/libpthread.so.0+0x8608 /opt/redpanda/lib/libc.so.6+0x11f132
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, raft::vote_stm::update_vote_state(seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::__1::chrono::steady_clock>)::$_6, seastar::future<void> seastar::future<std::__1::error_code>::then_impl_nrvo<raft::vote_stm::update_vote_state(seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::__1::chrono::steady_clock>)::$_6, seastar::future<void> >(raft::vote_stm::update_vote_state(seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::__1::chrono::steady_clock>)::$_6&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, raft::vote_stm::update_vote_state(seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::__1::chrono::steady_clock>)::$_6&, seastar::future_state<std::__1::error_code>&&), std::__1::error_code>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, raft::consensus::dispatch_vote(bool)::$_11::operator()() const::'lambda'(bool)::operator()(bool)::'lambda'(seastar::future<void>), seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, raft::consensus::dispatch_vote(bool)::$_11::operator()() const::'lambda'(bool)::operator()(bool)::'lambda'(seastar::future<void>)>(raft::consensus::dispatch_vote(bool)::$_11::operator()() const::'lambda'(bool)::operator()(bool)::'lambda'(seastar::future<void>)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, raft::consensus::dispatch_vote(bool)::$_11::operator()() const::'lambda'(bool)::operator()(bool)::'lambda'(seastar::future<void>)&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<raft::consensus::dispatch_vote(bool)::$_11::operator()() const::'lambda'(), false>, seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<raft::consensus::dispatch_vote(bool)::$_11::operator()() const::'lambda'(), false> >(seastar::future<void>::finally_body<raft::consensus::dispatch_vote(bool)::$_11::operator()() const::'lambda'(), false>&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<raft::consensus::dispatch_vote(bool)::$_11::operator()() const::'lambda'(), false>&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<auto seastar::internal::invoke_func_with_gate<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(), false>, seastar::futurize<raft::consensus::dispatch_vote(bool)::$_11>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<auto seastar::internal::invoke_func_with_gate<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(), false> >(seastar::future<void>::finally_body<auto seastar::internal::invoke_func_with_gate<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(), false>&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<auto seastar::internal::invoke_func_with_gate<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(), false>&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::abort_requested_exception const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&), seastar::futurize<raft::consensus::dispatch_vote(bool)::$_11>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::abort_requested_exception const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)>(seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::abort_requested_exception const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::abort_requested_exception const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::gate_closed_exception const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&), seastar::futurize<raft::consensus::dispatch_vote(bool)::$_11>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::gate_closed_exception const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)>(seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::gate_closed_exception const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::gate_closed_exception const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::broken_semaphore const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&), seastar::futurize<raft::consensus::dispatch_vote(bool)::$_11>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::broken_semaphore const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)>(seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::broken_semaphore const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::broken_semaphore const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&), seastar::futurize<raft::consensus::dispatch_vote(bool)::$_11>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)>(seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::dispatch_vote(bool)::$_11>(seastar::gate&, raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::dispatch_vote(bool)::$_11&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_11&&)&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception<raft::consensus::dispatch_vote(bool)::$_51>(raft::consensus::dispatch_vote(bool)::$_51&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_51&&), seastar::futurize<raft::consensus::dispatch_vote(bool)::$_51>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception<raft::consensus::dispatch_vote(bool)::$_51>(raft::consensus::dispatch_vote(bool)::$_51&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_51&&)>(seastar::future<void> seastar::future<void>::handle_exception<raft::consensus::dispatch_vote(bool)::$_51>(raft::consensus::dispatch_vote(bool)::$_51&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_51&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception<raft::consensus::dispatch_vote(bool)::$_51>(raft::consensus::dispatch_vote(bool)::$_51&&)::'lambda'(raft::consensus::dispatch_vote(bool)::$_51&&)&, seastar::future_state<seastar::internal::monostate>&&), void>

I was able to stop the crashlooping by putting the broker into maintenance mode. It was then able to join the cluster and remain a "healthy" member. We have not tried disabling maintenance mode on that broker.
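
For reference, putting a broker into maintenance mode with rpk looks roughly like the following (node ID 7 is a placeholder):

# Drain partition leadership away from the broker and block it from taking new leadership.
rpk cluster maintenance enable 7
# Watch drain progress and the maintenance state across the cluster.
rpk cluster maintenance status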

What should have happened instead?

After the write failure and the pod restart, Redpanda should have been able to recover, and the broker should have rejoined the cluster correctly.

How to reproduce the issue?

Unknown.

Additional information

All metrics for that broker, up to the point of the initial crash, were in line with the other brokers in the cluster. There is a Slack thread with some discussion.

JIRA Link: CORE-1095

chris-kimberley commented 1 year ago

I'm really interested to know why the broker was unable to recover, and what impact disabling maintenance mode on it might have.

jcsp commented 1 year ago

SolidIO

Can you say more about what this is? I googled "SolidIO", "SolidIO storage", etc., and couldn't find any references at all.

chris-kimberley commented 1 year ago

As it turns out, that's the name of an internally developed persistent-disk management system we use within Kubernetes (I thought it was external). It just handles PV lifetimes and such. You can ignore it.

chris-kimberley commented 1 year ago

We made an attempt to bring the broker out of maintenance mode. We can see from metrics that leadership of partitions was transferred to it, but it started hitting the same assert again, and we had to put it back into maintenance mode to prevent further impact.