vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Segfault in Kafka source (rd_kafka_toppar_set_fetch_state) when using client.rack option #8750

Open fpytloun opened 3 years ago

fpytloun commented 3 years ago

Vector Version

0.15.1-debian Docker image

Vector Configuration File

    [sources.in_kafka_access]
      type = "kafka"
      bootstrap_servers = "kafka-0:9093"
      group_id = "vector_access"
      topics = ["^fluentd.(log)_.*$"]
      librdkafka_options."client.id" = "${HOSTNAME}.region1"
      librdkafka_options."client.rack" = "region1"
      librdkafka_options."group.instance.id" = "${HOSTNAME}.region1"
      # Prefer roundrobin balancing to spread load more evenly
      librdkafka_options."partition.assignment.strategy" = "roundrobin,range"
      topic_key = "_topic"
      partition_key = "_partition"
      offset_key = "_offset"

      tls.enabled = true
      tls.ca_file = "/secrets/identity/server_ca.crt"
      tls.crt_file = "/secrets/identity/client.crt"
      tls.key_file = "/secrets/identity/client.key"

Kafka is deployed in two regions (3 nodes per region). There are 120+ topics in total (400+ partitions). There is nothing useful in the Kafka logs, just information that the consumer failed and was removed from the group.

Running Kafka version 2.8.0

Debug Output

*** /cargo/registry/src/github.com-1ecc6299db9ec823/rdkafka-sys-4.0.0+1.6.1/librdkafka/src/rdkafka_partition.c:346:rd_kafka_toppar_set_fetch_state: assert: thrd_is_current(rktp->rktp_rkt->rkt_rk->rk_thread) ***

Neither debug output nor RUST_BACKTRACE gives any additional info :-(

This happens every time, 10-15 minutes after startup. Until then, Kafka messages are consumed properly.

Expected Behavior

Actual Behavior

*** /cargo/registry/src/github.com-1ecc6299db9ec823/rdkafka-sys-4.0.0+1.6.1/librdkafka/src/rdkafka_partition.c:346:rd_kafka_toppar_set_fetch_state: assert: thrd_is_current(rktp->rktp_rkt->rkt_rk->rk_thread) ***

fpytloun commented 3 years ago

I found out that this issue does not happen if I disable rack awareness by removing the following line: librdkafka_options."client.rack" = "region1"
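
For reference, a minimal sketch of the workaround, assuming the rest of the source configuration stays exactly as posted above (the key and TLS settings are omitted here only for brevity). The only change is that client.rack is commented out. Without it the consumer presumably no longer prefers rack-local replicas, so cross-region fetch traffic may increase:

    [sources.in_kafka_access]
      type = "kafka"
      bootstrap_servers = "kafka-0:9093"
      group_id = "vector_access"
      topics = ["^fluentd.(log)_.*$"]
      librdkafka_options."client.id" = "${HOSTNAME}.region1"
      # Workaround: rack awareness disabled; the assert was not seen without this option
      # librdkafka_options."client.rack" = "region1"
      librdkafka_options."group.instance.id" = "${HOSTNAME}.region1"
      librdkafka_options."partition.assignment.strategy" = "roundrobin,range"

This only sidesteps the crash; it does not explain why the assert in rd_kafka_toppar_set_fetch_state fires when client.rack is set.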

jszwedko commented 2 years ago

Related issue in librdkafka: https://github.com/edenhill/librdkafka/issues/3569

fpytloun commented 2 years ago

Interesting comment upstream - https://github.com/edenhill/librdkafka/issues/3569#issuecomment-1300050112

Can we check how committing offsets is handled in Vector? Could this be the same issue, given that enable.auto.commit is enabled here: https://github.com/vectordotdev/vector/blob/30706de871a16c039d2e3249fb7e71ba9597b3cd/src/sources/kafka.rs#L482 versus how offsets are actually committed?