pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

cdc panic when kafka sink rolling restart #9023

Closed (fubinzh closed this issue 1 year ago)

fubinzh commented 1 year ago

What did you do?

  1. Deploy a TiDB cluster with 3 TiCDC nodes (32 cores, 64 GB memory each).
  2. Create 2 Kafka changefeeds: one for a single big table, the other for 4k small tables. The lag is normal for both changefeeds.
  3. Perform a rolling restart of the Kafka sink (3 instances).

What did you expect to see?

TiCDC should not panic.

What did you see instead?

TiCDC panicked:

[root@bogon bigCluster]# kubectl  --kubeconfig kubeconfig.yml -n cdc-kafka-big-cluster-tps-1712340-1-428 logs -p tc-ticdc-2
[WARN] TiCDC server data-dir is not set. Please use `cdc server --data-dir` to start the cdc server if possible.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x12b2609]

goroutine 74332 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc0c18b9140, 0x3)
        github.com/Shopify/sarama@v1.36.0/async_producer.go:620 +0x1a9
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc0c18b9140)
        github.com/Shopify/sarama@v1.36.0/async_producer.go:564 +0x537
github.com/Shopify/sarama.withRecover(0xc1ea43e580?)
        github.com/Shopify/sarama@v1.36.0/utils.go:43 +0x3e
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
        github.com/Shopify/sarama@v1.36.0/async_producer.go:515 +0x1ea

Versions of the cluster

cdc version: https://github.com/pingcap/tiflow/pull/9010

Release Version: v7.1.0
Git Commit Hash: 9b1497c7fba1d290443011f1d7d1e4305a125e1d
Git Branch: heads/refs/tags/v7.1.0
UTC Build Time: 2023-05-22 10:16:54
Go Version: go version go1.20.3 linux/amd64
Failpoint Build: false

asddongmen commented 1 year ago

This is a sarama bug: https://github.com/Shopify/sarama/issues/2322. It was fixed by https://github.com/Shopify/sarama/commit/237925756e46ac948e583fceb81e8738c9f1e04b. We need to bump cdc's sarama dependency to 1.38.1 to avoid this issue.
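
As a rough sketch of the bump described above, the relevant go.mod requirement might look like the fragment below. The module path and the old version v1.36.0 are taken from the stack trace in this issue, and v1.38.1 from this comment. The exact layout of tiflow's go.mod, and whether it also needs a replace directive, is an assumption that has not been checked here.

```text
// go.mod (fragment only; all other requirements omitted)
require github.com/Shopify/sarama v1.38.1 // previously v1.36.0, per the stack trace
```

After changing the requirement, running `go mod tidy` would normally be needed to refresh go.sum.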

nongfushanquan commented 1 year ago

/severity critical

nongfushanquan commented 1 year ago

/remove-label affects-6.1