tikv / pd

Placement driver for TiKV
Apache License 2.0
1.06k stars 724 forks source link

tidb panic when injection pd leader io delay #8781

Closed Lily2025 closed 3 weeks ago

Lily2025 commented 3 weeks ago

Bug Report

What did you do?

1、run sysbench 2、add index for one table 3、injection pd leader io delay tidb-2024-11-04T11-37-04.520.log.zip

What did you expect to see?

no panic

What did you see instead?

tidb abnormally restart due to internal panic [2024/11/04 10:53:00.821 +08:00] [FATAL] [tso_stream.go:330] ["tsoStream.recvLoop internal panic"] [stacktrace="github.com/tikv/pd/client.(tsoStream).recvLoop.func1\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(CheckedEntry).Write\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(Logger).Panic\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/client.(tsoDispatcher).checkMonotonicity\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(tsoDispatcher).processRequests.func2\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(tsoStream).recvLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444"] [panicMessage="[tso] timestamp fallback"] [stack="github.com/tikv/pd/client.(tsoStream).recvLoop.func1\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(CheckedEntry).Write\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(Logger).Panic\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/client.(tsoDispatcher).checkMonotonicity\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(tsoDispatcher).processRequests.func2\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(tsoStream).recvLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444"] [2024/11/04 10:53:03.033 +08:00] [INFO] [meminfo.go:179] ["use cgroup memory hook because TiDB is in the container"] [2024/11/04 10:53:03.034 +08:00] [INFO] [printer.go:52] ["Welcome to TiDB."] ["Release Version"=v8.5.0-alpha-31-g241c10294b] [Edition=Community] ["Git Commit Hash"=241c10294b9c0305db590576700c00de306be573] ["Git Branch"=HEAD] ["UTC Build Time"="2024-11-01 16:29:12"] [GoVersion=go1.23.2] ["Race Enabled"=false] ["Check Table Before Drop"=false]

What version of PD are you using (pd-server -V)?

./pd-server -V Release Version: v8.5.0-alpha-3-g60b960a4 Edition: Community Git Commit Hash: 60b960a42341092038147c27dcf1f800b21a9a0d Git Branch: HEAD UTC Build Time: 2024-11-01 10:24:22 2024-11-04T10:31:24.388+0800

Lily2025 commented 3 weeks ago

/type bug /severity major /assign lhy1024

lhy1024 commented 3 weeks ago

After #8744, we support dynamic check tso service, which make startTSOJobsIfNeeded happen after stopTSOJobsIfNeeded

  1. tso sync failed and reset
    [2024/11/04 10:32:44.337 +08:00] [WARN] [tso.go:371] ["save timestamp failed"] [] [dc-location=global] [timestamp-path=timestamp] [error="context deadline exceeded"]
    [2024/11/04 10:32:44.337 +08:00] [WARN] [allocator_manager.go:781] ["failed to update allocator's timestamp"] [] [dc-location=global] [name=tc-pd-2] [error="context deadline exceeded"]
    [2024/11/04 10:32:44.337 +08:00] [INFO] [tso.go:437] ["reset the timestamp in memory"] []
  2. leader switch because keep alive failed and close service, I guess it also has run stopTSOJobsIfNeeded
    [2024/11/04 10:32:44.339 +08:00] [WARN] [lease.go:185] ["lease keep alive failed"] [purpose="leader election"] [start=2024/11/04 10:32:43.758 +08:00] [error="context canceled"]
    [2024/11/04 10:32:44.339 +08:00] [INFO] [lease.go:202] ["stop lease keep alive worker"] [purpose="leader election"]
    [2024/11/04 10:32:44.377 +08:00] [INFO] [server.go:1777] ["no longer a leader because lease has expired, pd leader will step down"]
  3. checkTSOService call startTSOJobsIfNeeded , although it was not leader
    [2024/11/04 10:32:49.350 +08:00] [INFO] [cluster.go:476] ["initializing the global TSO allocator"]
    [2024/11/04 10:32:49.352 +08:00] [INFO] [tso.go:161] ["start to sync timestamp"] []
    [2024/11/04 10:32:49.356 +08:00] [INFO] [tso.go:221] ["sync and save timestamp"] [] [last=2024/11/04 10:32:37.334 +08:00] [last-saved=0001/01/01 00:00:00.000 +00:00] [save=2024/11/04 10:32:52.355 +08:00] [next=2024/11/04 10:32:49.355 +08:00]
    [2024/11/04 10:32:49.359 +08:00] [INFO] [cluster.go:459] ["service check job is stopped"]
    [2024/11/04 10:33:19.423 +08:00] [INFO] [cluster.go:862] ["raft cluster is stopped"]
  4. it became leader again and no longer initialize tso until next sync
    [2024/11/04 10:53:00.806 +08:00] [INFO] [server.go:1768] ["PD leader is ready to serve"] [leader-name=tc-pd-2]
    ....
    [2024/11/04 10:53:00.834 +08:00] [WARN] [tso.go:335] ["clock offset"] [] [jet-lag=5m0.968119947s] [prev-physical=2024/11/04 10:47:59.866 +08:00] [now=2024/11/04 10:53:00.834 +08:00] [update-physical-interval=50ms]
  5. If client call tso will meet fallback
    [2024/11/04 10:53:00.821 +08:00] [PANIC] [tso_dispatcher.go:647] ["[tso] timestamp fallback"] [dc-location=global] [keyspace=4294967295] [last-ts="(1730688755816, 13414)"] [cur-ts="(1730688479866, 1)"] [last-tso-server=tc-pd-1.tc-pd-peer.endless-ha-test-add-index-tps-7651740-1-143.svc:2379] [cur-tso-server=tc-pd-2.tc-pd-peer.endless-ha-test-add-index-tps-7651740-1-143.svc:2379] [last-keyspace-group-in-request=0] [cur-keyspace-group-in-request=0] [last-keyspace-group-in-response=0] [cur-keyspace-group-in-response=0] [last-response-received-at=2024/11/04 10:52:40.175 +08:00] [cur-response-received-at=2024/11/04 10:53:00.821 +08:00] [last-stream-id=tc-pd-1.tc-pd-peer.endless-ha-test-add-index-tps-7651740-1-143.svc:2379-274] [cur-stream-id=tc-pd-2.tc-pd-peer.endless-ha-test-add-index-tps-7651740-1-143.svc:2379-522]
    [2024/11/04 10:53:00.821 +08:00] [FATAL] [tso_stream.go:330] ["tsoStream.recvLoop internal panic"] [stacktrace="github.com/tikv/pd/client.(*tsoStream).recvLoop.func1\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(*CheckedEntry).Write\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(*Logger).Panic\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(*tsoDispatcher).processRequests.func2\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(*tsoStream).recvLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444"] [panicMessage="[tso] timestamp fallback"] [stack="github.com/tikv/pd/client.(*tsoStream).recvLoop.func1\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(*CheckedEntry).Write\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(*Logger).Panic\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(*tsoDispatcher).processRequests.func2\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(*tsoStream).recvLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444"]

    last-ts 1730688479866 is 2024-11-04 10:52:35.816 and cur-ts 1730688479866 is 2024-11-04 10:47:59.866, which is the same with pd-server log

okJiang commented 3 weeks ago

Supplementary code screenshots for better understanding.

  1. Entered checkTSOService, at this point ctx is not yet Done. img_v3_02gd_4ab2c4f2-438d-4bc6-b17c-8ec2c8d1e52g
  2. The RaftCluster called Stop and canceled the context. img_v3_02gd_a50fe763-634b-4675-b61a-7e9224ee529g
  3. Stop called StopTSOJob.
  4. checkTSOService called StartTSOJob,

Caused TSO to not shut down properly.