pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0
37.4k stars 5.85k forks source link

unexpected restart of tidb-server #53958

Open XuHuaiyu opened 5 months ago

XuHuaiyu commented 5 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

No reproduce step

2. What did you expect to see? (Required)

No restart

3. What did you see instead (Required)

On March 19 and June 3, there were abnormal restarts of a tidb-server on different physical machines within the same user cluster. The business load was normal, and there were no anomalies in the tidb.log file. However, some abnormal information can be seen in the tidb_err.log file. A portion of the tidb_err.log is provided below, with the full file available in the attached document. This cluster uses ARM architecture machines, but out of the hundreds of ARM-based TiDB clusters the user has, only this particular cluster has experienced restarts.

runtime: traceback stuck. pc=0x3973c50 sp=0x426e7d9fb0
stack: frame={sp:0x426e7d9fb0, fp:0x426e7d9fb0} stack=[0x426e7d6000,0x426e7da000)
0x000000426e7d9eb0:  0x0001000000000001  0x000000419b6fefb0 
0x000000426e7d9ec0:  0x00000042931231b8  0x0000000004c76e60 
0x000000426e7d9ed0:  0x000000419b6fefb0  0x000000410b379f18 
0x000000426e7d9ee0:  0x0000000001d1d0f0 <google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x0000000000000060>  0x00000041d76d6540 
0x000000426e7d9ef0:  0x0000000000000000  0x00000041d76d6720 
0x000000426e7d9f00:  0x0000000000000000  0x0000000000000001 
0x000000426e7d9f10:  0x0000000001d261d4 <google.golang.org/grpc/internal/transport.newHTTP2Client.func3+0x0000000000000064>  0x000000419b6fefa0 
0x000000426e7d9f20:  0x0000000003f7cc01  0x00000042931231b8 
0x000000426e7d9f30:  0x0000000003973c1c <github.com/pingcap/tidb/executor.(*IndexLookUpExecutor).startTableWorker.func1+0x00000000000000bc>  0x010000415dc8ebd0 
0x000000426e7d9f40:  0x0000000000000000  0x0000000000000000 
0x000000426e7d9f50:  0x0000000001d1d330 <google.golang.org/grpc/internal/transport.(*loopyWriter).run.func1+0x0000000000000000>  0x000000426e7d9f40 
0x000000426e7d9f60:  0x000000426e7d9f50  0x0000000000000000 
0x000000426e7d9f70:  0x000000000187de34 <runtime.goexit+0x0000000000000004>  0x000000422bf8ad80 
0x000000426e7d9f80:  0x000000415dc8eb80  0x000000419b6fefa0 
0x000000426e7d9f90:  0x0000000000000000  0x0000004027712000 
0x000000426e7d9fa0:  0x00000041904ec4c0  0x000000402efd8460 
0x000000426e7d9fb0: >0x0000000003973c50 <github.com/pingcap/tidb/executor.(*IndexLookUpExecutor).startTableWorker.func1.1+0x0000000000000000>  0x00000042482f6000 
0x000000426e7d9fc0:  0x000000405ad3afb0  0x0000000000000000 
0x000000426e7d9fd0:  0x0000000000000000  0x0000000000000000 
0x000000426e7d9fe0:  0x0000000000000000  0x0000000000000000 
0x000000426e7d9ff0:  0x0000000000000000  0x0000000000000000 
fatal error: traceback stuck

goroutine 0 [idle]:
runtime.throw({0x453213d?, 0x6ab0800?})
    /usr/local/go/src/runtime/panic.go:1047 +0x40 fp=0x400062c5e0 sp=0x400062c5b0 pc=0x1847df0
runtime.gentraceback(0x71b3f20?, 0x0?, 0x400062c9a8?, 0x40000ccea0, 0x0, 0x400062c9a8, 0x40, 0x0, 0x0?, 0x6)
    /usr/local/go/src/runtime/traceback.go:499 +0xeb4 fp=0x400062c940 sp=0x400062c5e0 pc=0x1870004
runtime.sigprof(0x182eaec, 0x400062cc48?, 0x0?, 0x40000ccea0, 0x4000100c00)
    /usr/local/go/src/runtime/proc.go:4577 +0xe8 fp=0x400062cbb0 sp=0x400062c940 pc=0x1854608
runtime.sighandler(0x1b?, 0x4000100c00?, 0x400062cc98?, 0x40000ccea0?)
    /usr/local/go/src/runtime/signal_unix.go:629 +0x590 fp=0x400062cc60 sp=0x400062cbb0 pc=0x185fb70
runtime.sigtrampgo(0x1b, 0x400062cda0, 0x400062ce20)
    /usr/local/go/src/runtime/signal_unix.go:479 +0x164 fp=0x400062cce0 sp=0x400062cc60 pc=0x185f1e4
runtime.sigtramp()
    /usr/local/go/src/runtime/sys_linux_arm64.s:471 +0x58 fp=0x400062cda0 sp=0x400062cce0 pc=0x187f098

4. What is your TiDB version? (Required)

v6.5.2

XuHuaiyu commented 5 months ago

The error log does not seem to indicate a panic caused internally by TiDB; we suspect it is a bug in Golang. There is an issue on Golang's GitHub regarding a traceback stuck panic triggered by higher versions of Golang, but there is no conclusion yet: https://github.com/golang/go/issues/62086. We currently suspect that this issue is related and is caused by the CPU profiling.

XuHuaiyu commented 5 months ago

The user cluster has currently disabled the conprof feature, but the topsql feature is still enabled, which also calls the CPU profile interface. We have decided to add some debug logs and patch them into the user cluster to determine if the issue is related to topsql.

djshow832 commented 1 week ago

Another occurrence in another cluster:

runtime: traceback stuck. pc=0x20c9dc2 sp=0x424def7fb0
stack: frame={sp:0x424def7fb0, fp:0x424def7fb0} stack=[0x424def4000,0x424def8000)
0x000000424def7eb0:  0x0001000000000001  0x000000424a69f6e0 
0x000000424def7ec0:  0x00000040fdf80280  0x000000000522c1c0 
0x000000424def7ed0:  0x000000424a69f6e0  0x000000424ac97f08 
0x000000424def7ee0:  0x0000000001e6a550 <google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x0000000000000060>  0x000000424a6fc060 
0x000000424def7ef0:  0x0000000000000000  0x000000424a6fc240 
0x000000424def7f00:  0x0000000000000000  0x0000000000000000 
0x000000424def7f10:  0x0000000001e73a74 <google.golang.org/grpc/internal/transport.newHTTP2Client.func6+0x0000000000000064>  0x000000424a69f6d0 
0x000000424def7f20:  0x00000000043e8301  0x00000040fdf80280 
0x000000424def7f30:  0x0000000000000000  0x0100004248e00d70 
0x000000424def7f40:  0x0000000000000000  0x0000000000000000 
0x000000424def7f50:  0x0000000001e6a790 <google.golang.org/grpc/internal/transport.(*loopyWriter).run.func1+0x0000000000000000>  0x000000424def7f40 
0x000000424def7f60:  0x000000424def7f50  0x0000000000000000 
0x000000424def7f70:  0x00000000019a9674 <runtime.goexit+0x0000000000000004>  0x000000424a6bf980 
0x000000424def7f80:  0x0000004248e00d10  0x000000424a69f6d0 
0x000000424def7f90:  0x0000000000000000  0x00000000052550f0 
0x000000424def7fa0:  0x0000004000462030  0x000000000274b4a0 <github.com/pingcap/tidb/util.(*WaitGroupWrapper).RunWithRecover.func1.1+0x0000000000000000> 
0x000000424def7fb0: >0x00000000020c9dc2 <github.com/pingcap/kvproto/pkg/pdpb.(*pDClient).Tso+0x00000000000000e2>  0x0000004247a38960 
0x000000424def7fc0:  0x0000000000000012  0x0000000000000000 
0x000000424def7fd0:  0x0000000000000000  0x0000000000000000 
0x000000424def7fe0:  0x0000000000000000  0x0000000000000000 
0x000000424def7ff0:  0x0000000000000000  0x0000000000000000 
fatal error: traceback stuck

goroutine 0 [idle]:
runtime.throw({0x49f3896?, 0x72855c0?})
    /usr/local/go/src/runtime/panic.go:1047 +0x40 fp=0x4001a7c5e0 sp=0x4001a7c5b0 pc=0x1973630
runtime.gentraceback(0x79917e0?, 0x0?, 0x4001a7c9a8?, 0x40019271e0, 0x0, 0x4001a7c9a8, 0x40, 0x0, 0x0?, 0x6)
    /usr/local/go/src/runtime/traceback.go:499 +0xeb4 fp=0x4001a7c940 sp=0x4001a7c5e0 pc=0x199b844
runtime.sigprof(0x1959fb4, 0x4001a7cc48?, 0x0?, 0x40019271e0, 0x4000fa8800)
    /usr/local/go/src/runtime/proc.go:4577 +0xe8 fp=0x4001a7cbb0 sp=0x4001a7c940 pc=0x197fe48
runtime.sighandler(0x1b?, 0x4000fa8800?, 0x4001a7cc98?, 0x40019271e0?)
    /usr/local/go/src/runtime/signal_unix.go:629 +0x590 fp=0x4001a7cc60 sp=0x4001a7cbb0 pc=0x198b3b0
runtime.sigtrampgo(0x1b, 0x4001a7cda0, 0x4001a7ce20)
    /usr/local/go/src/runtime/signal_unix.go:479 +0x164 fp=0x4001a7cce0 sp=0x4001a7cc60 pc=0x198aa24
runtime.sigtramp()
    /usr/local/go/src/runtime/sys_linux_arm64.s:471 +0x58 fp=0x4001a7cda0 sp=0x4001a7cce0 pc=0x19aa8d8

TiDB v6.5.4, go 1.19.12, ARM

topSQL is enabled

goroutine 337 [select, 1 minutes]:
runtime.gopark(0x40016f0f20?, 0x3?, 0xd8?, 0xd?, 0x40016f0efa?)
    /usr/local/go/src/runtime/proc.go:363 +0xe4 fp=0x40016f0da0 sp=0x40016f0d80 pc=0x19762b4
runtime.selectgo(0x40016f0f20, 0x40016f0ef4, 0x0?, 0x0, 0x3?, 0x1)
    /usr/local/go/src/runtime/select.go:328 +0x6b4 fp=0x40016f0eb0 sp=0x40016f0da0 pc=0x19876a4
github.com/pingcap/tidb/util/cpuprofile.(*parallelCPUProfiler).profilingLoop(0x400061e1e0)
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/cpuprofile/cpuprofile.go:164 +0xcc fp=0x40016f0f60 sp=0x40016f0eb0 pc=0x33de22c
github.com/pingcap/tidb/util/cpuprofile.(*parallelCPUProfiler).profilingLoop-fm()
    <autogenerated>:1 +0x28 fp=0x40016f0f80 sp=0x40016f0f60 pc=0x33dfdb8
github.com/pingcap/tidb/util.WithRecovery(0x0?, 0x4000d90360?)
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/misc.go:96 +0x50 fp=0x40016f0fb0 sp=0x40016f0f80 pc=0x2743440
github.com/pingcap/tidb/util/cpuprofile.(*parallelCPUProfiler).start.func1()
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/cpuprofile/cpuprofile.go:111 +0x2c fp=0x40016f0fd0 sp=0x40016f0fb0 pc=0x33ddc8c
runtime.goexit()
    /usr/local/go/src/runtime/asm_arm64.s:1172 +0x4 fp=0x40016f0fd0 sp=0x40016f0fd0 pc=0x19a9674
created by github.com/pingcap/tidb/util/cpuprofile.(*parallelCPUProfiler).start
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/cpuprofile/cpuprofile.go:111 +0x25c