Open XuHuaiyu opened 5 months ago
The error log does not seem to indicate a panic originating inside TiDB; we suspect it is a bug in Go. There is an open issue on Go's GitHub about a "traceback stuck" fatal error triggered on newer Go versions, but it has not reached a conclusion yet: https://github.com/golang/go/issues/62086. We currently suspect that our issue is related and is triggered by CPU profiling.
The user cluster currently has the conprof feature disabled, but the topsql feature is still enabled, and it also calls the CPU profile interface. We have decided to add some debug logs and patch them into the user cluster to determine whether the issue is related to topsql.
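For context on why disabling conprof alone does not stop profiling: when Top SQL is enabled, TiDB's util/cpuprofile package keeps a background goroutine (the parallelCPUProfiler.profilingLoop visible in the goroutine dump below) that periodically starts and stops the Go CPU profiler. While a profile window is active, the runtime delivers SIGPROF and runtime.sigprof/gentraceback walk the interrupted goroutine's stack, which is exactly the code path that hits the "traceback stuck" throw in the traces below. A minimal sketch of such a loop, assuming a hypothetical one-second window and illustrative debug logging (this is not TiDB's actual implementation):

```go
// Minimal sketch (assumed structure, not TiDB's actual parallelCPUProfiler)
// of a background loop that keeps CPU profiling running in fixed windows.
// While a window is active, the Go runtime samples stacks via SIGPROF, so
// sampling happens even if no external profiler is attached.
package main

import (
	"bytes"
	"log"
	"runtime/pprof"
	"time"
)

func profilingLoop(stop <-chan struct{}) {
	const window = time.Second // hypothetical window length
	ticker := time.NewTicker(window)
	defer ticker.Stop()

	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		log.Printf("[debug] start cpu profile failed: %v", err)
		return
	}
	log.Printf("[debug] cpu profile window started") // the kind of debug log mentioned above

	for {
		select {
		case <-stop:
			pprof.StopCPUProfile()
			log.Printf("[debug] cpu profiling stopped")
			return
		case <-ticker.C:
			// Rotate the window: stop the current profile, hand buf to the
			// consumers (omitted here), then start the next window.
			pprof.StopCPUProfile()
			buf.Reset()
			if err := pprof.StartCPUProfile(&buf); err != nil {
				log.Printf("[debug] restart cpu profile failed: %v", err)
				return
			}
			log.Printf("[debug] cpu profile window rotated")
		}
	}
}

func main() {
	stop := make(chan struct{})
	go profilingLoop(stop)
	time.Sleep(3 * time.Second) // let a few windows elapse
	close(stop)
	time.Sleep(100 * time.Millisecond)
}
```

If logging of this kind is patched into the user cluster, the last debug line before a "traceback stuck" crash should show whether a profile window was active, which would confirm or rule out the Top SQL path.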
Another occurrence in another cluster:
runtime: traceback stuck. pc=0x20c9dc2 sp=0x424def7fb0
stack: frame={sp:0x424def7fb0, fp:0x424def7fb0} stack=[0x424def4000,0x424def8000)
0x000000424def7eb0: 0x0001000000000001 0x000000424a69f6e0
0x000000424def7ec0: 0x00000040fdf80280 0x000000000522c1c0
0x000000424def7ed0: 0x000000424a69f6e0 0x000000424ac97f08
0x000000424def7ee0: 0x0000000001e6a550 <google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x0000000000000060> 0x000000424a6fc060
0x000000424def7ef0: 0x0000000000000000 0x000000424a6fc240
0x000000424def7f00: 0x0000000000000000 0x0000000000000000
0x000000424def7f10: 0x0000000001e73a74 <google.golang.org/grpc/internal/transport.newHTTP2Client.func6+0x0000000000000064> 0x000000424a69f6d0
0x000000424def7f20: 0x00000000043e8301 0x00000040fdf80280
0x000000424def7f30: 0x0000000000000000 0x0100004248e00d70
0x000000424def7f40: 0x0000000000000000 0x0000000000000000
0x000000424def7f50: 0x0000000001e6a790 <google.golang.org/grpc/internal/transport.(*loopyWriter).run.func1+0x0000000000000000> 0x000000424def7f40
0x000000424def7f60: 0x000000424def7f50 0x0000000000000000
0x000000424def7f70: 0x00000000019a9674 <runtime.goexit+0x0000000000000004> 0x000000424a6bf980
0x000000424def7f80: 0x0000004248e00d10 0x000000424a69f6d0
0x000000424def7f90: 0x0000000000000000 0x00000000052550f0
0x000000424def7fa0: 0x0000004000462030 0x000000000274b4a0 <github.com/pingcap/tidb/util.(*WaitGroupWrapper).RunWithRecover.func1.1+0x0000000000000000>
0x000000424def7fb0: >0x00000000020c9dc2 <github.com/pingcap/kvproto/pkg/pdpb.(*pDClient).Tso+0x00000000000000e2> 0x0000004247a38960
0x000000424def7fc0: 0x0000000000000012 0x0000000000000000
0x000000424def7fd0: 0x0000000000000000 0x0000000000000000
0x000000424def7fe0: 0x0000000000000000 0x0000000000000000
0x000000424def7ff0: 0x0000000000000000 0x0000000000000000
fatal error: traceback stuck
goroutine 0 [idle]:
runtime.throw({0x49f3896?, 0x72855c0?})
/usr/local/go/src/runtime/panic.go:1047 +0x40 fp=0x4001a7c5e0 sp=0x4001a7c5b0 pc=0x1973630
runtime.gentraceback(0x79917e0?, 0x0?, 0x4001a7c9a8?, 0x40019271e0, 0x0, 0x4001a7c9a8, 0x40, 0x0, 0x0?, 0x6)
/usr/local/go/src/runtime/traceback.go:499 +0xeb4 fp=0x4001a7c940 sp=0x4001a7c5e0 pc=0x199b844
runtime.sigprof(0x1959fb4, 0x4001a7cc48?, 0x0?, 0x40019271e0, 0x4000fa8800)
/usr/local/go/src/runtime/proc.go:4577 +0xe8 fp=0x4001a7cbb0 sp=0x4001a7c940 pc=0x197fe48
runtime.sighandler(0x1b?, 0x4000fa8800?, 0x4001a7cc98?, 0x40019271e0?)
/usr/local/go/src/runtime/signal_unix.go:629 +0x590 fp=0x4001a7cc60 sp=0x4001a7cbb0 pc=0x198b3b0
runtime.sigtrampgo(0x1b, 0x4001a7cda0, 0x4001a7ce20)
/usr/local/go/src/runtime/signal_unix.go:479 +0x164 fp=0x4001a7cce0 sp=0x4001a7cc60 pc=0x198aa24
runtime.sigtramp()
/usr/local/go/src/runtime/sys_linux_arm64.s:471 +0x58 fp=0x4001a7cda0 sp=0x4001a7cce0 pc=0x19aa8d8
TiDB v6.5.4, Go 1.19.12, ARM; Top SQL is enabled.
goroutine 337 [select, 1 minutes]:
runtime.gopark(0x40016f0f20?, 0x3?, 0xd8?, 0xd?, 0x40016f0efa?)
/usr/local/go/src/runtime/proc.go:363 +0xe4 fp=0x40016f0da0 sp=0x40016f0d80 pc=0x19762b4
runtime.selectgo(0x40016f0f20, 0x40016f0ef4, 0x0?, 0x0, 0x3?, 0x1)
/usr/local/go/src/runtime/select.go:328 +0x6b4 fp=0x40016f0eb0 sp=0x40016f0da0 pc=0x19876a4
github.com/pingcap/tidb/util/cpuprofile.(*parallelCPUProfiler).profilingLoop(0x400061e1e0)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/cpuprofile/cpuprofile.go:164 +0xcc fp=0x40016f0f60 sp=0x40016f0eb0 pc=0x33de22c
github.com/pingcap/tidb/util/cpuprofile.(*parallelCPUProfiler).profilingLoop-fm()
<autogenerated>:1 +0x28 fp=0x40016f0f80 sp=0x40016f0f60 pc=0x33dfdb8
github.com/pingcap/tidb/util.WithRecovery(0x0?, 0x4000d90360?)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/misc.go:96 +0x50 fp=0x40016f0fb0 sp=0x40016f0f80 pc=0x2743440
github.com/pingcap/tidb/util/cpuprofile.(*parallelCPUProfiler).start.func1()
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/cpuprofile/cpuprofile.go:111 +0x2c fp=0x40016f0fd0 sp=0x40016f0fb0 pc=0x33ddc8c
runtime.goexit()
/usr/local/go/src/runtime/asm_arm64.s:1172 +0x4 fp=0x40016f0fd0 sp=0x40016f0fd0 pc=0x19a9674
created by github.com/pingcap/tidb/util/cpuprofile.(*parallelCPUProfiler).start
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/cpuprofile/cpuprofile.go:111 +0x25c
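One more note on the goroutine dump: the profiling goroutine is wrapped by util.WithRecovery (and the first trace shows util.(*WaitGroupWrapper).RunWithRecover), but such wrappers cannot help here. "fatal error: traceback stuck" is thrown by the Go runtime itself, and fatal runtime errors are not recoverable with recover(), so the whole process exits, which is the restart the user observes. A small self-contained illustration, using a deliberately triggered deadlock as a stand-in for the unrecoverable fatal error (not TiDB code):

```go
// Demonstrates that recover() catches ordinary panics but not fatal errors
// thrown by the Go runtime ("fatal error: ..."), which is why a
// WithRecovery-style wrapper cannot stop the process from exiting on
// "fatal error: traceback stuck".
package main

import "fmt"

func withRecovery(fn func()) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	fn()
}

func main() {
	// An ordinary panic is recovered and execution continues.
	withRecovery(func() { panic("ordinary panic") })

	// A fatal runtime error is not a panic: recover() never runs and the
	// whole process dies. Blocking every goroutine triggers
	// "fatal error: all goroutines are asleep - deadlock!" here, standing in
	// for the unrecoverable "fatal error: traceback stuck" in this report.
	withRecovery(func() {
		ch := make(chan struct{})
		<-ch // nothing ever sends; the runtime detects the deadlock and aborts
	})

	fmt.Println("never reached")
}
```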
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
No reproduction steps are known.
2. What did you expect to see? (Required)
No unexpected tidb-server restarts.
3. What did you see instead (Required)
On March 19 and June 3, a tidb-server restarted abnormally on two different physical machines within the same user cluster. The business load was normal at the time, and there were no anomalies in tidb.log; however, abnormal information appears in tidb_err.log. A portion of tidb_err.log is included in this report, and the full file is available in the attached document. This cluster runs on ARM machines, but out of the hundreds of ARM-based TiDB clusters the user operates, only this one has experienced restarts.
4. What is your TiDB version? (Required)
v6.5.2