pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0

TiFlash crash when scale out and run query #6646

Closed hehechen closed 5 months ago

hehechen commented 1 year ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Scale out 10 TiFlash nodes and run TPCH queries at the same time.

2. What did you expect to see? (Required)

Don't crash.

3. What did you see instead (Required)

Some TiFlash nodes crashed.

[2023/01/16 22:13:09.443 +00:00] [ERROR] [BaseDaemon.cpp:377] [########################################] [source=BaseDaemon] [thread_id=1025]
[2023/01/16 22:13:09.443 +00:00] [ERROR] [BaseDaemon.cpp:378] ["(from thread 1024) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=1025]
[2023/01/16 22:13:09.443 +00:00] [ERROR] [BaseDaemon.cpp:408] ["Address: 0x8"] [source=BaseDaemon] [thread_id=1025]
[2023/01/16 22:13:09.443 +00:00] [ERROR] [BaseDaemon.cpp:414] ["Access: read."] [source=BaseDaemon] [thread_id=1025]
[2023/01/16 22:13:09.443 +00:00] [ERROR] [BaseDaemon.cpp:423] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=1025]

4. What is your TiFlash version? (Required)

v6.5.0 41c08dbe20901f6cfd28ce642b39ce53f35ef48a

hehechen commented 1 year ago

crash_tiflash_log_14.tar.gz

JaySon-Huang commented 1 year ago

Weirdly, there is no stack frame output to the log file...

CalvinNeo commented 1 year ago

It is quite strange that there are no logs showing what thread_1024 and thread_1025 were doing before thread_1024 crashed.

hehechen commented 1 year ago

I think it's better to write thread_name here. https://github.com/pingcap/tiflash/blob/a17c0bb2cd8c47e4ef48b881548dafd234f5ed42/libs/libdaemon/src/BaseDaemon.cpp#L229

hehechen commented 1 year ago

I think it's better to write thread_name here.

https://github.com/pingcap/tiflash/blob/a17c0bb2cd8c47e4ef48b881548dafd234f5ed42/libs/libdaemon/src/BaseDaemon.cpp#L229

pthread_getname_np is not async-signal-safe, so we can't use it in a signal handler.
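One possible workaround, sketched below under stated assumptions: each thread caches its own name in normal context, and the signal handler only copies that cached buffer and forwards it with write(2), which is async-signal-safe. Every name here (`sketch`, `SignalReport`, `rememberThreadName`, `reportSignal`, `runDemo`) is hypothetical and not taken from BaseDaemon.cpp. The "(from thread N)" wording in the logs suggests the handler already forwards data to a separate logging thread, but that is an inference from the log output only. Reading a `thread_local` variable inside a handler is not formally guaranteed to be async-signal-safe, so treat that part as an assumption too. The demo uses SIGUSR1 instead of SIGSEGV so that the process survives.

```cpp
#include <cassert>
#include <cstddef>
#include <signal.h>
#include <string>
#include <unistd.h>

namespace sketch
{
// Linux limits thread names to 16 bytes including the terminating NUL.
constexpr std::size_t NAME_LEN = 16;

// Fixed-size record so the handler can emit it with a single write(2).
struct SignalReport
{
    int signo;
    char thread_name[NAME_LEN];
};

// Filled in normal (non-signal) context, e.g. right after a thread is named.
thread_local char cached_thread_name[NAME_LEN] = "unknown";
int report_pipe[2] = {-1, -1};

// Call from ordinary code, never from the handler.
void rememberThreadName(const char * name)
{
    std::size_t i = 0;
    for (; i + 1 < NAME_LEN && name[i] != '\0'; ++i)
        cached_thread_name[i] = name[i];
    cached_thread_name[i] = '\0';
}

// Handler body: plain loads and stores plus write(2).
// No allocation, no locks, no pthread_getname_np.
void reportSignal(int signo)
{
    SignalReport report{};
    report.signo = signo;
    for (std::size_t i = 0; i < NAME_LEN; ++i)
        report.thread_name[i] = cached_thread_name[i];
    ssize_t ignored = write(report_pipe[1], &report, sizeof(report));
    (void)ignored;
}

// Installs the handler, raises SIGUSR1 on the calling thread, and returns
// the thread name that came out of the pipe.
std::string runDemo()
{
    int rc = pipe(report_pipe);
    assert(rc == 0);
    (void)rc;

    struct sigaction sa = {};
    sa.sa_handler = reportSignal;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, nullptr);

    rememberThreadName("worker-1");
    raise(SIGUSR1); // the handler runs on this thread before raise() returns

    SignalReport received{};
    ssize_t n = read(report_pipe[0], &received, sizeof(received));
    assert(n == static_cast<ssize_t>(sizeof(received)));
    (void)n;

    close(report_pipe[0]);
    close(report_pipe[1]);
    return std::string(received.thread_name);
}
} // namespace sketch
```

In a real daemon the read side would live in a dedicated listener thread, which is free to format and log the name because it runs outside signal context.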

JaySon-Huang commented 1 year ago

I've got a coredump file with the same error.

[2023/01/19 02:27:07.847 +00:00] [ERROR] [BaseDaemon.cpp:377] [########################################] [source=BaseDaemon] [thread_id=1006]
[2023/01/19 02:27:07.848 +00:00] [ERROR] [BaseDaemon.cpp:378] ["(from thread 1005) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=1006]
[2023/01/19 02:27:07.848 +00:00] [ERROR] [BaseDaemon.cpp:408] ["Address: 0x8"] [source=BaseDaemon] [thread_id=1006]
[2023/01/19 02:27:07.848 +00:00] [ERROR] [BaseDaemon.cpp:414] ["Access: read."] [source=BaseDaemon] [thread_id=1006]
[2023/01/19 02:27:07.848 +00:00] [ERROR] [BaseDaemon.cpp:423] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=1006]

It seems the crash is caused by an error that occurs while libunwind is tracing the stack.

> 5-85 /data1/jaysonhuang/qa/flash_debug
>  LD_LIBRARY_PATH=. gdb  ./tiflash ./core.1
GNU gdb (GDB) 8.2
...
Reading symbols from ./tiflash...done.
BFD: warning: /data1/jaysonhuang/qa/flash_debug/./core.1 is truncated: expected core file size >= 359614476288, found: 1075888128

warning: core file may not match specified executable file.
...
Core was generated by `/tiflash/tiflash server --config-file /data0/config.toml'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  access_mem (as=<optimized out>, addr=8, val=0x7fe902a408c8, write=<optimized out>, arg=<optimized out>)
    at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/libunwind/src/x86_64/Ginit.c:330
330 /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/libunwind/src/x86_64/Ginit.c: No such file or directory.
[Current thread is 1 (LWP 7)]
(gdb) bt
#0  access_mem (as=<optimized out>, addr=8, val=0x7fe902a408c8, write=<optimized out>, arg=<optimized out>)
    at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/libunwind/src/x86_64/Ginit.c:330
Backtrace stopped: Cannot access memory at address 0x7fe902a405f8
(gdb) info threads
  Id   Target Id         Frame
* 1    LWP 7             access_mem (as=<optimized out>, addr=8, val=0x7fe902a408c8, write=<optimized out>, arg=<optimized out>)
    at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/libunwind/src/x86_64/Ginit.c:330
  2    LWP 25            0x00007fe903b40c89 in ?? ()
  3    LWP 29            0x00007fe903b40c89 in ?? ()
  4    LWP 28            0x00007fe908607f00 in ?? ()
  5    LWP 37            0x00007fe904745de2 in ?? ()
  6    LWP 22            0x00007fe903b46f43 in ?? ()
  7    LWP 61            0x00007fe904745de2 in ?? ()
  8    LWP 160           0x0000000006f727a0 in LZ4_decompress_generic (src=<optimized out>, dst=<optimized out>, srcSize=<optimized out>, outputSize=<optimized out>, partialDecoding=decode_full_block,
    dict=noDict, lowPrefix=<optimized out>, dictStart=0x0, dictSize=0) at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/lz4/lib/lz4.c:2060
  9    LWP 156           LZ4_decompress_generic (src=<optimized out>, dst=<optimized out>, srcSize=<optimized out>, outputSize=<optimized out>, partialDecoding=decode_full_block, dict=noDict,
    lowPrefix=<optimized out>, dictStart=0x0, dictSize=0) at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/lz4/lib/lz4.c:2000
>  ./tiflash version
TiFlash
Release Version: v6.5.0
Edition:         Community
Git Commit Hash: 41c08dbe20901f6cfd28ce642b39ce53f35ef48a
Git Branch:      heads/refs/tags/v6.5.0
UTC Build Time:  2022-12-21 12:03:40
Enable Features: jemalloc sm4(GmSSL) avx2 avx512 unwind thinlto
Profile:         RELWITHDEBINFO

Raft Proxy
Git Commit Hash:   ea48821d77b57a276ce3a1363de8875c07d96756
Git Commit Branch: HEAD
UTC Build Time:    2022-12-21 12:08:30
Rust Version:      rustc 1.67.0-nightly (96ddd32c4 2022-11-14)
Storage Engine:    tiflash
Prometheus Prefix: tiflash_proxy_
Profile:           release

Here is yet another error log for which the coredump file has no valid stack info:

[2023/01/19 06:55:08.127 +00:00] [ERROR] [BaseDaemon.cpp:377] [########################################] [source=BaseDaemon] [thread_id=996]
[2023/01/19 06:55:08.127 +00:00] [ERROR] [BaseDaemon.cpp:378] ["(from thread 968) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=996]
[2023/01/19 06:55:08.127 +00:00] [ERROR] [BaseDaemon.cpp:408] ["Address: 0x8"] [source=BaseDaemon] [thread_id=996]
[2023/01/19 06:55:08.127 +00:00] [ERROR] [BaseDaemon.cpp:414] ["Access: read."] [source=BaseDaemon] [thread_id=996]
[2023/01/19 06:55:08.127 +00:00] [ERROR] [BaseDaemon.cpp:423] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=996]

hehechen commented 1 year ago

Could it be caused by continuous profiling?

hehechen commented 1 year ago

Didn't reproduce after disabling continuous profiling.

yongman commented 1 year ago

Any updates?

JaySon-Huang commented 5 months ago

Closed, as it could not be reproduced for a long time.