Closed HarryLeeIBM closed 2 months ago
Could you please try 19 or main branch?
@EugeneZelenko, I tried clang-19. After fixing some compile errors it builds, but the issue still happens. I also tried clang-18, with no luck either.
Updates:
Tried to turn off optimization of the file src/Processors/Executors/PipelineExecutor.cpp by adding #pragma clang optimize off after the #include lines and adding #pragma clang optimize on at the end of the file; the issue disappears. So the optimization issue happens in this file.
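For anyone who wants to repeat this bisecting step on another file, here is a minimal sketch of the pragma placement. The function `suspect_sum` is a hypothetical stand-in, not code from PipelineExecutor.cpp, and the `__clang__` guards are only there so other compilers ignore the pragmas.

```cpp
#include <cstdint>

#if defined(__clang__)
#pragma clang optimize off   // functions defined after this point are compiled without optimization
#endif

// Hypothetical stand-in for the code in the suspect translation unit.
std::uint64_t suspect_sum(const std::uint64_t * data, std::uint64_t n)
{
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

#if defined(__clang__)
#pragma clang optimize on    // go back to the optimization level given on the command line
#endif
```

Moving the off/on pair around individual functions narrows the search further, which is roughly how a single function such as executeStepImpl can be isolated.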
@llvm/issue-subscribers-backend-powerpc
Author: Harry Lee (HarryLeeIBM)
Hi Harry, I'm having a look into this issue. I built ClickHouse with Clang18 on PPC64LE, however I didn't see the segmentation fault as expected. Instead, I got the following NETWORK_ERROR. What am I missing? I don't know much about ClickHouse, could you please help me point out my problem? Thanks!
$ ./clickhouse client
ClickHouse client version 24.8.1.1 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 24.8.1.
Warnings:
* Linux is not using a fast clock source. Performance can be degraded. Check /sys/devices/system/clocksource/clocksource0/current_clocksource
ilum.aus.stglabs.ibm.com :)
ilum.aus.stglabs.ibm.com :) select if(in(dummy, tuple(0, 1)), 'ok', 'ok') from remote('localhost', system.one) settings legacy_column_name_of_tuple_literal=1, prefer_localhost_replica=0;
SELECT if(dummy IN (0, 1), 'ok', 'ok')
FROM remote('localhost', system.one)
SETTINGS legacy_column_name_of_tuple_literal = 1, prefer_localhost_replica = 0
Query id: 215ce427-7c26-42cb-a202-84946251ecb9
Error on processing query: Code: 32. DB::Exception: Attempt to read after eof: while receiving packet from localhost:9000. (ATTEMPT_TO_READ_AFTER_EOF) (version 24.8.1.1 (official build))
Connecting to localhost:9000 as user default.
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)
Oh well, ignore my previous comment. I just noticed the segmentation fault on the server side.
Confirmed with:
src/Processors/Executors/PipelineExecutor.cpp
The case is good.
Will check further...
The error is caused by a wrong machineLICM for an xxlxor instruction.
MachineLICM is not wrong. It is just a trigger for the error.
.LBB35_9: # %if.end
# in Loop: Header=BB35_8 Depth=2
mr 3, 27
# xxlxor 63, 63, 63 #bad
bl _ZN2DB22ExecutionThreadContext11executeTaskEv
nop
#xxlxor 63, 63, 63 #good
andi. 3, 3, 1
bc 12, 1, .LBB35_14
b .LBB35_10
xxlxor 63, 63, 63 is the instruction machineLICM hoists. It is hoisted to the entry block of the function _ZN2DB16PipelineExecutor15executeStepImplEmPNSt3__16atomicIbEE.
I narrowed it down to the code before/after bl _ZN2DB22ExecutionThreadContext11executeTaskEv inside _ZN2DB16PipelineExecutor15executeStepImplEmPNSt3__16atomicIbEE. If xxlxor 63, 63, 63 is put before bl _ZN2DB22ExecutionThreadContext11executeTaskEv, the binary crashes. I confirmed that vs63 is changed after the call to _ZN2DB22ExecutionThreadContext11executeTaskEv, which is wrong because vs63 is a callee-saved register. I further narrowed it down to some function, but have not yet found which instruction changes vs63 without it being restored in the function's epilogue.
Need more time to investigate.
@HarryLeeIBM I may be away for a day or two for some downstream work; I will get back to this after that.
@HarryLeeIBM Hi, this turns out to be a source code issue. We need to handle the callee-saved (CSR) vector registers, and maybe the floating-point CSRs too, in ClickHouse/contrib/boost/libs/context/src/asm for the PPC target. Otherwise the vector CSRs will not be restored after calls to the assembly functions in this directory, such as jump_fcontext and ontop_fcontext.
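To make the failure mode concrete, here is a toy model of it in plain C++. This is a sketch under assumptions, not the Boost.Context code: `Registers`, `SavedContext`, `switch_context` and the `handle_vectors` flag are all hypothetical names, and one GPR plus one vector register stand in for the full register file. The point is that the other context is left in the middle of a function, before any epilogue has restored vs63, so the switch routine itself has to save and restore it.

```cpp
#include <array>
#include <cstdint>

// Toy register file: one callee-saved GPR and one callee-saved vector register.
struct Registers
{
    std::uint64_t r14 = 0;
    std::array<std::uint64_t, 2> vs63{};   // 128 bits
};

// What a context switch stores for a suspended context (hypothetical layout).
struct SavedContext
{
    std::uint64_t r14 = 0;
    std::array<std::uint64_t, 2> vs63{};
};

// Suspend `from`, resume `to`. With handle_vectors == false this mimics a
// switch routine that only knows about the GPRs.
void switch_context(Registers & regs, SavedContext & from, const SavedContext & to, bool handle_vectors)
{
    from.r14 = regs.r14;
    regs.r14 = to.r14;
    if (handle_vectors)
    {
        from.vs63 = regs.vs63;
        regs.vs63 = to.vs63;
    }
}

// Returns true if the caller still sees its own vs63 after a round trip
// through another context that uses vs63 for its own work.
bool vs63_survives_round_trip(bool handle_vectors)
{
    Registers regs;
    SavedContext caller;
    SavedContext fiber;

    regs.vs63 = {0, 0};                                    // caller: the hoisted xxlxor 63, 63, 63
    switch_context(regs, caller, fiber, handle_vectors);   // jump into the other context
    regs.vs63 = {0xdead, 0xbeef};                          // other context clobbers vs63 mid-function
    switch_context(regs, fiber, caller, handle_vectors);   // jump back before any epilogue runs
    return regs.vs63 == std::array<std::uint64_t, 2>{0, 0};
}
```

In this model `vs63_survives_round_trip(false)` reports the clobber and `vs63_survives_round_trip(true)` does not, which matches the observation that hoisting the xxlxor above the call exposes the bug while leaving it after the call hides it.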
The hack below makes the case pass. It only handles vs63, which is the register the compiler allocated here; a full solution would handle all vector CSRs:
$ pwd
/ClickHouse/contrib/boost
$ git diff
diff --git a/libs/context/src/asm/jump_ppc64_sysv_elf_gas.S b/libs/context/src/asm/jump_ppc64_sysv_elf_gas.S
index 28907db32..f3b7a230b 100644
--- a/libs/context/src/asm/jump_ppc64_sysv_elf_gas.S
+++ b/libs/context/src/asm/jump_ppc64_sysv_elf_gas.S
@@ -97,7 +97,7 @@ jump_fcontext:
# endif
#endif
# reserve space on stack
- subi %r1, %r1, 184
+ subi %r1, %r1, 200
#if _CALL_ELF != 2
std %r2, 0(%r1) # save TOC
@@ -133,6 +133,10 @@ jump_fcontext:
# save LR as PC
std %r0, 176(%r1)
+ # save VS63
+ li %r31, 184
+ stvx %v31, %r1, %r31
+
# store RSP (pointing to context-data) in R6
mr %r6, %r1
@@ -145,6 +149,11 @@ jump_fcontext:
ld %r2, 0(%r1) # restore TOC
#endif
+
+ # restore VS63
+ li %r31, 184
+ lvx %v31, %r1, %r31
+
ld %r14, 8(%r1) # restore R14
ld %r15, 16(%r1) # restore R15
ld %r16, 24(%r1) # restore R16
@@ -180,7 +189,7 @@ jump_fcontext:
mtctr %r12
# adjust stack
- addi %r1, %r1, 184
+ addi %r1, %r1, 200
#if _CALL_ELF == 2
# copy transfer_t into transfer_fn arg registers
diff --git a/libs/context/src/asm/ontop_ppc64_sysv_elf_gas.S b/libs/context/src/asm/ontop_ppc64_sysv_elf_gas.S
index cd97f4567..f8954edcf 100644
--- a/libs/context/src/asm/ontop_ppc64_sysv_elf_gas.S
+++ b/libs/context/src/asm/ontop_ppc64_sysv_elf_gas.S
@@ -97,7 +97,7 @@ ontop_fcontext:
# endif
#endif
# reserve space on stack
- subi %r1, %r1, 184
+ subi %r1, %r1, 200
#if _CALL_ELF != 2
std %r2, 0(%r1) # save TOC
@@ -133,6 +133,10 @@ ontop_fcontext:
# save LR as PC
std %r0, 176(%r1)
+ # save VS63
+ li %r31, 184
+ stvx %v31, %r1, %r31
+
# store RSP (pointing to context-data) in R7
mr %r7, %r1
@@ -144,6 +148,10 @@ ontop_fcontext:
mr %r1, %r4
#endif
+ # restore VS63
+ li %r31, 184
+ lvx %v31, %r1, %r31
+
ld %r14, 8(%r1) # restore R14
ld %r15, 16(%r1) # restore R15
ld %r16, 24(%r1) # restore R16
@@ -203,7 +211,7 @@ return_to_ctx:
mtlr %r0
# adjust stack
- addi %r1, %r1, 184
+ addi %r1, %r1, 200
# jump to context
bctr
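The frame arithmetic in the hack can be checked with a few lines of C++. The constants are taken from the diff; the helper names are hypothetical, and the alignment check assumes that %r1 is 16-byte aligned when the routine is entered, which the diff itself does not show. stvx and lvx ignore the low four bits of the effective address, so the slot only lines up with the 16 new bytes under that assumption.

```cpp
#include <cstdint>

constexpr std::uint64_t kOldFrameSize = 184;   // subi %r1, %r1, 184 (before the hack)
constexpr std::uint64_t kNewFrameSize = 200;   // subi %r1, %r1, 200 (after the hack)
constexpr std::uint64_t kVs63Offset   = 184;   // li %r31, 184
constexpr std::uint64_t kVectorSize   = 16;    // one 128-bit vector register

static_assert(kNewFrameSize == kOldFrameSize + kVectorSize, "frame grows by one vector slot");
static_assert(kVs63Offset == kOldFrameSize, "slot starts where the old frame ended");

// Address that stvx/lvx actually access: base + index with the low 4 bits cleared.
constexpr std::uint64_t vector_slot_address(std::uint64_t sp_on_entry)
{
    return (sp_on_entry - kNewFrameSize + kVs63Offset) & ~std::uint64_t{15};
}

// True if the 16-byte access stays inside the newly reserved bytes and does
// not overlap the slots of the old 184-byte layout.
constexpr bool slot_is_safe(std::uint64_t sp_on_entry)
{
    return vector_slot_address(sp_on_entry) >= sp_on_entry - kNewFrameSize + kOldFrameSize
        && vector_slot_address(sp_on_entry) + kVectorSize <= sp_on_entry;
}
```

With a 16-byte-aligned entry value such as 0x10000 the access covers frame offsets 184 to 199. With a misaligned value such as 0x10008 the truncated address would overlap the saved LR at offset 176, so a full fix would likely want to make the alignment explicit.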
If there is no objection, I am going to close this issue, as it is a source code issue. Feel free to reopen it if more info is needed.
I used Ubuntu clang version 18.1.4 to cross-compile ClickHouse (v24.7.x) for the PowerPC64le platform and found that ClickHouse crashes. When I build ClickHouse with the -O0 option it doesn't crash, so it could be a wrong-optimization issue.
Tried to turn off optimization of file src/Processors/Executors/PipelineExecutor.cpp by adding #pragma clang optimize off after #include lines and adding #pragma clang optimize on at the end of the file, issue disappears.
To reproduce the issue, use the following steps:
select if(in(dummy, tuple(0, 1)), 'ok', 'ok') from remote('localhost', system.one) settings legacy_column_name_of_tuple_literal=1, prefer_localhost_replica=0;
Then you will notice that the server crashes and a core dump is created. Analyzing the core dump gives the following stack trace: