Open SeanTAllen opened 3 years ago
The host.cc change makes no difference when we are building using musl rather than glibc. We still get segfaults in the debug build on Arm-64 when using musl.
The tests that fail seem to be repeatable from run to run. I've been working through the builtin tests commenting out the ones that crash and then another crashes but it is always one that wasn't run previously. I'm building LLVM on a 64 bit raspberry pi 4 to see if it is reproducible there or if we are in for a tricky time getting more info.
The changes are on the branch 3874
.
Ok this is reproducible with the same tests failing on Raspberry PI 4 64 bit.
Well this is fascinating...
(gdb) bt
#0 0x00000055559e98b0 in pony_test_TestHelper_val__format_loc_oo (this=0x7fe6d016a0, loc=0x0)
at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:362
#1 0x00000055559e4fc4 in pony_test_TestHelper_val_assert_error_ooob (this=0x7fddce1c40, test=0x555640a108, msg=0x5556410380, loc=0x55564077c0)
at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:95
#2 0x0000005555b49434 in builtin_test__TestArrayInsert_ref_apply_oo (this=0x7fe65001e0, h=0x7fddce1c40)
at /home/pi/code/ponylang/ponyc/packages/builtin_test/_test.pony:1512
#3 0x00000055557bef9c in pony_test__TestRunner_tag_run_o (this=0x7fe649dc00)
at /home/pi/code/ponylang/ponyc/packages/pony_test/_test_runner.pony:63
#4 0x0000005555764728 in pony_test__TestRunner_Dispatch ()
#5 0x0000005555c6993c in ponyint_actor_run ()
#6 0x0000005555c73d28 in run_thread ()
#7 0x0000007ff7fa67e4 in start_thread (arg=0x7ffffff16f) at pthread_create.c:486
#8 0x0000007ff7dd6adc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
@jemc this is one for you and i to talk about.
The backtrace looks identical to what we were getting on the M1. I can't remember the root cause, but maybe @jemc can remember.
Edit: some of these threads might help.
https://ponylang.zulipchat.com/#narrow/stream/275038-M1/topic/exception.20handling.20ABI.20bug
https://ponylang.zulipchat.com/#narrow/stream/275038-M1/topic/strip.20debug.20bug
Although on second thought, it seems like the problem with the backtrace above is that the location object is null. I'd look at the IR to check if we're getting any undef
references for that.
So it looks like:
#0 0x00000055559e98b0 in pony_test_TestHelper_val__format_loc_oo (this=0x7fe6d016a0, loc=0x0)
at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:362
#1 0x00000055559e4fc4 in pony_test_TestHelper_val_assert_error_ooob (this=0x7fddce1c40, test=0x555640a108, msg=0x5556410380, loc=0x55564077c0)
at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:95
We have the location in frame 1. But its null in frame 2.
fun assert_error(test: ITest box, msg: String = "", loc: SourceLoc = __loc)
: Bool
=>
"""
Assert that the given test function throws an error when run.
"""
try
test()?
fail(_format_loc(loc) + "Assert error failed. " + msg)
false
else
log(_format_loc(loc) + "Assert error passed. " + msg, true)
true
end
Its a straight passthrough. I suspect that the reason that having some optimization causes the crash to go away is that we inline the very small _format_loc
method and so whatever our weirdness is for loc
being null goes away. But I have no actual evidence of that yet, its purely a guess. Really thought the important question is "how the hell does loc end up null".
Here's _format_loc
for edification:
fun _format_loc(loc: SourceLoc): String =>
loc.file() + ":" + loc.line().string() + ": "
I have determined that the failure happens with a more minimal example as well:
use "pony_test"
use "collections"
actor \nodoc\ Main is TestList
new create(env: Env) => PonyTest(env, this)
new make() => None
fun tag tests(test: PonyTest) =>
test(_TestArrayFind)
class \nodoc\ iso _TestArrayFind is UnitTest
fun name(): String => "builtin/Array.find"
fun apply(h: TestHelper) =>
let a = [as ISize: 0; 1; 2; 3; 4; 1]
h.assert_error({() ? => a.find(6)? })
And changing so that a.find
won't error doesn't result in a crash:
use "pony_test"
use "collections"
actor \nodoc\ Main is TestList
new create(env: Env) => PonyTest(env, this)
new make() => None
fun tag tests(test: PonyTest) =>
test(_TestArrayFind)
class \nodoc\ iso _TestArrayFind is UnitTest
fun name(): String => "builtin/Array.find"
fun apply(h: TestHelper) =>
let a = [as ISize: 0; 1; 2; 3; 4; 1]
h.assert_error({() ? => a.find(1)? })
Linting the llvm with --lint-llvm
results in:
Pessimization: Static alloca outside of entry block
%22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
%22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
%22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
%22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
%22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
%22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
%22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
%22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
%22 = alloca i8*, align 8
but this happens on x86 where there is no issue as well.
--verify
reports no issues.
Here's a more minimal example:
actor Main
new create(env: Env) =>
foo("hello")
fun foo(s: String) =>
try
one(s)
error
else
one(s)
end
fun one(s: String) =>
s.clone()
@ergl could you see if this fails on an M1? The key seems to be to use a release runtime but do -d
when compiling the program.
@SeanTAllen No errors for me. No combination of release/debug runtime and setting -d
or leaving it off had the issue for the two examples you posted.
Interesting so this might be aarch64 + linux issue.
Note that if it's like the M1 issue, then the problem goes away if you link the program against the runtime bitcode (which I still think should be the default build flow where available).
@ergl can you try on m1 with the host.cc change removed? so that debug is doing no optimization?
@SeanTAllen and I think this is related to us needing to write some ARM64-specific code in posix_except.c
. We currently have ARM32-specific code, and then catch-all code for everything else.
Here's a PDF that includes some info on the ARM64 Exception Handling ABI: https://documentation-service.arm.com/static/5fa190a7b1a7c5445f2904d6?token=
That PDF links to this doc, which seems to have the more relevant information: https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html
@sylvanc says he is believes that yes, the current usage of catch all code is probably wrong for arm exceptions.
but when i told him that aarch64 docs say that it is the same as itanium. he said itanium is the same as x86 except perhaps some low level register setting.
@SeanTAllen - try this diff:
diff --git a/src/libponyc/codegen/codegen.c b/src/libponyc/codegen/codegen.c
index 6a972f88..147435b5 100644
--- a/src/libponyc/codegen/codegen.c
+++ b/src/libponyc/codegen/codegen.c
@@ -265,6 +265,7 @@ static void init_runtime(compile_t* c)
unsigned int align_value = target_is_ilp32(c->opt->triple) ? 4 : 8;
+ LLVM_DECLARE_ATTRIBUTEREF(uwtable_attr, uwtable, 0);
LLVM_DECLARE_ATTRIBUTEREF(nounwind_attr, nounwind, 0);
LLVM_DECLARE_ATTRIBUTEREF(readnone_attr, readnone, 0);
LLVM_DECLARE_ATTRIBUTEREF(readonly_attr, readonly, 0);
@@ -606,7 +607,9 @@ static void init_runtime(compile_t* c)
// i32 ponyint_personality_v0(...)
type = LLVMFunctionType(c->i32, NULL, 0, true);
- c->personality = LLVMAddFunction(c->module, "ponyint_personality_v0", type);
+ value = LLVMAddFunction(c->module, "ponyint_personality_v0", type);
+ c->personality = value;
+ LLVMAddAttributeAtIndex(value, LLVMAttributeFunctionIndex, uwtable_attr);
// i32 memcmp(i8*, i8*, intptr)
params[0] = c->void_ptr;
@jemc same boom
Interestingly, this also segfaults because of the message send that originates in the else
of the try
:
actor Main
let _out: OutStream
new create(env: Env) =>
_out = env.out
foo("hello")
fun foo(s: String) =>
try
one("h")
error
else
one("e")
end
fun one(s: String) =>
_out.print(s)
but this does not:
actor Main
let _out: OutStream
new create(env: Env) =>
_out = env.out
foo("hello")
fun foo(s: String) =>
try
one("h")
error
else
one("e")
end
fun one(s: String) =>
s.clone()
I discussed this with @sylvanc. He took a look at the pony_try llvm ir that we use and said it looked correct and the personality function looked correct as well. He suggested looking at a C++ aarch64 personality function to see if it was different in a way we weren't aware of.
I'll be taking a look at the one from the LLVM project:
https://github.com/llvm/llvm-project/blob/main/libcxxabi/src/cxa_personality.cpp
I've found a difference in our personality function from the c++ abi.
When we set registers, we do not do what the C++ does:
_Unwind_SetGR(context, __builtin_eh_return_data_regno(1),
static_cast<uintptr_t>(results.ttypeIndex));
It is setting context to the type index found from the lsda scan.
Whereas we do:
_Unwind_SetGR(context, __builtin_eh_return_data_regno(1), 0);
Additionally there are differences in ordering in the personality functions which might be because we need the scan results for use elsewhere or simply different way of writing the code. The pony code is clearer, but maybe not correct in addition to the register issue.
So we get the ttypeIndex from:
/// Read a sleb128 encoded value and advance pointer
/// See Variable Length Data Appendix C in:
/// @link http://dwarfstd.org/Dwarf4.pdf @unlink
/// @param data reference variable holding memory pointer to decode from
/// @returns decoded value
static
intptr_t
readSLEB128(const uint8_t** data)
plus there's additional logic for deciding it to actually set that in results that I haven't fully worked through.
I changed this to "triggers release" as this is actually incorrect exception handling on aarch64.
In case anything here is helpful: https://llvm.org/docs/ExceptionHandling.html#exception-handling-support-on-the-target
I talked through this with Sylvan and we came to the conclusion that this is a probably a bug in LLVM like I initially thought. In particular a bug in the aarch64 code that happens to get fixed by some optimization. Because the LLVM ir is the same for x86 debug and aarch64 debug besides the platform things you'd expect to be different and we aren't doing anything special for aarch64 vs x86 so it isn't a bug in what we are doing AND as joe has seen, if you create bitcode and link with the llvm tools, the bug also goes away so... bug somewhere in llvm for aarch64.
I've been thinking about this issue and something new occurs to me.
Why does this only happen with a release/optimized version of the runtime? Why is it fine with a debug runtime + debug pony binary? That seems very odd. I have a feeling that perhaps there's something there that will help us find the issue.
I retested this and it looks like it effects both debug and release versions of the runtime, which was either overlooked when i first opened this issue or is new. Either way, it makes me feel a little better that it isn't only happening with release versions of the runtime.
While getting pony working on 64-bit Arm, we found a bug where the
test-stdlib-debug
tests would segfault. Everything worked fine withrelease
builds, butdebug
builds would crash.After investigation, we've found that in
codegen_machine
inhost.cc
if theopt_level
is get to eitherCodeGenOpt::Default
orCodeGenOpt::Aggressive
instead ofCodeGenOpt::None
the problem goes away.It seems likely that this is an LLVM bug, but that hasn't been established. We've committed a fix in
host.cc
that when the target isarm
(which is 64-bit arm) that we useCodeGenOpt::Default
. This is far from ideal, but gets us working compilations until we can investigate further.The end result of this should either be a patch that we upstream to LLVM or possibly, the discovery that we are doing something wrong with our Arm code generation that optimization manages to fix.