terralang / terra

Terra is a low-level system programming language that is embedded in and meta-programmed by the Lua programming language.
terralang.org
Other
2.72k stars 201 forks source link

AArch64 Linux test failures on LLVM 12 and newer #597

Open elliottslaughter opened 2 years ago

elliottslaughter commented 2 years ago

I've seen this pop up a couple of times, so I'm documenting it here so people know about it. On ARM (AArch64), LLVM versions >= 12 seem to experience higher test failure rates. I've seen this happen on both Linux (Graviton, NVIDIA Jetson) ~and macOS (Apple M1)~, so it seems to be a feature of ARM processors ~and not of a specific OS~.

Test pass rate on LLVM 14.0.0:

98% tests passed, 13 tests failed out of 549

Total Test time (real) =  30.15 sec

The following tests FAILED:
     65 - class.t (Failed)
     70 - class6.t (Failed)
     90 - coverage3.t (Failed)
    121 - defercond.t (SEGFAULT)
    209 - let2.t (SEGFAULT)
    343 - simpleglobal.t (Failed)
    396 - teststd.t (SEGFAULT)
    412 - vars.t (SEGFAULT)
    413 - vars2.t (SEGFAULT)
    415 - vecarith.t (SEGFAULT)
    417 - vecobj.t (Failed)
    424 - zeroreturn.t (SEGFAULT)
    425 - zeroreturn2.t (SEGFAULT)
Errors while running CTest

Test pass rate on LLVM 11.1.0:

99% tests passed, 4 tests failed out of 549

Total Test time (real) =  28.59 sec

The following tests FAILED:
     65 - class.t (Failed)
     70 - class6.t (Failed)
     90 - coverage3.t (Failed)
    417 - vecobj.t (Failed)
Errors while running CTest

This is with dcd2eff87224d878c6adb33f1b69453ccc0a4ddf on the following machine:

$ uname -a
Linux gcc80 4.9.201-tegra #1 SMP PREEMPT Fri Jul 9 08:56:59 PDT 2021 aarch64 aarch64 aarch64 GNU/Linux

$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           4
Vendor ID:           Nvidia
Model:               0
Model name:          ARMv8 Processor rev 0 (v8l)
Stepping:            0x0
CPU max MHz:         2265.6001
CPU min MHz:         115.2000
BogoMIPS:            62.50
L1d cache:           64K
L1i cache:           128K
L2 cache:            2048K
L3 cache:            4096K
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp

My best guess at a root cause is #485, because LLVM 12 is the version where we switched back to MCJIT after ORCv1 was removed. However, it's possible something else changed in LLVM that is breaking something, and we haven't accounted for it yet.

Right now the best workaround is to stick to LLVM 11 on ARM platforms.

shoe42 commented 2 years ago

Not sure how useful this is, but here's some quick tests (using 23559ea), macOS on M1. Exporting both SDKROOT and INCLUDE_PATH, using /usr/bin/clang++ and LLVM 11 through 14 from Homebrew, I get the following:

Total Test time (real) = 16.96 sec

The following tests FAILED: 27 - atomicrmw.t (Subprocess aborted) 46 - bug372.t (SEGFAULT) 48 - bug372b.t (SEGFAULT) 49 - bug372c.t (SEGFAULT) 60 - cconv_array.t (Failed) 65 - class.t (SEGFAULT) 67 - class3.t (SEGFAULT) 272 - fakeasm.t (Failed) 423 - printfarray.t (SEGFAULT) 533 - vararg.t (Subprocess aborted)

- LLVM 12

98% tests passed, 10 tests failed out of 549

Total Test time (real) = 16.90 sec

The following tests FAILED: 27 - atomicrmw.t (SEGFAULT) 46 - bug372.t (SEGFAULT) 48 - bug372b.t (SEGFAULT) 49 - bug372c.t (SEGFAULT) 60 - cconv_array.t (Failed) 65 - class.t (SEGFAULT) 67 - class3.t (SEGFAULT) 272 - fakeasm.t (Failed) 423 - printfarray.t (SEGFAULT) 533 - vararg.t (Subprocess aborted)


- LLVM 13

98% tests passed, 11 tests failed out of 549

Total Test time (real) = 16.96 sec

The following tests FAILED: 27 - atomicrmw.t (Subprocess aborted) 46 - bug372.t (SEGFAULT) 48 - bug372b.t (SEGFAULT) 49 - bug372c.t (SEGFAULT) 50 - bug372d.t (SEGFAULT) 60 - cconv_array.t (Failed) 65 - class.t (SEGFAULT) 67 - class3.t (SEGFAULT) 272 - fakeasm.t (Failed) 423 - printfarray.t (SEGFAULT) 533 - vararg.t (Subprocess aborted)


- LLVM 14

98% tests passed, 11 tests failed out of 549

Total Test time (real) = 15.97 sec

The following tests FAILED: 27 - atomicrmw.t (SEGFAULT) 46 - bug372.t (SEGFAULT) 48 - bug372b.t (SEGFAULT) 49 - bug372c.t (SEGFAULT) 51 - bug4.t (SEGFAULT) 60 - cconv_array.t (Failed) 65 - class.t (SEGFAULT) 67 - class3.t (SEGFAULT) 272 - fakeasm.t (Failed) 423 - printfarray.t (SEGFAULT) 533 - vararg.t (SEGFAULT)

elliottslaughter commented 2 years ago

Interesting, thanks. I guess the issue identified in the title here is really a Linux problem, so that means the issue on Apple M1 is really somewhere else.

Thanks for taking the time to test this.

shoe42 commented 2 years ago

Here's some more (hopefully useful) information gleaned from running each failing test individually in lldb. As before, using commit 23559ea, macOS on M1, and LLVM 13 from Homebrew.

elliottslaughter commented 2 years ago

cconv_array.t seems to be straight up a case of something being not yet implemented in LuaJIT. I don't think there's anything we can do for that one except turn the test off for when we're on M1 macOS.

Based on the others, it looks like a lot of them are getting caught in strlen, which perhaps indicates an issue with memory layout. But beyond that it's not obvious to me what the issue is. Maybe LLVM is doing an optimization that breaks type punning, but that's a random guess.

This is a shot in the dark, but what happens when you change {fastmath=false} to false in:

https://github.com/terralang/terra/blob/0cf6be68edccc8501991d768cc9b939982d78194/src/terralib.lua#L834

That should completely shut off optimizations in the JIT.

For these tests, I think what it will take to make progress is to minimize each test: i.e., delete lines from the test until the the failure behavior changes. Once we've got the smallest version of each test that fails with the same behavior as the original, then we can (hopefully) see what specifically it is about that test that is causing trouble, and then form a hypothesis.

shoe42 commented 2 years ago

Replacing {fastmath=false} to just false seems to have no effect - looks like the same tests are failing. Backtraces for the strlen issues (in the below example, specifically running bug372d.t) traces to a point outside of any loaded sections:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xfff9000000000000)
  * frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
    frame #1: 0x00000001a30436a0 libsystem_c.dylib`__vfprintf + 4544
    frame #2: 0x00000001a30531d8 libsystem_c.dylib`vfprintf_l + 68
    frame #3: 0x00000001a3070f38 libsystem_c.dylib`printf + 80
    frame #4: 0x00000001095e80f8
(lldb) image lookup -va 0x00000001095e80f8
(lldb)

Looks like the code at frame 4 might be generated from Terra's JIT. Setting a breakpoint on the address shows it's at least written to by llvm::RuntimeDyldImpl::emitSection:

(lldb) w s e 0x00000001095e80f8
Watchpoint created: Watchpoint 1: addr = 0x1095e80f8 size = 8 state = enabled type = w
(lldb) c
Process 62850 resuming

...

Process 62850 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT (code=258, subcode=0x1095e80f0)
  * frame #0: 0x00000001a3197184 libsystem_platform.dylib`_platform_memmove + 180
    frame #1: 0x0000000101491784 terra`llvm::RuntimeDyldImpl::emitSection(llvm::object::ObjectFile const&, llvm::object::SectionRef const&, bool) + 756
    frame #2: 0x0000000101490a08 terra`llvm::RuntimeDyldImpl::findOrEmitSection(llvm::object::ObjectFile const&, llvm::object::SectionRef const&, bool, std::__1::map<llvm::object::SectionRef, unsigned int, std::__1::less<llvm::object::SectionRef>, std::__1::allocator<std::__1::pair<llvm::object::SectionRef const, unsigned int> > >&) + 120
    frame #3: 0x000000010148f704 terra`llvm::RuntimeDyldImpl::loadObjectImpl(llvm::object::ObjectFile const&) + 1952
    frame #4: 0x00000001014a4570 terra`llvm::RuntimeDyldMachO::loadObject(llvm::object::ObjectFile const&) + 52
    frame #5: 0x0000000101493824 terra`llvm::RuntimeDyld::loadObject(llvm::object::ObjectFile const&) + 800
    frame #6: 0x000000010141e43c terra`llvm::MCJIT::generateCodeForModule(llvm::Module*) + 208
    frame #7: 0x000000010141f128 terra`llvm::MCJIT::findSymbol(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) + 544
    frame #8: 0x000000010141ed98 terra`llvm::MCJIT::getSymbolAddress(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) + 124
    frame #9: 0x000000010141f3ac terra`llvm::MCJIT::getGlobalValueAddress(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 52
    frame #10: 0x0000000100050c14 terra`JITGlobalValue(CU=0x0000600002c04100, gv=0x0000600002c10d88) at tcompiler.cpp:3812:24
    frame #11: 0x0000000100009574 terra`terra_jit(L=0x000000010a800380) at tcompiler.cpp:3820:17
    frame #12: 0x0000000103070388 terra`lj_BC_FUNCC + 44
    frame #13: 0x0000000103017ddc terra`lua_pcall(L=<unavailable>, nargs=<unavailable>, nresults=<unavailable>, errfunc=<unavailable>) at lj_api.c:1145:12 [opt]
    frame #14: 0x00000001000052dc terra`docall(L=0x000000010a800380, narg=0, clear=0) at main.cpp:339:14
    frame #15: 0x0000000100004e84 terra`main(argc=2, argv=0x000000016fdfee90) at main.cpp:119:13
    frame #16: 0x000000010936108c dyld`start + 520
(lldb) w delete 1
1 watchpoints deleted.
(lldb) c
Process 62850 resuming

...

Expect this ouput:
Process 62850 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xfff9000000000000)
  * frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
    frame #1: 0x00000001a30436a0 libsystem_c.dylib`__vfprintf + 4544
    frame #2: 0x00000001a30531d8 libsystem_c.dylib`vfprintf_l + 68
    frame #3: 0x00000001a3070f38 libsystem_c.dylib`printf + 80
    frame #4: 0x00000001095e80f8
elliottslaughter commented 2 years ago

I think bug372.t would be the best version of this test to debug, assuming it still fails, since it's the shortest.

Unfortunately, debug info does not seem to be functioning on ARM. (You can try with terra -g bug372.t, but don't get your hopes up.)

So I think the next best thing we can do is "hardhat" debugging, which in this case probably is going to involve printf on all of the pointers in the program.

shoe42 commented 2 years ago

Not sure if relevant, but e.g. in bug372d.t, dumping the following addresses:

C.printf("Config:               0x%0.16lx\n", [long](&(config)))
C.printf("xBCLeftInflowProfile: 0x%0.16lx\n", [long](&(config.BC.xBCLeftInflowProfile.u.File.FileDir)))
C.printf("xBCLeftHeat:          0x%0.16lx\n", [long](&(config.BC.xBCLeftHeat.u.File.FileDir)))

Gives the following output:

Config:               0xfff9000000000000
xBCLeftInflowProfile: 0xfff9000000000000
xBCLeftHeat:          0xfff9000000000000

I'm far from even remotely familiar with Terra and its syntax, but is this expected? Shouldn't the FileDirs be offset into the Config struct?

elliottslaughter commented 2 years ago

Is a long 32 bits on M1? It may simply be that you're throwing away the relevant bits of the address.

Better to use %p and pass the pointer:

C.printf("Config:               0x%0.16p\n", &(config))
C.printf("xBCLeftInflowProfile: 0x%0.16p\n", &(config.BC.xBCLeftInflowProfile.u.File.FileDir))
C.printf("xBCLeftHeat:          0x%0.16p\n", &(config.BC.xBCLeftHeat.u.File.FileDir))
shoe42 commented 2 years ago

Using %p still outputs just 0x0xfff9000000000000. Something doesn't seem right - clang reports long to be 8 bytes.

elliottslaughter commented 2 years ago

Ok, well there is definitely something wrong because those pointers should be different addresses. E.g. here's output on x86_64:

Config:               0x0x00007ff7b331cd08
xBCLeftInflowProfile: 0x0x00007ff7b331cf40
xBCLeftHeat:          0x0x00007ff7b331cd0c

Maybe try this test? This should show us the LLVM IR before and after optimizations, plus the assembly, so maybe we'll spot something in here.

local C = terralib.includecstring
[[
#include "stdbool.h"
#include "stdint.h"
#include "stdlib.h"
#include "stdio.h"
#include "string.h"

typedef int TempProfile;
#define TempProfile_File 0

typedef int InflowProfile;
#define InflowProfile_File 0
#define InflowProfile_SuctionAndBlowing 1

struct Config {
  struct  {
    struct  {
      int32_t type;
      union  {
        struct  {
          int8_t FileDir[256];
        } File;
      } u;
    } xBCLeftHeat;
    struct  {
      int32_t type;
      union  {
        struct  {
          double addedVelocity;
          int8_t FileDir[256];
        } File;
        struct  {
          double sigma;
          struct  {
            uint32_t length;
            double values[10];
          } beta;
          double Zw;
          struct  {
            uint32_t length;
            double values[10];
          } A;
          struct  {
            uint32_t length;
            double values[10];
          } omega;
        } SuctionAndBlowing;
      } u;
    } yBCLeftInflowProfile;
    struct  {
      int32_t type;
      union  {
        struct  {
          double addedVelocity;
          int8_t FileDir[256];
        } File;
        struct  {
          double sigma;
          struct  {
            uint32_t length;
            double values[10];
          } beta;
          double Zw;
          struct  {
            uint32_t length;
            double values[10];
          } A;
          struct  {
            uint32_t length;
            double values[10];
          } omega;
        } SuctionAndBlowing;
      } u;
    } xBCLeftInflowProfile;
  } BC;
};
]]

local Config = C.Config

terra main()
  var config : Config
  C.printf("Config:               0x%0.16p\n", &(config))
  C.printf("xBCLeftInflowProfile: 0x%0.16p\n", &(config.BC.xBCLeftInflowProfile.u.File.FileDir))
  C.printf("xBCLeftHeat:          0x%0.16p\n", &(config.BC.xBCLeftHeat.u.File.FileDir))
end
print(terralib.saveobj(nil, "llvmir", {main=main}, nil, nil, false))
main:disas()
shoe42 commented 2 years ago

Output from the above:

define dso_local void @"$main"() {
entry:
  %config = alloca %struct.Config, align 8
  %0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string", i64 0, i64 0), %struct.Config* nonnull %config)
  %1 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 2, i32 1, i32 0, i32 1
  %2 = bitcast %struct.anon.8* %1 to [256 x i8]*
  %3 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.1", i64 0, i64 0), [256 x i8]* nonnull %2)
  %4 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 0, i32 1, i32 0, i32 0
  %5 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.2", i64 0, i64 0), [256 x i8]* nonnull %4)
  ret void
}
; ModuleID = 'terra'
source_filename = "terra"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-darwin21.6.0"

%struct.Config = type { %struct.anon.1 }
%struct.anon.1 = type { %struct.anon.2, %struct.anon.4, %struct.anon.4 }
%struct.anon.2 = type { i32, %union.anon }
%union.anon = type { %struct.anon.3 }
%struct.anon.3 = type { [256 x i8] }
%struct.anon.4 = type { i32, %union.anon.5 }
%union.anon.5 = type { %struct.anon.7 }
%struct.anon.7 = type { double, %struct.anon.8, double, %struct.anon.8, %struct.anon.8 }
%struct.anon.8 = type { i32, [10 x double] }
%struct.anon.6 = type { double, [256 x i8] }

@"$string" = private unnamed_addr constant [32 x i8] c"Config:               0x%0.16p\0A\00", align 1
@"$string.1" = private unnamed_addr constant [32 x i8] c"xBCLeftInflowProfile: 0x%0.16p\0A\00", align 1
@"$string.2" = private unnamed_addr constant [32 x i8] c"xBCLeftHeat:          0x%0.16p\0A\00", align 1

define dso_local void @main() {
entry:
  %config = alloca %struct.Config, align 8
  %0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string", i32 0, i32 0), %struct.Config* %config)
  %1 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
  %2 = getelementptr %struct.anon.1, %struct.anon.1* %1, i32 0, i32 2
  %3 = getelementptr %struct.anon.4, %struct.anon.4* %2, i32 0, i32 1
  %4 = getelementptr %union.anon.5, %union.anon.5* %3, i32 0, i32 0
  %5 = bitcast %struct.anon.7* %4 to %struct.anon.6*
  %6 = getelementptr %struct.anon.6, %struct.anon.6* %5, i32 0, i32 1
  %7 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)
  %8 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
  %9 = getelementptr %struct.anon.1, %struct.anon.1* %8, i32 0, i32 0
  %10 = getelementptr %struct.anon.2, %struct.anon.2* %9, i32 0, i32 1
  %11 = getelementptr %union.anon, %union.anon* %10, i32 0, i32 0
  %12 = getelementptr %struct.anon.3, %struct.anon.3* %11, i32 0, i32 0
  %13 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.2", i32 0, i32 0), [256 x i8]* %12)
  ret void
}

declare dso_local i32 @printf(i8*, ...)

definition  {} -> {}
assembly for function at address 0x10d674000
0x10d674000(+0):        stp x28, x27, [sp, #-48]!
0x10d674004(+4):        stp x20, x19, [sp, #16]
0x10d674008(+8):        stp x29, x30, [sp, #32]
0x10d67400c(+12):       sub sp, sp, #848
0x10d674010(+16):       adrp    x0, #0
0x10d674014(+20):       ldr x0, [x0, #120]
0x10d674018(+24):       adrp    x19, #0
0x10d67401c(+28):       ldr x19, [x19, #112]
0x10d674020(+32):       add x20, sp, #8
0x10d674024(+36):       add x1, sp, #8
0x10d674028(+40):       blr x19
0x10d67402c(+44):       add x1, x20, #568
0x10d674030(+48):       adrp    x0, #0
0x10d674034(+52):       ldr x0, [x0, #104]
0x10d674038(+56):       blr x19
0x10d67403c(+60):       orr x1, x20, #0x4
0x10d674040(+64):       adrp    x0, #0
0x10d674044(+68):       ldr x0, [x0, #96]
0x10d674048(+72):       blr x19
0x10d67404c(+76):       add sp, sp, #848
0x10d674050(+80):       ldp x29, x30, [sp, #32]
0x10d674054(+84):       ldp x20, x19, [sp, #16]
0x10d674058(+88):       ldp x28, x27, [sp], #48
0x10d67405c(+92):       ret
shoe42 commented 2 years ago

So running the test from https://github.com/terralang/terra/issues/597#issuecomment-1221050915 gives:

Config:               0x0x0000000000000000
xBCLeftInflowProfile: 0x0x0000000000000000
xBCLeftHeat:          0x0x0000000000000000

Which is strange. Even stranger, re-adding the two memcpy calls to copy the strings before printing addresses gives:

Config:               0x0x0000000000007ffc
xBCLeftInflowProfile: 0x0x0000000000007ffc
xBCLeftHeat:          0x0x0000000000007ffc
shoe42 commented 2 years ago

Removing the call to terralib.saveobj and disas seems to flip the above behaviour with 0x0000000000000000/0x0000000000007ffc. Not a clue what's going on.

shoe42 commented 2 years ago

Could this somehow be an issue with vararg behaviour on aarch64 macOS being different to aarch64 Linux/general ARMv8 or aarch64 ABI? https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms

elliottslaughter commented 2 years ago

Hm, maybe. I'd expect LLVM to handle most differences in the C ABI, at least when calling functions with scalar arguments (including, I believe, variadic functions).

But let's test that! We can run Clang to see what code it generates:

#include "stdbool.h"
#include "stdint.h"
#include "stdlib.h"
#include "stdio.h"
#include "string.h"

typedef int TempProfile;
#define TempProfile_File 0

typedef int InflowProfile;
#define InflowProfile_File 0
#define InflowProfile_SuctionAndBlowing 1

struct Config {
  struct  {
    struct  {
      int32_t type;
      union  {
        struct  {
          int8_t FileDir[256];
        } File;
      } u;
    } xBCLeftHeat;
    struct  {
      int32_t type;
      union  {
        struct  {
          double addedVelocity;
          int8_t FileDir[256];
        } File;
        struct  {
          double sigma;
          struct  {
            uint32_t length;
            double values[10];
          } beta;
          double Zw;
          struct  {
            uint32_t length;
            double values[10];
          } A;
          struct  {
            uint32_t length;
            double values[10];
          } omega;
        } SuctionAndBlowing;
      } u;
    } yBCLeftInflowProfile;
    struct  {
      int32_t type;
      union  {
        struct  {
          double addedVelocity;
          int8_t FileDir[256];
        } File;
        struct  {
          double sigma;
          struct  {
            uint32_t length;
            double values[10];
          } beta;
          double Zw;
          struct  {
            uint32_t length;
            double values[10];
          } A;
          struct  {
            uint32_t length;
            double values[10];
          } omega;
        } SuctionAndBlowing;
      } u;
    } xBCLeftInflowProfile;
  } BC;
};

int main() {
  struct Config config;
  printf("Config:               %p\n", &(config));
  printf("xBCLeftInflowProfile: %p\n", &(config.BC.xBCLeftInflowProfile.u.File.FileDir));
  printf("xBCLeftHeat:          %p\n", &(config.BC.xBCLeftHeat.u.File.FileDir));
  return 0;
}

Then run:

clang bug372d.c -o bug372.ll -S -emit-llvm

Should give you the unoptimized LLVM IR, which we can compare to what Terra produces.

You can also run the program to sanity check it's behavior:

(On x86:)

$ clang bug372d.c -o bug372
$ ./bug372 
Config:               0x7ff7babbb500
xBCLeftInflowProfile: 0x7ff7babbb738
xBCLeftHeat:          0x7ff7babbb504

For best results, it's important to use the same Clang that you used to build Terra.

Note in the C program I also used %p instead of %0.16p. It turns out the latter is undefined behavior for pointers. I guess we should try that out in Terra too in case it makes any difference.

We could potentially add back more of bug372d.t, and write the corresponding pieces of bug372d.c. By comparing the generated code, we should eventually be able to spot the problem. Not all differences are an issue (Terra intentionally does some of its code generation differently), but the problem will (likely) be contained in the differences in the generated code.

Another possible explanation is that we're doing something bad with the JIT, but if so I don't see the mechanism for how that would be going wrong so far.

shoe42 commented 2 years ago

%struct.Config = type { %struct.anon } %struct.anon = type { %struct.anon.0, %struct.anon.2, %struct.anon.9 } %struct.anon.0 = type { i32, %union.anon } %union.anon = type { %struct.anon.1 } %struct.anon.1 = type { [256 x i8] } %struct.anon.2 = type { i32, %union.anon.3 } %union.anon.3 = type { %struct.anon.5 } %struct.anon.5 = type { double, %struct.anon.6, double, %struct.anon.7, %struct.anon.8 } %struct.anon.6 = type { i32, [10 x double] } %struct.anon.7 = type { i32, [10 x double] } %struct.anon.8 = type { i32, [10 x double] } %struct.anon.9 = type { i32, %union.anon.10 } %union.anon.10 = type { %struct.anon.12 } %struct.anon.12 = type { double, %struct.anon.13, double, %struct.anon.14, %struct.anon.15 } %struct.anon.13 = type { i32, [10 x double] } %struct.anon.14 = type { i32, [10 x double] } %struct.anon.15 = type { i32, [10 x double] } %struct.anon.11 = type { double, [256 x i8] }

@.str = private unnamed_addr constant [26 x i8] c"Config: %p\0A\00", align 1 @.str.1 = private unnamed_addr constant [26 x i8] c"xBCLeftInflowProfile: %p\0A\00", align 1 @.str.2 = private unnamed_addr constant [26 x i8] c"xBCLeftHeat: %p\0A\00", align 1

; Function Attrs: noinline nounwind optnone ssp uwtable define i32 @main() #0 { %1 = alloca i32, align 4 %2 = alloca %struct.Config, align 8 store i32 0, i32 %1, align 4 %3 = call i32 (i8, ...) @printf(i8 getelementptr inbounds ([26 x i8], [26 x i8] @.str, i64 0, i64 0), %struct.Config %2) %4 = getelementptr inbounds %struct.Config, %struct.Config %2, i32 0, i32 0 %5 = getelementptr inbounds %struct.anon, %struct.anon %4, i32 0, i32 2 %6 = getelementptr inbounds %struct.anon.9, %struct.anon.9 %5, i32 0, i32 1 %7 = bitcast %union.anon.10 %6 to %struct.anon.11 %8 = getelementptr inbounds %struct.anon.11, %struct.anon.11 %7, i32 0, i32 1 %9 = call i32 (i8, ...) @printf(i8 getelementptr inbounds ([26 x i8], [26 x i8] @.str.1, i64 0, i64 0), [256 x i8] %8) %10 = getelementptr inbounds %struct.Config, %struct.Config %2, i32 0, i32 0 %11 = getelementptr inbounds %struct.anon, %struct.anon %10, i32 0, i32 0 %12 = getelementptr inbounds %struct.anon.0, %struct.anon.0 %11, i32 0, i32 1 %13 = bitcast %union.anon %12 to %struct.anon.1 %14 = getelementptr inbounds %struct.anon.1, %struct.anon.1 %13, i32 0, i32 0 %15 = call i32 (i8, ...) @printf(i8 getelementptr inbounds ([26 x i8], [26 x i8] @.str.2, i64 0, i64 0), [256 x i8]* %14) ret i32 0 }

declare i32 @printf(i8*, ...) #1

attributes #0 = { noinline nounwind optnone ssp uwtable "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-trapping-math"="true" "probe-stack"="__chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="apple-m1" "target-features"="+aes,+crc,+crypto,+dotprod,+fp-armv8,+fp16fml,+fullfp16,+lse,+neon,+ras,+rcpc,+rdm,+sha2,+sha3,+sm4,+v8.5a,+zcm,+zcz" } attributes #1 = { "frame-pointer"="non-leaf" "no-trapping-math"="true" "probe-stack"="__chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="apple-m1" "target-features"="+aes,+crc,+crypto,+dotprod,+fp-armv8,+fp16fml,+fullfp16,+lse,+neon,+ras,+rcpc,+rdm,+sha2,+sha3,+sm4,+v8.5a,+zcm,+zcz" }

!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6, !7, !8} !llvm.ident = !{!9}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 12, i32 3]} !1 = !{i32 1, !"wchar_size", i32 4} !2 = !{i32 1, !"branch-target-enforcement", i32 0} !3 = !{i32 1, !"sign-return-address", i32 0} !4 = !{i32 1, !"sign-return-address-all", i32 0} !5 = !{i32 1, !"sign-return-address-with-bkey", i32 0} !6 = !{i32 7, !"PIC Level", i32 2} !7 = !{i32 7, !"uwtable", i32 1} !8 = !{i32 7, !"frame-pointer", i32 1} !9 = !{!"Apple clang version 13.1.6 (clang-1316.0.21.2.5)"}

- Sanity check of `bug372d.c`:
```sh
$ clang bug372d.c -o bug372 && ./bug372
Config:               0x16daf2bc0
xBCLeftInflowProfile: 0x16daf2df8
xBCLeftHeat:          0x16daf2bc4
shoe42 commented 2 years ago

So, been doing a bit of digging, and this does seem to be a calling convention issue.

Here's the disassembled _main of the C version, bug372d.c:

0000000100003ec4 <_main>:
100003ec4: f4 4f be a9  stp x20, x19, [sp, #-32]!
100003ec8: fd 7b 01 a9  stp x29, x30, [sp, #16]
100003ecc: fd 43 00 91  add x29, sp, #16
100003ed0: ff 83 0d d1  sub sp, sp, #864
100003ed4: 1f 20 03 d5  nop
100003ed8: 88 09 00 58  ldr x8, 0x100004008 <_printf+0x100004008>
100003edc: 08 01 40 f9  ldr x8, [x8]
100003ee0: a8 83 1e f8  stur    x8, [x29, #-24]
100003ee4: f3 43 00 91  add x19, sp, #16
100003ee8: f3 03 00 f9  str x19, [sp]
100003eec: e0 03 00 10  adr x0, #124
100003ef0: 1f 20 03 d5  nop
100003ef4: 1a 00 00 94  bl  0x100003f5c <_printf+0x100003f5c>
100003ef8: 68 e2 08 91  add x8, x19, #568
100003efc: e8 03 00 f9  str x8, [sp]
100003f00: 00 04 00 50  adr x0, #130
100003f04: 1f 20 03 d5  nop
100003f08: 15 00 00 94  bl  0x100003f5c <_printf+0x100003f5c>
100003f0c: 68 02 7e b2  orr x8, x19, #0x4
100003f10: e8 03 00 f9  str x8, [sp]
100003f14: 40 04 00 10  adr x0, #136
100003f18: 1f 20 03 d5  nop
100003f1c: 10 00 00 94  bl  0x100003f5c <_printf+0x100003f5c>
100003f20: a8 83 5e f8  ldur    x8, [x29, #-24]
100003f24: 1f 20 03 d5  nop
100003f28: 09 07 00 58  ldr x9, 0x100004008 <_printf+0x100004008>
100003f2c: 29 01 40 f9  ldr x9, [x9]
100003f30: 3f 01 08 eb  cmp x9, x8
100003f34: c1 00 00 54  b.ne    0x100003f4c <_main+0x88>
100003f38: 00 00 80 52  mov w0, #0
100003f3c: ff 83 0d 91  add sp, sp, #864
100003f40: fd 7b 41 a9  ldp x29, x30, [sp, #16]
100003f44: f4 4f c2 a8  ldp x20, x19, [sp], #32
100003f48: c0 03 5f d6  ret
100003f4c: 01 00 00 94  bl  0x100003f50 <_printf+0x100003f50>

Note how it stores the first argument to printf in x0, and the second argument (the vararg one) on the stack (str x8, [sp]). I was under the impression x0-x7 were used for variable arguments where possible, and overflow would be placed onto the stack, but I could be wrong.

Anyway, disassembly of the Terra-generated _main (not _$main), built from the LLVM IR output from https://github.com/terralang/terra/issues/597#issuecomment-1221050915, is as follows:

0000000100003e4c <_main>:
100003e4c: f4 4f be a9  stp x20, x19, [sp, #-32]!
100003e50: fd 7b 01 a9  stp x29, x30, [sp, #16]
100003e54: ff 43 0d d1  sub sp, sp, #848
100003e58: 00 05 00 10  adr x0, #160
100003e5c: 1f 20 03 d5  nop
100003e60: e1 23 00 91  add x1, sp, #8
100003e64: f3 23 00 91  add x19, sp, #8
100003e68: 21 00 00 94  bl  0x100003eec <_printf+0x100003eec>
100003e6c: 20 05 00 50  adr x0, #166
100003e70: 61 e2 08 91  add x1, x19, #568
100003e74: 1f 20 03 d5  nop
100003e78: 1d 00 00 94  bl  0x100003eec <_printf+0x100003eec>
100003e7c: 80 05 00 10  adr x0, #176
100003e80: 61 02 7e b2  orr x1, x19, #0x4
100003e84: 1f 20 03 d5  nop
100003e88: 19 00 00 94  bl  0x100003eec <_printf+0x100003eec>
100003e8c: ff 43 0d 91  add sp, sp, #848
100003e90: fd 7b 41 a9  ldp x29, x30, [sp, #16]
100003e94: f4 4f c2 a8  ldp x20, x19, [sp], #32
100003e98: c0 03 5f d6  ret

Note how it's using both x0 and x1 for passing arguments to printf. It's even calculating the correct addresses: the 2nd and 3rd addresses should be 568 bytes and 4 bytes greater than &config, respectively. However, there's no stack.

Running the first _main gives the following output:

Config:               0x16f0de690
xBCLeftInflowProfile: 0x16f0de8c8
xBCLeftHeat:          0x16f0de694

And the Terra-generated version gives:

Config:               0x100b58310
xBCLeftInflowProfile: 0x100b58310
xBCLeftHeat:          0x100b58310

This looked strange initially as multiple runs of the C version always seemed to have the stack somewhere in the 0x16xxxxxxx range. So, running the Terra-generated _main through lldb shows that x1 is having the correct address calculated:

...
(lldb) 
Process 29276 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x0000000100003e68 bug372_terra`main + 28
bug372_terra`main:
->  0x100003e68 <+28>: bl     0x100003eec               ; symbol stub for: printf
    0x100003e6c <+32>: adr    x0, #0xa6                 ; "xBCLeftInflowProfile: %p\n"
    0x100003e70 <+36>: add    x1, x19, #0x238
    0x100003e74 <+40>: nop    
Target 0: (bug372_terra) stopped.
(lldb) reg read x0 x1
      x0 = 0x0000000100003ef8  "Config:               %p\n"
      x1 = 0x000000016fdfe648
...
(lldb) 
Process 29276 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x0000000100003e78 bug372_terra`main + 44
bug372_terra`main:
->  0x100003e78 <+44>: bl     0x100003eec               ; symbol stub for: printf
    0x100003e7c <+48>: adr    x0, #0xb0                 ; "xBCLeftHeat:          %p\n"
    0x100003e80 <+52>: orr    x1, x19, #0x4
    0x100003e84 <+56>: nop    
Target 0: (bug372_terra) stopped.
(lldb) reg r x0 x1
      x0 = 0x0000000100003f12  "xBCLeftInflowProfile: %p\n"
      x1 = 0x000000016fdfe880
...
(lldb) 
Process 29276 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x0000000100003e88 bug372_terra`main + 60
bug372_terra`main:
->  0x100003e88 <+60>: bl     0x100003eec               ; symbol stub for: printf
    0x100003e8c <+64>: add    sp, sp, #0x350
    0x100003e90 <+68>: ldp    x29, x30, [sp, #0x10]
    0x100003e94 <+72>: ldp    x20, x19, [sp], #0x20
Target 0: (bug372_terra) stopped.
(lldb) reg r x0 x1
      x0 = 0x0000000100003f2c  "xBCLeftHeat:          %p\n"
      x1 = 0x000000016fdfe64c
(lldb) x/2x $sp
0x16fdfe640: 0x00084310 0x00000001

So, Terra-generated IR is, for some reason, placing the printf vararg into x1, while the C-generated IR is placing it on the stack, which is where printf is reading it from - note the last command run above, showing 0x100084310 on the stack.

I can see in the LLVM IR that Terra for some reason is defining a different set of arguments for printf:

shoe42 commented 2 years ago

Looks like this issue is with Terra's generation of the printf calls. Replacing the calls to i32 (i8*, %struct.Config*, ...)/i32 (i8*, [256 x i8]*, ...) with calls to i32 (i8*, ...) allows LLVM to properly pass the varargs:

; ModuleID = 'terra'
source_filename = "terra"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-darwin21.6.0"

%struct.Config = type { %struct.anon.1 }
%struct.anon.1 = type { %struct.anon.2, %struct.anon.4, %struct.anon.4 }
%struct.anon.2 = type { i32, %union.anon }
%union.anon = type { %struct.anon.3 }
%struct.anon.3 = type { [256 x i8] }
%struct.anon.4 = type { i32, %union.anon.5 }
%union.anon.5 = type { %struct.anon.7 }
%struct.anon.7 = type { double, %struct.anon.8, double, %struct.anon.8, %struct.anon.8 }
%struct.anon.8 = type { i32, [10 x double] }
%struct.anon.6 = type { double, [256 x i8] }

@"$string" = private unnamed_addr constant [26 x i8] c"Config:               %p\0A\00", align 1
@"$string.1" = private unnamed_addr constant [26 x i8] c"xBCLeftInflowProfile: %p\0A\00", align 1
@"$string.2" = private unnamed_addr constant [26 x i8] c"xBCLeftHeat:          %p\0A\00", align 1

define dso_local void @main() {
entry:
  %config = alloca %struct.Config, align 8
  ;%0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string", i32 0, i32 0), %struct.Config* %config)
  %0 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string", i32 0, i32 0), %struct.Config* %config)

  %1 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
  %2 = getelementptr %struct.anon.1, %struct.anon.1* %1, i32 0, i32 2
  %3 = getelementptr %struct.anon.4, %struct.anon.4* %2, i32 0, i32 1
  %4 = getelementptr %union.anon.5, %union.anon.5* %3, i32 0, i32 0
  %5 = bitcast %struct.anon.7* %4 to %struct.anon.6*
  %6 = getelementptr %struct.anon.6, %struct.anon.6* %5, i32 0, i32 1
  ;%7 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)
  %7 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)

  %8 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
  %9 = getelementptr %struct.anon.1, %struct.anon.1* %8, i32 0, i32 0
  %10 = getelementptr %struct.anon.2, %struct.anon.2* %9, i32 0, i32 1
  %11 = getelementptr %union.anon, %union.anon* %10, i32 0, i32 0
  %12 = getelementptr %struct.anon.3, %struct.anon.3* %11, i32 0, i32 0
  ;%13 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.2", i32 0, i32 0), [256 x i8]* %12)
  %13 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.2", i32 0, i32 0), [256 x i8]* %12)
  ret void
}

declare dso_local i32 @printf(i8*, ...)

; definition      {} -> {}
define dso_local void @"$main"() {
entry:
  %config = alloca %struct.Config, align 8
  %0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string", i64 0, i64 0), %struct.Config* nonnull %config)
  %1 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 2, i32 1, i32 0, i32 1
  %2 = bitcast %struct.anon.8* %1 to [256 x i8]*
  %3 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i64 0, i64 0), [256 x i8]* nonnull %2)
  %4 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 0, i32 1, i32 0, i32 0
  %5 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.2", i64 0, i64 0), [256 x i8]* nonnull %4)
  ret void
}

Result:

Config:               0x16f8c6678
xBCLeftInflowProfile: 0x16f8c68b0
xBCLeftHeat:          0x16f8c667c

Not sure why it's treating printf as a i32 (i8*, %struct.Config*, ...)/i32 (i8*, [256 x i8]*, ...) function, however. Is this due to Terra's glue generating incorrect types for functions with varargs somehow?

I'd guess that the Linux test failures, which this issue was initially opened for, are unrelated to this, so maybe this belongs better in a separate issue?

elliottslaughter commented 2 years ago

Wow, thanks for doing all this digging.

Yes, I think this is a separate issue: why is Terra thinking that printf is a i32 (i8*, T*, ...) (where T is the second argument) instead of i32 (i8*, ...). This could be in Terra's code generation, or it could be somewhere in the glue Terra uses to extract these definitions from Clang.

elliottslaughter commented 2 years ago

Just one more note: the place to look for issues is likely going to be EmitCall, or the Classify call that it immediately does. This is a bit tricky because it's not just a matter of taking the function as-is and calling it. (E.g., structs have to be decomposed into scalar/vector/array arguments in a way that's architecture-specific.) So that's probably why there's an opportunity for Terra to mess this up.

https://github.com/terralang/terra/blob/0cf6be68edccc8501991d768cc9b939982d78194/src/tcompiler.cpp#L1467-L1470

Hopefully it's just a matter of teaching Terra not to add arguments beyond those in the original function to the formal type it returns, but this may potentially go deeper into the compiler plumbing.

shoe42 commented 2 years ago

I think trying to track this down is beyond me at the moment, I'm not familiar with Terra's (or LLVM's) compilation process, but might be able to take a look at somepoint in the future. Probably worth opening a separate issue for it.

elliottslaughter commented 2 years ago

@shoe42 I have a potential fix for the varargs issue at #602. Can you please check if this works for you, and if so what do the test suite results look like after this?

elliottslaughter commented 1 year ago

Coming back to the issue of Linux AArch64, I see that the following tests are still failing:

98% tests passed, 9 tests failed out of 549
Total Test time (real) =  32.29 sec
The following tests FAILED:
    121 - defercond.t (SEGFAULT)
    209 - let2.t (SEGFAULT)
    343 - simpleglobal.t (Failed)
    396 - teststd.t (SEGFAULT)
    412 - vars.t (SEGFAULT)
    413 - vars2.t (SEGFAULT)
    415 - vecarith.t (SEGFAULT)
    424 - zeroreturn.t (SEGFAULT)
    425 - zeroreturn2.t (SEGFAULT)

From https://github.com/terralang/terra/pull/625