dinosaure closed this issue 10 months ago
Here are some ways you could troubleshoot this.
Try putting ShowCrashReports() at the top of your main() function. You'll need to #include <cosmo.h>. This should cause a crash report to be printed to stderr, provided signals aren't blocked.
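A minimal sketch of that suggestion follows. The helper name enable_crash_reports is hypothetical, and the guard assumes the toolchain predefines __COSMOPOLITAN__ (which is my understanding of cosmocc, not something stated in this thread); only ShowCrashReports() and <cosmo.h> come from the comment above.

```c
#include <stdio.h>

#ifdef __COSMOPOLITAN__   /* assumption: predefined by cosmocc-based toolchains */
#include <cosmo.h>        /* declares ShowCrashReports() */
#endif

/* Hypothetical helper: call it as the first statement of main().
   Returns 1 if crash reports were enabled, 0 on non-Cosmopolitan builds. */
static int enable_crash_reports(void) {
#ifdef __COSMOPOLITAN__
  ShowCrashReports();     /* crash report goes to stderr if signals aren't blocked */
  return 1;
#else
  fputs("not a Cosmopolitan build; crash reports unavailable\n", stderr);
  return 0;
#endif
}
```

In an OCaml program the C main() belongs to the runtime's startup code, so the call would presumably have to go there or into a C stub that runs early.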
You should be able to run gdb xxd.aarch64.elf, set break setjmp, and then stepi until it crashes, so you can poke around and see what specific thing is doing it. Saying layout asm and layout reg in gdb can be helpful if it's an assembly-level error. Otherwise, use layout src and layout reg for the source TUI.
That should hopefully give us some additional clues. I'm also surprised your terminal didn't print Segmentation fault, since xxd appears to be successfully printing its output in your trace.
Another thing worth noting for GDB debugging: I'm ashamed to admit you might want to create a symlink named /home/jart/cosmo that points to your cosmopolitan monorepo directory. That way you'll be able to see the libc source code when you debug.
I made a simpler program:
let () = print_endline "Hello World!"
Again, with ./main.aarch64.exe.dbg --strace, the program works well on linux/arm64 arm64v8/ubuntu. But without the --strace option, it segfaults again. I tried to use gdb, but if I do break setjmp and then run, the program fails with:
(gdb) break setjmp
Breakpoint 1 at 0x10000041f24: file libc/nexgen32e/setjmp.S, line 31.
(gdb) run
Starting program: /root/main.aarch64.exe.dbg
warning: Error disabling address space randomization: Operation not permitted
warning: Could not trace the inferior process.
warning: ptrace: Function not implemented
During startup program exited with code 127.
(gdb)
I'm not sure how to debug this situation. I uploaded the artifact made by the esperanto toolchain here: main.aarch64.exe.zip.
POST: I noticed that a core dump was made by qemu (due to docker run + qemu), so I uploaded it here: dump.zip
This is what I can see from gdb:
(gdb) where
#0 0x000001000004116c in ?? ()
#1 0x0000000000100000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Your qemu-aarch64 docker environment looks broken. Even when qemu + gdb works, it's painful. You'll have a much better experience if you get something like a Raspberry Pi so you can do your debugging on the genuine article.
I finally got an aarch64 machine and, indeed, the gdb output is much better than with docker and qemu. With the same main.aarch64.elf artifact, I get this output:
(gdb) run
Starting program: /root/main.aarch64.elf
Program received signal SIGSEGV, Segmentation fault.
pthread_mutex_lock (mutex=0x10000078560 <__mmi_lock_obj>) at ./libc/thread/tls.h:76
76 ./libc/thread/tls.h: No such file or directory.
(gdb) where
#0 pthread_mutex_lock (mutex=0x10000078560 <__mmi_lock_obj>) at ./libc/thread/tls.h:76
#1 0x0000010000040fd0 in __mmi_lock () at libc/intrin/mmi_lock.c:27
#2 0x0000010000034ad4 in mmap (addr=0x0, size=131072, prot=3, flags=34, fd=-1, off=0) at libc/runtime/mmap.c:478
#3 0x0000010000033dc0 in _mapanon (size=131072) at libc/runtime/mapanon.c:61
#4 0x0000010000033468 in dlmalloc_requires_more_vespene_gas (size=<optimized out>) at third_party/dlmalloc/vespene.c:31
#5 0x00000100000307f8 in sys_alloc (nb=nb@entry=65632, m=0x100000802b8 <_gm_>) at third_party/dlmalloc/dlmalloc.c:187
#6 0x00000100000312c4 in __dlmalloc (bytes=65616) at third_party/dlmalloc/dlmalloc.c:712
#7 0x000001000002ea60 in malloc (n=<optimized out>) at libc/mem/malloc.c:46
#8 0x0000010000009a7c in caml_stat_alloc_noexc (sz=65616) at memory.c:799
#9 caml_stat_alloc (sz=65616) at memory.c:821
#10 0x000001000000fbf8 in caml_open_descriptor_in (fd=0) at io.c:98
#11 0x0000010000010ba4 in caml_ml_open_descriptor_in (fd=<optimized out>) at io.c:517
#12 0x0000010000022ab4 in caml_c_call ()
#13 0x0000010000004864 in camlStdlib__entry () at stdlib.ml:314
#14 0x00000100000014d4 in caml_program ()
#15 0x0000010000022b24 in caml_start_program ()
#16 0x0000010000023384 in caml_startup_common (argv=0x10000075c28, pooling=<optimized out>, pooling@entry=0) at startup_nat.c:160
#17 0x000001000002345c in caml_startup_exn (argv=<optimized out>) at startup_nat.c:167
#18 caml_startup (argv=<optimized out>) at startup_nat.c:172
#19 caml_main (argv=<optimized out>) at startup_nat.c:179
#20 0x000001000000059c in main (argc=<optimized out>, argv=<optimized out>) at main.c:37
#21 0x0000010000000d30 in cosmo (sp=0xfffffffff3d0, m1=0x0) at libc/runtime/cosmo2.c:177
#22 0x0000010000000144 in _start () at libc/crt/crt.S:144
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) display/i $pc
1: x/i $pc
=> 0x1000004116c <pthread_mutex_lock>: ldur x1, [x28, #-128]
(gdb) info registers
x0 0x10000078560 1099512120672
x1 0x20000 131072
x2 0x3 3
x3 0x22 34
x4 0xffffffff 4294967295
x5 0x0 0
x6 0x100000741f0 1099512103408
x7 0x10000075cc0 1099512110272
x8 0x10000010b8c 1099511696268
x9 0x10000075cd8 1099512110296
x10 0x10000074068 1099512103016
x11 0x10000075cf0 1099512110320
x12 0x4fd 1277
x13 0x10000075d08 1099512110344
x14 0x7fffffffffffffff 9223372036854775807
x15 0x10000075d20 1099512110368
x16 0xfffffffff120 281474976706848
x17 0x10000001498 1099511633048
x18 0x0 0
x19 0xffffffff 4294967295
x20 0x20000 131072
x21 0x0 0
x22 0x20000 131072
x23 0x3 3
x24 0x22 34
x25 0x0 0
x26 0xfffffffff140 281474976706880
x27 0x100080400f40 17594337726272
x28 0x100080040010 17594333790224
x29 0xffffffffef60 281474976706400
x30 0x10000040fd0 1099511893968
sp 0xffffffffef60 0xffffffffef60
pc 0x1000004116c 0x1000004116c <pthread_mutex_lock>
cpsr 0x80001000 [ EL=0 BTYPE=0 SSBS N ]
fpsr 0x0 [ ]
fpcr 0x0 [ RMode=0 ]
It seems related to pthread_mutex_lock.
EDIT: it's more about TLS than pthread_mutex_lock. I think it's the __get_tls function. So it's probably related to how I link the program?
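The faulting instruction loads from x28 minus 128, so the address it touched can be worked out from the register dump. The sketch below does that arithmetic and also decodes the mmap arguments from frame #2. The helper names are hypothetical, and the flag constants are the Linux values written out by hand rather than taken from host headers.

```c
#include <stdint.h>
#include <stdio.h>

/* Effective address of `ldur xN, [base, #offset]`: base plus a signed offset. */
static uint64_t ldur_address(uint64_t base, int64_t offset) {
  return base + (uint64_t)offset; /* unsigned wraparound handles negative offsets */
}

/* Linux values, spelled out so the sketch does not depend on host headers. */
enum {
  LINUX_PROT_READ = 0x1,
  LINUX_PROT_WRITE = 0x2,
  LINUX_MAP_PRIVATE = 0x02,
  LINUX_MAP_ANONYMOUS = 0x20
};

/* Hypothetical helper that prints the decoded values from the gdb session. */
static void print_crash_context(void) {
  uint64_t x28 = 0x100080040010ULL; /* copied from `info registers` */
  printf("load address: 0x%llx\n",
         (unsigned long long)ldur_address(x28, -128));
  printf("prot=3 is READ|WRITE: %d\n",
         (LINUX_PROT_READ | LINUX_PROT_WRITE) == 3);
  printf("flags=34 is PRIVATE|ANONYMOUS: %d\n",
         (LINUX_MAP_PRIVATE | LINUX_MAP_ANONYMOUS) == 34);
}
```

The load address comes out as 0x10008003ff90. The mmap call itself looks like an ordinary anonymous read/write mapping of 128 KiB requested by malloc, which is consistent with the problem being the value in x28 rather than the arguments.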
Now that's very interesting. Thanks for getting the RasPi. That's going to make it much easier for me to support you.
On AARCH64, Cosmopolitan reserves the x28 register for itself. It's the Libc register. We need it in order to do thread-local storage in such a way that it'll work on platforms like Apple Silicon, be easy, and most importantly be fast. In order for it to work, cosmocc is designed to compile every single module in your application using the -ffixed-x28 flag.
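One way to check whether something has overwritten x28 is to snapshot it early and compare later. This is only a sketch: read_x28 and x28_unchanged_since are hypothetical helpers, not Cosmopolitan APIs, and the comparison is only meaningful when every module is built with -ffixed-x28 so that the C compiler never allocates the register itself.

```c
#include <stdint.h>

/* Read the current value of x28. Returns 0 on non-AArch64 hosts,
   where the register does not exist. */
static uintptr_t read_x28(void) {
#if defined(__aarch64__)
  uintptr_t value;
  __asm__ volatile("mov %0, x28" : "=r"(value));
  return value;
#else
  return 0;
#endif
}

/* Returns 1 if x28 still holds the value captured earlier, 0 otherwise. */
static int x28_unchanged_since(uintptr_t saved) {
  return read_x28() == saved;
}
```

A possible use is to store read_x28() in a global at the start of main() and call x28_unchanged_since() from C code that is reached through other runtimes or hand-written assembly. A mismatch would suggest that code in between is using x28 for its own purposes.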
There are two likely causes of this issue:
Hope this helps!
There's some handwritten OCaml assembly source code somewhere that is clobbering x28. In that case, you need to change the assembly code to leave x28 alone.
I think that's the case: OCaml seems to use x28 to store the state of the domain (something like metadata needed by the OCaml runtime). It seems that you fix x18 as well; should I try to prevent OCaml from using these registers?
Yes, x18 is the platform register. We can't use it because Apple reserves it. https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms
I'm finally able to produce a working executable with OCaml. I restricted OCaml to use fewer registers and left x28 free for Cosmopolitan (I decided to take x25 instead). So now it works :tada: for small projects. I hope that it will work for bigger projects! The patch on the OCaml compiler is available here.
I'm currently trying to upgrade esperanto with Cosmopolitan 3.1.3. Currently the artifact works fine on x86_64 but it segfaults on aarch64 (I used docker run -it --rm --platform linux/arm64 arm64v8/ubuntu to test). However, if I use the --strace option, the program works fine.
Currently, the way to produce such an artifact from OCaml code is a bit hard but available here: https://github.com/dinosaure/esperanto/pull/43 (see the ./README.md updated according to the usage of apelink). I uploaded what I produced from this little OCaml project: hxd. This is the output of --strace with echo "Salut"|./xxd.com:
And this is the output for --ftrace:
The artifact was made with aarch64-unknown-cosmo-cc on one side and x86_64-unknown-cosmo-cc on the other side. These artifacts were linked together with:
These artifacts were built with dune and the Esperanto toolchain.
xxd.zip