nanovms / ops

ops - build and run nanos unikernels
https://ops.city
MIT License

Basic rust example crashes #1649

Closed: julianbraha closed this issue 3 months ago

julianbraha commented 3 months ago

I'm trying to test out ops on the most basic Rust example, but it crashes.

Here's my main.rs:

fn main() {
    println!("yo!")
}

Which I compiled with rustc main.rs -o main

And then when I ran ops run main:

running local instance
booting /home/julian/.ops/images/main ...
en1: assigned 10.0.2.15

*** signal 4 received by tid 2, errno 0, code 1

*** Thread context:
lastvector: 0000000000000006 (Invalid opcode (UD2))
     frame: ffffc00002c02800
      type: thread
active_cpu: 00000000ffffffff
 stack top: 0000000000000000

   rax: 0000000000000000
   rbx: 0000000000000000
   rcx: 00000ebd3c766d00
   rdx: 0000000000000000
   rsi: 00000ebd3c766d00
   rdi: 0000000000000000
   rbp: 00000000ffd26de0
   rsp: 00000000ffd26d40
    r8: 00000ebd3c7a0f60
    r9: 00000ebd3c7a0ee0
   r10: 0000000000000000
   r11: 0000000000000200
   r12: 0000000000000000
   r13: 00000ebd3c766000
   r14: 0000000000000000
   r15: 00000ebd3c766000
   rip: 00000ebd3c7875ad
rflags: 0000000000010246
    ss: 000000000000002b
    cs: 0000000000000023
    ds: 0000000000000000
    es: 0000000000000000
fsbase: 0000000000000000
gsbase: 0000000000000000

frame trace:
00000000ffd26de8:   00000ebd3c786508

kernel load offset ffffffffa0d72000

loaded klibs:

stack trace:
00000000ffd26d40:   ffffc0000013f1b0
00000000ffd26d48:   0000000000000000
00000000ffd26d50:   0000000000000000
00000000ffd26d58:   ffffc0000012ffff
00000000ffd26d60:   000000000036010d
00000000ffd26d68:   00000000ffd26df0
00000000ffd26d70:   ffffc00000133d20
00000000ffd26d78:   ffffc0000013f048
00000000ffd26d80:   0000000000000000
00000000ffd26d88:   0000000000000000
00000000ffd26d90:   0000000000000000
00000000ffd26d98:   0000000000000000
00000000ffd26da0:   0000000000000000
00000000ffd26da8:   0000000000000000
00000000ffd26db0:   0000000000000000
00000000ffd26db8:   0000000000000000
00000000ffd26dc0:   0000000000000000
00000000ffd26dc8:   0000000000000000
00000000ffd26dd0:   0000000000000000
00000000ffd26dd8:   0000000000000000
00000000ffd26de0:   0000000000000000
00000000ffd26de8:   00000ebd3c786508
00000000ffd26df0:   0000000000000001
00000000ffd26df8:   00000000ffd26f70
00000000ffd26e00:   0000000000000000
00000000ffd26e08:   00000000ffd26f50
00000000ffd26e10:   00000000ffd26f40
00000000ffd26e18:   00000000ffd26f20
00000000ffd26e20:   00000000ffd26f00
00000000ffd26e28:   00000000ffd26ef0
00000000ffd26e30:   00000000ffd26ee0
00000000ffd26e38:   0000000000000000

   core dump

Here are my package versions:

  1. rustc: 1.80.0 (051478957 2024-07-21)
  2. Ops: 0.1.42
  3. Nanos: 0.1.51
eyberg commented 3 months ago

can you paste the output of the following?

* `ops profile`

* `cat /etc/lsb-release`

* `uname -a`

julianbraha commented 3 months ago

can you paste the output of ?

* `ops profile`

* `cat /etc/lsb-release`

* `uname -a`
$ ops profile
Ops version: 0.1.42
Nanos version: 0.1.51
Qemu version: 9.0.2
OS: linux
Arch: amd64
Virtualized: false

$ cat /etc/lsb-release
DISTRIB_ID=cachyos
DISTRIB_RELEASE="rolling"
DISTRIB_DESCRIPTION="CachyOS"

$ uname -a
Linux framework-laptop 6.10.3-3-cachyos #1 SMP PREEMPT_DYNAMIC Sun, 04 Aug 2024 09:34:45 +0000 x86_64 GNU/Linux
eyberg commented 3 months ago

looking at cachyos benefits/optimizations, you are more than likely trying to execute instructions that aren't being found; to figure out which one it is you can get a coredump as shown in https://docs.ops.city/ops/hypervisors/debugging#core-dumps

(note: i had to use ops run -c config.json main --nanos-version=0.1.47, which is a sep. issue)

running via gdb you should see the offending instruction it is hitting (perhaps avx512 related)

from there, you can disable the feature at compile-time https://doc.rust-lang.org/rustc/codegen-options/index.html#target-feature or we could look at it to see if it's something we could toggle on where appropriate

julianbraha commented 3 months ago

running via gdb you should see the offending instruction it is hitting (perhaps avx512 related)

Hmmm not sure how to interpret the output from GDB here. This is what I got:

$ rustc main.rs -o main

$ ops run -c config.json main --nanos-version=0.1.47
running local instance
booting /home/julian/.ops/images/main ...
en1: assigned 10.0.2.15
signal 4 (core dumped)
exit status 9

$ ops image cp main coredumps/core .

$ gdb -ex bt -ex quit main core
GNU gdb (GDB) 15.1
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from main...
[New LWP 2]
Core was generated by `main'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x0000014f6c0265ad in ?? ()
#0  0x0000014f6c0265ad in ?? ()
#1  0x0000000000000000 in ?? ()
francescolavra commented 3 months ago

Since your executable is dynamically linked, Nanos applies address space layout randomization to it by default, which is why you can't map the addresses in the backtrace to program symbols. To disable randomization, you can add a "noaslr": "t" attribute to the "ManifestPassthrough" JSON object in your config.json file, as in:

"ManifestPassthrough": {
    "coredumplimit": "150m",
    "noaslr": "t"
  }

and then re-run Ops, copy the core dump file to the host, and open it again with gdb; this time, you should be able to see the program symbols in the backtrace. To pinpoint the exact instruction that caused the fault, type disas /s *0x0000014f6c0265ad at the gdb prompt (replacing that number with the actual address in the first line of your backtrace) and see what instruction is at that address.

julianbraha commented 3 months ago

you can add a "noaslr": "t" attribute to the "ManifestPassthrough" JSON object in your config.json file

Tried this, but it didn't seem to change anything:

$ cat config.json
{
  "BaseVolumeSz": "200m",
  "ManifestPassthrough": {
    "coredumplimit": "150m",
    "noaslr": "t"
  }
}

$  ops run -c config.json main --nanos-version=0.1.47
running local instance
booting /home/julian/.ops/images/main ...
en1: assigned 10.0.2.15
signal 4 (core dumped)

$ ops image cp main coredumps/core .

$ gdb -ex bt -ex quit main core
GNU gdb (GDB) 15.1
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from main...
[New LWP 2]
Core was generated by `main'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x00000001000255ad in ?? ()
#0  0x00000001000255ad in ?? ()
#1  0x0000000000000000 in ?? ()
francescolavra commented 3 months ago

Oops, I forgot that Nanos uses a static offset of 0x400000 when ASLR is disabled, so to get to the faulting instruction you have to subtract 0x400000 from the addresses in the backtrace. In your case, the command to type at the gdb prompt would be disas /s *0xffc255ad
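
The subtraction can be sketched in Rust as a sanity check. This is only an illustration of the arithmetic described above: the constant name `NOASLR_LOAD_OFFSET` and the helper `to_file_address` are made up for this example and are not part of Ops or Nanos.

```rust
/// Static offset Nanos applies when ASLR is disabled, as stated in this thread.
/// The constant name is hypothetical.
const NOASLR_LOAD_OFFSET: u64 = 0x400000;

/// Hypothetical helper: map an address from the core dump backtrace back to
/// the address to inspect in gdb. Returns None if the address is below the
/// offset, in which case the subtraction would not be meaningful.
fn to_file_address(runtime_addr: u64) -> Option<u64> {
    runtime_addr.checked_sub(NOASLR_LOAD_OFFSET)
}

fn main() {
    // Frame #0 from the backtrace taken with "noaslr": "t".
    let frame0: u64 = 0x0000_0001_0002_55ad;
    match to_file_address(frame0) {
        Some(addr) => println!("disas /s *{:#x}", addr), // prints "disas /s *0xffc255ad"
        None => eprintln!("address {:#x} is below the load offset", frame0),
    }
}
```

Using `checked_sub` rather than a plain `-` avoids a panic or wraparound if an address smaller than the offset is passed in by mistake.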

julianbraha commented 3 months ago

Okay, it looks like the problematic instruction is vmovdqu8, which sure enough, is avx512.
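
For reference, `vmovdqu8` belongs to the AVX-512BW subset. A minimal Rust sketch for checking whether the CPU a program runs on reports AVX-512 support, assuming an x86 or x86_64 target; the function name `avx512_report` is made up for illustration, and the binary itself would have to be built so that it does not hit the same faulting startup code before reaching `main`:

```rust
/// Hypothetical helper: report whether the current CPU advertises the
/// AVX-512 Foundation and Byte/Word subsets, using runtime detection
/// from the standard library.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx512_report() -> [(&'static str, bool); 2] {
    [
        ("avx512f", std::is_x86_feature_detected!("avx512f")),
        ("avx512bw", std::is_x86_feature_detected!("avx512bw")),
    ]
}

/// Fallback for non-x86 targets, where these features never exist.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx512_report() -> [(&'static str, bool); 2] {
    [("avx512f", false), ("avx512bw", false)]
}

fn main() {
    for (name, present) in avx512_report() {
        println!("{name}: {present}");
    }
}
```

If a feature is reported as absent in the environment where the program runs, any code compiled to use it unconditionally, including code from system libraries, can fault with an illegal instruction.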

I think this must have something to do with the system libraries, because when I try to compile the rust binary for a target without avx512 (e.g. nehalem), it's still present: rustc -C target-cpu=nehalem -C target-feature=+crt-static main.rs -o main

and again, in gdb:

#0  0x0000000000472c81 in _dl_aux_init ()
#0  0x0000000000472c81 in _dl_aux_init ()
#1  0x0000000000447f40 in __libc_start_main_impl ()
#2  0x00000000004104c5 in _start ()
(gdb) disas /s 0x0000000000472c81
Dump of assembler code for function _dl_aux_init:
   0x0000000000472c60 <+0>: endbr64
   0x0000000000472c64 <+4>: push   %rbp
   0x0000000000472c65 <+5>: vpxor  %xmm0,%xmm0,%xmm0
   0x0000000000472c69 <+9>: lea    -0x627d0(%rip),%rax        # 0x4104a0 <_start>
   0x0000000000472c70 <+16>:    mov    %rsp,%rbp
   0x0000000000472c73 <+19>:    sub    $0x1a0,%rsp
   0x0000000000472c7a <+26>:    mov    %rdi,0xe00a7(%rip)        # 0x552d28 <_dl_auxv>
=> 0x0000000000472c81 <+33>:    vmovdqu8 %zmm0,0x40(%rsp)
julianbraha commented 3 months ago

I installed the x86_64-unknown-linux-musl target in rustup, and after compiling with rustc --target=x86_64-unknown-linux-musl main.rs -o main, it works!

Thanks for your help everyone. Closing.