keith-packard / snek

Snek programming language for tiny systems
GNU General Public License v3.0
292 stars 30 forks source link

snek-arm fails on s390x with qemu >6.1 #58

Closed cpaelzer closed 2 years ago

cpaelzer commented 2 years ago

Hi, this sounds like a massive corner case of corner cases (embedded python simulated for arm running on mainframes), but the Ubuntu testing has spotted this in:

In there all work, but one test fails reproducible, that is test/pass-slice.py

When eliminating the automation in makefiles and all that it comes down to:

$ qemu-system-arm -chardev stdio,mux=on,id=stdio0 -serial none -monitor none -semihosting-config enable=on,chardev=stdio0,arg='snek',arg=test/pass-slice.py -machine mps2-an385,accel=tcg -cpu cortex-m3 -kernel /usr/share/snek/snek-qemu-arm-1.7.elf -nographic -bios none
fail: [::-5] (model 'o' impl '')

To be clear:

I already knew that all other snek tests in snek-arm work, so I have modified the test to see if everything else in the slice test would fail:

def check(model, impl, mess):
    if model != impl:
        print("fail: %s (model %r impl %r)" % (mess, model, impl))
    else:
        print("work: %s (model %r impl %r)" % (mess, model, impl))

That gave me:

work: [::] (model 'hello' impl 'hello')
work: [-5::] (model 'hello' impl 'hello')
work: [:-5:] (model '' impl '')
work: [-5:-5:] (model '' impl '')
fail: [::-5] (model 'o' impl '')
work: [-5::-5] (model 'h' impl 'h')
fail: [:-5:-5] (model 'o' impl '')
fail: [-5:-5:-5] (model '' impl 'h')
work: [::] (model [1, 2, 3, 4, 5] impl [1, 2, 3, 4, 5])
work: [-5::] (model [1, 2, 3, 4, 5] impl [1, 2, 3, 4, 5])
...

As mentioned the above happened on Ubuntu. But for the sake of trying to eliminate any Ubuntu-qemu-specifics I also used an s390x fedora35 XLD container. There I was unable to build snek there due to dependencies not existing in fedora, but I fetch and extract Ubuntu's recent snek and ran the test from snek-git vs the qemu (v6.1) in fedora35 - it fails the same way.

# fedora35 with qemu-6.1.0-14.fc35 on s390x
qemu-system-arm -chardev stdio,mux=on,id=stdio0 -serial none -monitor none -semihosting-config enable=on,chardev=stdio0,arg='snek',arg=test/pass-slice.py -machine mps2-an385,accel=tcg -cpu cortex-m3 -kernel /snek-qemu-arm-1.7.elf -nographic -bios none
fail: [::-5] (model 'o' impl '')

If you need a s390x system for debugging this you can get a free one in the linux one community cloud. I've checked and the error is reproducible in containers

I'm planning to also report this to qemu-devel, but it could be anything lacking the insight into how snek works and in this cases snek-qemu-arm-1.7.elf in particular. So if there is anything you could spot by looking at the output or checking the behavior that would in turn help the qemu developers to understand what might have changed that will be helpful.

cpaelzer commented 2 years ago

FYI - if needed Ubuntu's current snek elf file can be fetched from https://people.canonical.com/~paelzer/snek-qemu-arm-1.7.elf (as extracted from the deb)

keith-packard commented 2 years ago

whoa.

static volatile int num = -9;
static volatile int den = -5;

int
main(void)
{
        int quo = num / den;
        printf("num %d den %d quo %d\n", num, den, quo);
        exit(0);
}
$ ./snek-arm --version
QEMU emulator version 6.2.50 (v6.2.0-1556-ge56d873f0e)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
$ /snek-arm
num -9 den -5 quo 0
$ ./snek-arm-bin --version
QEMU emulator version 4.2.1 (Debian 1:4.2-3ubuntu6.19)
Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
$ ./snek-arm-bin
num -9 den -5 quo 1

These are both running exactly the same instructions, just changing the qemu version from the development tip back to the version provided in 20.04.

(gdb) x/20i main
   0x40 <main>: ldr     r2, [pc, #24]   ; (0x5c <main+28>)
   0x42 <main+2>:       push    {r3, lr}
   0x44 <main+4>:       ldr     r0, [r2, #0]
   0x46 <main+6>:       ldr     r3, [r2, #4]
   0x48 <main+8>:       ldr     r1, [r2, #0]
   0x4a <main+10>:      sdiv    r3, r0, r3
   0x4e <main+14>:      ldr     r2, [r2, #4]
   0x50 <main+16>:      ldr     r0, [pc, #12]   ; (0x60 <main+32>)
   0x52 <main+18>:      bl      0x140 <printf>
   0x56 <main+22>:      movs    r0, #0
   0x58 <main+24>:      bl      0xf8 <exit>

Here's a short debug session:

Breakpoint 1, main () at ../../chips/qemu/snek-qemu.c:135
135             int quo = num / den;
(gdb) display/i $pc
1: x/i $pc
=> 0x40 <main>: ldr     r2, [pc, #24]   ; (0x5c <main+28>)
(gdb) stepi
0x00000042      135             int quo = num / den;
1: x/i $pc
=> 0x42 <main+2>:       push    {r3, lr}
(gdb)
0x00000044      135             int quo = num / den;
1: x/i $pc
=> 0x44 <main+4>:       ldr     r0, [r2, #0]
(gdb)
0x00000046      135             int quo = num / den;
1: x/i $pc
=> 0x46 <main+6>:       ldr     r3, [r2, #4]
(gdb)
136             printf("num %d den %d quo %d\n", num, den, quo);
1: x/i $pc
=> 0x48 <main+8>:       ldr     r1, [r2, #0]
(gdb)
0x0000004a      136             printf("num %d den %d quo %d\n", num, den, quo);
1: x/i $pc
=> 0x4a <main+10>:      sdiv    r3, r0, r3
(gdb) print $r0
$1 = -9
(gdb) print $r3
$2 = -5
(gdb) stepi
0x0000004e      136             printf("num %d den %d quo %d\n", num, den, quo);
1: x/i $pc
=> 0x4e <main+14>:      ldr     r2, [r2, #4]
(gdb) print $r3
$3 = 0
(gdb)

Time to go figure out why qemu is busted.

keith-packard commented 2 years ago

Problem found and fixed in qemu. This looks like a pretty serious issue as it affects all 32-bit emulation running on 64-bit machines that require extension of 32-bit operands to 64-bits (PowerPC, Sparc and s390).

https://github.com/keith-packard/qemu/tree/signed-div-bug

I've sent this upstream to qemu for their evaluation.

I'll leave this open to remind us to poke the qemu folks until this is fixed.

keith-packard commented 2 years ago

This has been fixed in qemu now, https://github.com/qemu/qemu/commit/8929906e212cbe606e361cbd32917dcbe5bb6dd0 contains the patch.