golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.03k stars 17.67k forks source link

runtime: illegal instruction on ARMv5 #18694

Closed RobHumphrey closed 7 years ago

RobHumphrey commented 7 years ago

I've been trying to get Syncthing working on a WD My Book World Edition (White Light) NAS box, but encountered illegal instructions. Whilst investigating I found issue #15869, which seemed very similar, so hopefully I can continue to help resolve this issue.

What version of Go are you using (go version)?

go version go1.7.4 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH=""
GORACE=""
GOROOT="/usr/local/bin/go"
GOTOOLDIR="/usr/local/bin/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build191961496=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

What did you do?

Set up a simple infinite loop like:

package main

func main() {
    for true {
    }
}

Compile with:

env GOARCH=arm GOARM=5 GOOS=linux go build loopTest.go

Transfer the output across to the NAS box and run it.

If I resize the terminal window then it crashes with Illegal instruction. Additionally, as reported in #15869, if I kill it with -SIGPIPE then it also crashes with Illegal instruction.

cpuinfo of target machine:

Processor   : ARM926EJ-S rev 5 (v5l)
BogoMIPS    : 183.09
Features    : swp half thumb fastmult edsp java 
CPU implementer : 0x41
CPU architecture: 5TEJ
CPU variant : 0x0
CPU part    : 0x926
CPU revision    : 5
Cache type  : write-back
Cache clean : cp15 c7 ops
Cache lockdown  : format C
Cache format    : Harvard
I size      : 32768
I assoc     : 4
I line length   : 32
I sets      : 256
D size      : 32768
D assoc     : 4
D line length   : 32
D sets      : 256

Hardware    : Oxsemi NAS
Revision    : 0000
Serial      : 000007a0362c2ec0

Under gdb (I resized the window after starting it running):

(gdb) file loopTest
Reading symbols from /shares/Download/loopTest...done.
(gdb) r
Starting program: /shares/Download/loopTest 

Program received signal SIGILL, Illegal instruction.
0xffff0514 in ?? ()
(gdb) bt
#0  0xffff0514 in ?? ()
#1  0x0004ad2c in runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c
)
    at /usr/local/bin/go/src/runtime/signal_sigtramp.go:57
#2  0x0000001c in ?? ()
Backtrace stopped: frame did not save the PC
bradfitz commented 7 years ago

No further development is happening on the Go 1.7 branch, except for security fixes.

Can you try Go 1.8?

minux commented 7 years ago

What's your kernel version? It's probably too old and doesn't provide needed atomic operation helpers.

RobHumphrey commented 7 years ago

Updating to master doesn't seem to have made any difference.

go version devel +81a61a9 Tue Jan 17 22:37:19 2017 +0000 linux/amd64

Kernel version is 2.6.24.4. What's the minimum required for ARM architecture, as this is newer than the version given on the Minimum Requirements page?

minux commented 7 years ago

At least 3.1 as we need __kuser_cmpxchg64.

The documented requirement of 2.6.23 is for x86 only, for ARMv5, we need much newer kernels.

minux commented 7 years ago

I couldn't find any use of the fault address (PC) in our source code though. Could you please run the problem under gdb and see where does the jmp to 0xffff0514 occurr? I'd like to understand more about the problem.

minux commented 7 years ago

That is, I doubt the direct caller of that address is as showed by gdb. Perhaps you can set a hardware breakpoint on *0xffff0514 and see if it can catch the real caller?

The address is very special, and it's not within the range for kernel helpers. Maybe the problem lies elsewhere.

minux commented 7 years ago

I tried to reproduce with qemu (-M versatilepb) with 2.6.32 kernel but the program runs fine and stops fine when interrupted with Ctrl-C.

gopherbot commented 7 years ago

CL https://golang.org/cl/35352 mentions this issue.

minux commented 7 years ago

For future reference: the kernel help page is in arch/arm/kernel/entry-armv.S (v2.6.24): https://git.kernel.org/cgit/linux/kernel/git/torvalds/ linux.git/tree/arch/arm/kernel/entry-armv.S?h=v2.6.24&id= 49914084e797530d9baaf51df9eda77babc98fa8

And it gets copied into the page mapped at 0xffff0000 by traps.c: https://git.kernel.org/cgit/linux/kernel/git/torvalds/ linux.git/tree/arch/arm/kernel/traps.c?h=v2.6.24&id= 49914084e797530d9baaf51df9eda77babc98fa8#n706

OK, I know where does 0xffff0514 comes from. It's the code kernel used to implement signal return. KERN_SIGRETURN_CODE is defined as (CONFIG_VECTORS_BASE + 0x00000500) by https://git.kernel.org/cgit/linux/kernel/git/torvalds/ linux.git/tree/arch/arm/kernel/signal.h?h=v2.6.24&id= 49914084e797530d9baaf51df9eda77babc98fa8#n10 (And CONFIG_VECTOR_BASE should be 0xffff0000)

The code there is in: https://git.kernel.org/cgit/linux/kernel/git/torvalds/ linux.git/tree/arch/arm/kernel/signal.c?h=v2.6.24&id= 49914084e797530d9baaf51df9eda77babc98fa8#n45

At offset 0x514, the instruction is a thumb instruction: SWI_THUMB_RT_SIGRETURN, which no doubt is an invalid instruction for ARM mode. I still don't understand why the kernel uses thumb version of sigreturn for us, because our code is definitely ARM, and if the kernel thinks our code is thumb, the process should have crashed before we even reach the RET instruction in runtime.sigtramp.

Perhaps a safe workaround is to do our own rt_sigreturn syscall in sigtramp, I will send a CL for this. @RobHumphrey, please give the CL a try. Thanks.

minux commented 7 years ago

The patch doesn't work yet, please hold on trying it.

It needs to restore the stack pointer before the sigreturn syscall, so essentially we need to go back to SA_RESTORER.

RobHumphrey commented 7 years ago

On the command line I don't have a problem stopping the executable with Ctrl+C either.

I have no experience using gdb, but hope some of the following is useful.

Adding a hardware breakpoint with hbreak *0xffff0514 fails with No hardware breakpoint support in the target., and adding a software breakpoint at this address gives the warning

Cannot insert breakpoint 1.
Error accessing memory address 0xffff0514: Input/output error.

Adding a breakpoint at main.main works. Neither resizing the window nor kill -SIGPIPE have any impact when I step again - it just remains in the infinite loop.

Using the code built from master gives

Program received signal SIGILL, Illegal instruction.
0xffff0514 in ?? ()
(gdb) bt
#0  0xffff0514 in ?? ()
#1  0x00043780 in runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c
)
    at /home/rob/src/go/src/runtime/signal_unix.go:258
#2  0x0000001c in ?? ()
Backtrace stopped: frame did not save the PC

Adding a breakpoint at 0x43780 gives

Breakpoint 1, runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c
)
    at /home/rob/src/go/src/runtime/signal_unix.go:259
259 in /home/rob/src/go/src/runtime/signal_unix.go
(gdb) bt
#0  runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c
) at /home/rob/src/go/src/runtime/signal_unix.go:259
#1  0x0005cf50 in runtime.sigtramp ()
    at /home/rob/src/go/src/runtime/sys_linux_arm.s:360
#2  0xffff050c in ?? ()

I've been trying to find a way of seeing what's happening when the signal occurs. I can stop the code with handle SIGPIPE stop. However, it seems that once I've reached the tight loop then whenever I try to step then gdb automatically detects that we're in an infinite loop and so doesn't do anything. Does that sound likely? If I continue instead of step then the Illegal instruction occurs.

Is there a way to add something simple into the loop (eg an increment or nop) that won't get optimised away? Hopefully that would then stop gdb detecting the infinite loop, and so allow me to step and see what's happening.

minux commented 7 years ago

The patch is ready. @RobHumphrey please give it a try. Thanks.

If you haven't used Gerrit before, download this zip which contains a patch for the latest master branch. I think it should apply cleanly to Go 1.7.4 too. https://go-review.googlesource.com/changes/35352/revisions/3/patch?zip

minux commented 7 years ago

If you have time, I'd like to understand better why the kernel is giving us a bad returning address.

$ gdb ./loopTest (gdb) b *'runtime.sigtramp' (gdb) r // Make the process crash with SIGILL as usual, and // gdb should stop the process right when it enters // the function 'runtime.sigtramp' (gdb) info registers // Please paste the output generated by this command.

That is, I want the registers when the process hits 'runtime.sigtramp'. Please do this without using the patch. Specifically, I'm interested in the LR and CPSR register values.

If my theory was right, that this is a kernel bug, then LR should be 0xffff0514. If it's not, then something else is going on and my fix just papers over the real cause.

RobHumphrey commented 7 years ago

Dumping the registers as requested, still without the patch applied:

r0             0xd  13
r1             0x10309c60   271621216
r2             0x10309ce0   271621344
r3             0x0  0
r4             0x10312094   271655060
r5             0x38 56
r6             0x0  0
r7             0x758dc  481500
r8             0x3fffffff   1073741823
r9             0xffffffe1   4294967265
r10            0x103000f0   271581424
r11            0x758dc  481500
r12            0x0  0
sp             0x10309c60   0x10309c60
lr             0xffff050c   4294903052
pc             0x5cf28  0x5cf28 <runtime.sigtramp>
fps            0x0  0
cpsr           0x10 16

The LR contains the same value I saw on the back trace last night. Stepping through until it crashed, the final few instructions were:

0x0005bfc8  689 in /home/rob/src/go/src/runtime/asm_arm.s
0x5bfc8 <setg+8>:   ldrsb   r0, [r11]
690 in /home/rob/src/go/src/runtime/asm_arm.s
0x5bfcc <setg+12>:  cmp r0, #0  ; 0x0
691 in /home/rob/src/go/src/runtime/asm_arm.s
0x5bfd0 <setg+16>:  beq 0x5bfd8 <setg+24>
694 in /home/rob/src/go/src/runtime/asm_arm.s
0x5bfd8 <setg+24>:  mov r0, r10
695 in /home/rob/src/go/src/runtime/asm_arm.s
0x5bfdc <setg+28>:  add pc, lr, #0  ; 0x0
runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c) at /home/rob/src/go/src/runtime/signal_unix.go:259
259 /home/rob/src/go/src/runtime/signal_unix.go: No such file or directory.
    in /home/rob/src/go/src/runtime/signal_unix.go
0x43780 <runtime.sigtrampgo+392>:   ldrb    r0, [sp, #20]
0x00043784  259 in /home/rob/src/go/src/runtime/signal_unix.go
0x43784 <runtime.sigtrampgo+396>:   cmp r0, #0  ; 0x0
0x00043788  259 in /home/rob/src/go/src/runtime/signal_unix.go
0x43788 <runtime.sigtrampgo+400>:   beq 0x437c4 <runtime.sigtrampgo+460>
262 in /home/rob/src/go/src/runtime/signal_unix.go
0x437c4 <runtime.sigtrampgo+460>:   ldr pc, [sp], #96
runtime.sigtramp () at /home/rob/src/go/src/runtime/sys_linux_arm.s:361
361 /home/rob/src/go/src/runtime/sys_linux_arm.s: No such file or directory.
    in /home/rob/src/go/src/runtime/sys_linux_arm.s
0x5cf50 <runtime.sigtramp+40>:  ldr pc, [sp], #16
Program received signal SIGILL, Illegal instruction.
0xffff0514 in ?? ()

Putting a breakpoint at 0x437c4 (two instructions before the crash), looking also at the registers and the data at the SP address:

Breakpoint 1, runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c
)
    at /home/rob/src/go/src/runtime/signal_unix.go:262
262 /home/rob/src/go/src/runtime/signal_unix.go: No such file or directory.
    in /home/rob/src/go/src/runtime/signal_unix.go
(gdb) info registers
r0             0x0  0
r1             0x0  0
r2             0x0  0
r3             0x1  1
r4             0x10309c54   271621204
r5             0x10309c54   271621204
r6             0x10302000   271589376
r7             0x758dc  481500
r8             0x3fffffff   1073741823
r9             0xffffffe1   4294967265
r10            0x103000f0   271581424
r11            0xae36b  713579
r12            0x0  0
sp             0x10309bf0   0x10309bf0
lr             0x43780  276352
pc             0x437c4  0x437c4 <runtime.sigtrampgo+460>
fps            0x0  0
cpsr           0x60000010   1610612752
(gdb) x 0x10309bf0
0x10309bf0: 0x0005cf50
(gdb) si
runtime.sigtramp () at /home/rob/src/go/src/runtime/sys_linux_arm.s:361
361 /home/rob/src/go/src/runtime/sys_linux_arm.s: No such file or directory.
    in /home/rob/src/go/src/runtime/sys_linux_arm.s
(gdb) info registers
r0             0x0  0
r1             0x0  0
r2             0x0  0
r3             0x1  1
r4             0x10309c54   271621204
r5             0x10309c54   271621204
r6             0x10302000   271589376
r7             0x758dc  481500
r8             0x3fffffff   1073741823
r9             0xffffffe1   4294967265
r10            0x103000f0   271581424
r11            0xae36b  713579
r12            0x0  0
sp             0x10309c50   0x10309c50
lr             0x43780  276352
pc             0x5cf50  0x5cf50 <runtime.sigtramp+40>
fps            0x0  0
cpsr           0x60000010   1610612752
(gdb) x 0x10309c50
0x10309c50: 0xffff050c
(gdb) si

Program received signal SIGILL, Illegal instruction.
0xffff0514 in ?? ()
(gdb) info registers
r0             0x0  0
r1             0x0  0
r2             0x0  0
r3             0x1  1
r4             0x10309c54   271621204
r5             0x10309c54   271621204
r6             0x10302000   271589376
r7             0xad 173
r8             0x3fffffff   1073741823
r9             0xffffffe1   4294967265
r10            0x103000f0   271581424
r11            0xae36b  713579
r12            0x0  0
sp             0x10309c60   0x10309c60
lr             0x43780  276352
pc             0xffff0514   0xffff0514
fps            0x0  0
cpsr           0x60000010   1610612752

If you need more information than this then do let me know.

RobHumphrey commented 7 years ago

Realised I should have added that the patch does stop it crashing, even if it's not fixing quite the bug you thought was happening.

aclements commented 7 years ago

Ping @minux.

I'm surprised by this:

info registers
lr             0xffff050c   4294903052
...
0x5cf50 <runtime.sigtramp+40>:  ldr pc, [sp], #16
Program received signal SIGILL, Illegal instruction.
0xffff0514 in ?? ()

It looks to me like we very slightly stomped the saved LR somehow.

@RobHumphrey, thanks for all the debugging. If you can still reproduce this (sorry, I know it's been a while), could you redo your last experiment where you stopped right before the crash, but when you're on the ldr pc, [sp], #16 line, dump more of the stack frame with, e.g., x/24wx $sp. In particular, this instruction means we're restoring PC from what's saved at SP+16 (rather than SP, which is what you examined). It's mighty suspicious that the expected LR is at SP+0 at that point, though.

bradfitz commented 7 years ago

@aclements, this isn't by chance related to your "runtime: save r11 in ARM addmoduledata" 87a51a0787df797660b6462f2debb7188f50f249 fix, is it?

aclements commented 7 years ago

@aclements, this isn't by chance related to your "runtime: save r11 in ARM addmoduledata" 87a51a0 fix, is it?

It's possible it's related, but it's certainly not the same bug since the code modified in 87a51a0 only runs in dynamically linked Go libraries. But maybe we failed to follow the C ABI in some signal-related code.

gopherbot commented 7 years ago

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

DrPeterVC commented 6 years ago

I too have a problem running go compiled program running on a WD My Book World Edition (White Light) NAS box, I think my issue is identical. I realise this thread was closed - but was it ever resolved?

I can provide further details if requested.

aclements commented 6 years ago

@DrPeterVC, thanks for checking. Go ahead and open a new issue (with a link to this issue).