Closed RobHumphrey closed 7 years ago
No further development is happening on the Go 1.7 branch, except for security fixes.
Can you try Go 1.8?
What's your kernel version? It's probably too old and doesn't provide needed atomic operation helpers.
Updating to master doesn't seem to have made any difference.
go version devel +81a61a9 Tue Jan 17 22:37:19 2017 +0000 linux/amd64
Kernel version is 2.6.24.4. What's the minimum required for ARM architecture, as this is newer than the version given on the Minimum Requirements page?
At least 3.1 as we need __kuser_cmpxchg64.
The documented requirement of 2.6.23 is for x86 only, for ARMv5, we need much newer kernels.
I couldn't find any use of the fault address (PC) in our source code though. Could you please run the problem under gdb and see where does the jmp to 0xffff0514 occurr? I'd like to understand more about the problem.
That is, I doubt the direct caller of that address is as showed by gdb. Perhaps you can set a hardware breakpoint on *0xffff0514 and see if it can catch the real caller?
The address is very special, and it's not within the range for kernel helpers. Maybe the problem lies elsewhere.
I tried to reproduce with qemu (-M versatilepb) with 2.6.32 kernel but the program runs fine and stops fine when interrupted with Ctrl-C.
CL https://golang.org/cl/35352 mentions this issue.
For future reference: the kernel help page is in arch/arm/kernel/entry-armv.S (v2.6.24): https://git.kernel.org/cgit/linux/kernel/git/torvalds/ linux.git/tree/arch/arm/kernel/entry-armv.S?h=v2.6.24&id= 49914084e797530d9baaf51df9eda77babc98fa8
And it gets copied into the page mapped at 0xffff0000 by traps.c: https://git.kernel.org/cgit/linux/kernel/git/torvalds/ linux.git/tree/arch/arm/kernel/traps.c?h=v2.6.24&id= 49914084e797530d9baaf51df9eda77babc98fa8#n706
OK, I know where does 0xffff0514 comes from. It's the code kernel used to implement signal return. KERN_SIGRETURN_CODE is defined as (CONFIG_VECTORS_BASE + 0x00000500) by https://git.kernel.org/cgit/linux/kernel/git/torvalds/ linux.git/tree/arch/arm/kernel/signal.h?h=v2.6.24&id= 49914084e797530d9baaf51df9eda77babc98fa8#n10 (And CONFIG_VECTOR_BASE should be 0xffff0000)
The code there is in: https://git.kernel.org/cgit/linux/kernel/git/torvalds/ linux.git/tree/arch/arm/kernel/signal.c?h=v2.6.24&id= 49914084e797530d9baaf51df9eda77babc98fa8#n45
At offset 0x514, the instruction is a thumb instruction: SWI_THUMB_RT_SIGRETURN, which no doubt is an invalid instruction for ARM mode. I still don't understand why the kernel uses thumb version of sigreturn for us, because our code is definitely ARM, and if the kernel thinks our code is thumb, the process should have crashed before we even reach the RET instruction in runtime.sigtramp.
Perhaps a safe workaround is to do our own rt_sigreturn syscall in sigtramp, I will send a CL for this. @RobHumphrey, please give the CL a try. Thanks.
The patch doesn't work yet, please hold on trying it.
It needs to restore the stack pointer before the sigreturn syscall, so essentially we need to go back to SA_RESTORER.
On the command line I don't have a problem stopping the executable with Ctrl+C either.
I have no experience using gdb, but hope some of the following is useful.
Adding a hardware breakpoint with hbreak *0xffff0514
fails with No hardware breakpoint support in the target.
, and adding a software breakpoint at this address gives the warning
Cannot insert breakpoint 1.
Error accessing memory address 0xffff0514: Input/output error.
Adding a breakpoint at main.main works. Neither resizing the window nor kill -SIGPIPE
have any impact when I step again - it just remains in the infinite loop.
Using the code built from master gives
Program received signal SIGILL, Illegal instruction.
0xffff0514 in ?? ()
(gdb) bt
#0 0xffff0514 in ?? ()
#1 0x00043780 in runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c
)
at /home/rob/src/go/src/runtime/signal_unix.go:258
#2 0x0000001c in ?? ()
Backtrace stopped: frame did not save the PC
Adding a breakpoint at 0x43780 gives
Breakpoint 1, runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c
)
at /home/rob/src/go/src/runtime/signal_unix.go:259
259 in /home/rob/src/go/src/runtime/signal_unix.go
(gdb) bt
#0 runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c
) at /home/rob/src/go/src/runtime/signal_unix.go:259
#1 0x0005cf50 in runtime.sigtramp ()
at /home/rob/src/go/src/runtime/sys_linux_arm.s:360
#2 0xffff050c in ?? ()
I've been trying to find a way of seeing what's happening when the signal occurs. I can stop the code with handle SIGPIPE stop
. However, it seems that once I've reached the tight loop then whenever I try to step then gdb automatically detects that we're in an infinite loop and so doesn't do anything. Does that sound likely? If I continue instead of step then the Illegal instruction occurs.
Is there a way to add something simple into the loop (eg an increment or nop) that won't get optimised away? Hopefully that would then stop gdb detecting the infinite loop, and so allow me to step and see what's happening.
The patch is ready. @RobHumphrey please give it a try. Thanks.
If you haven't used Gerrit before, download this zip which contains a patch for the latest master branch. I think it should apply cleanly to Go 1.7.4 too. https://go-review.googlesource.com/changes/35352/revisions/3/patch?zip
If you have time, I'd like to understand better why the kernel is giving us a bad returning address.
$ gdb ./loopTest (gdb) b *'runtime.sigtramp' (gdb) r // Make the process crash with SIGILL as usual, and // gdb should stop the process right when it enters // the function 'runtime.sigtramp' (gdb) info registers // Please paste the output generated by this command.
That is, I want the registers when the process hits 'runtime.sigtramp'. Please do this without using the patch. Specifically, I'm interested in the LR and CPSR register values.
If my theory was right, that this is a kernel bug, then LR should be 0xffff0514. If it's not, then something else is going on and my fix just papers over the real cause.
Dumping the registers as requested, still without the patch applied:
r0 0xd 13
r1 0x10309c60 271621216
r2 0x10309ce0 271621344
r3 0x0 0
r4 0x10312094 271655060
r5 0x38 56
r6 0x0 0
r7 0x758dc 481500
r8 0x3fffffff 1073741823
r9 0xffffffe1 4294967265
r10 0x103000f0 271581424
r11 0x758dc 481500
r12 0x0 0
sp 0x10309c60 0x10309c60
lr 0xffff050c 4294903052
pc 0x5cf28 0x5cf28 <runtime.sigtramp>
fps 0x0 0
cpsr 0x10 16
The LR contains the same value I saw on the back trace last night. Stepping through until it crashed, the final few instructions were:
0x0005bfc8 689 in /home/rob/src/go/src/runtime/asm_arm.s
0x5bfc8 <setg+8>: ldrsb r0, [r11]
690 in /home/rob/src/go/src/runtime/asm_arm.s
0x5bfcc <setg+12>: cmp r0, #0 ; 0x0
691 in /home/rob/src/go/src/runtime/asm_arm.s
0x5bfd0 <setg+16>: beq 0x5bfd8 <setg+24>
694 in /home/rob/src/go/src/runtime/asm_arm.s
0x5bfd8 <setg+24>: mov r0, r10
695 in /home/rob/src/go/src/runtime/asm_arm.s
0x5bfdc <setg+28>: add pc, lr, #0 ; 0x0
runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c) at /home/rob/src/go/src/runtime/signal_unix.go:259
259 /home/rob/src/go/src/runtime/signal_unix.go: No such file or directory.
in /home/rob/src/go/src/runtime/signal_unix.go
0x43780 <runtime.sigtrampgo+392>: ldrb r0, [sp, #20]
0x00043784 259 in /home/rob/src/go/src/runtime/signal_unix.go
0x43784 <runtime.sigtrampgo+396>: cmp r0, #0 ; 0x0
0x00043788 259 in /home/rob/src/go/src/runtime/signal_unix.go
0x43788 <runtime.sigtrampgo+400>: beq 0x437c4 <runtime.sigtrampgo+460>
262 in /home/rob/src/go/src/runtime/signal_unix.go
0x437c4 <runtime.sigtrampgo+460>: ldr pc, [sp], #96
runtime.sigtramp () at /home/rob/src/go/src/runtime/sys_linux_arm.s:361
361 /home/rob/src/go/src/runtime/sys_linux_arm.s: No such file or directory.
in /home/rob/src/go/src/runtime/sys_linux_arm.s
0x5cf50 <runtime.sigtramp+40>: ldr pc, [sp], #16
Program received signal SIGILL, Illegal instruction.
0xffff0514 in ?? ()
Putting a breakpoint at 0x437c4 (two instructions before the crash), looking also at the registers and the data at the SP address:
Breakpoint 1, runtime.sigtrampgo (sig=Unhandled dwarf expression opcode 0x9c
)
at /home/rob/src/go/src/runtime/signal_unix.go:262
262 /home/rob/src/go/src/runtime/signal_unix.go: No such file or directory.
in /home/rob/src/go/src/runtime/signal_unix.go
(gdb) info registers
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x1 1
r4 0x10309c54 271621204
r5 0x10309c54 271621204
r6 0x10302000 271589376
r7 0x758dc 481500
r8 0x3fffffff 1073741823
r9 0xffffffe1 4294967265
r10 0x103000f0 271581424
r11 0xae36b 713579
r12 0x0 0
sp 0x10309bf0 0x10309bf0
lr 0x43780 276352
pc 0x437c4 0x437c4 <runtime.sigtrampgo+460>
fps 0x0 0
cpsr 0x60000010 1610612752
(gdb) x 0x10309bf0
0x10309bf0: 0x0005cf50
(gdb) si
runtime.sigtramp () at /home/rob/src/go/src/runtime/sys_linux_arm.s:361
361 /home/rob/src/go/src/runtime/sys_linux_arm.s: No such file or directory.
in /home/rob/src/go/src/runtime/sys_linux_arm.s
(gdb) info registers
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x1 1
r4 0x10309c54 271621204
r5 0x10309c54 271621204
r6 0x10302000 271589376
r7 0x758dc 481500
r8 0x3fffffff 1073741823
r9 0xffffffe1 4294967265
r10 0x103000f0 271581424
r11 0xae36b 713579
r12 0x0 0
sp 0x10309c50 0x10309c50
lr 0x43780 276352
pc 0x5cf50 0x5cf50 <runtime.sigtramp+40>
fps 0x0 0
cpsr 0x60000010 1610612752
(gdb) x 0x10309c50
0x10309c50: 0xffff050c
(gdb) si
Program received signal SIGILL, Illegal instruction.
0xffff0514 in ?? ()
(gdb) info registers
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x1 1
r4 0x10309c54 271621204
r5 0x10309c54 271621204
r6 0x10302000 271589376
r7 0xad 173
r8 0x3fffffff 1073741823
r9 0xffffffe1 4294967265
r10 0x103000f0 271581424
r11 0xae36b 713579
r12 0x0 0
sp 0x10309c60 0x10309c60
lr 0x43780 276352
pc 0xffff0514 0xffff0514
fps 0x0 0
cpsr 0x60000010 1610612752
If you need more information than this then do let me know.
Realised I should have added that the patch does stop it crashing, even if it's not fixing quite the bug you thought was happening.
Ping @minux.
I'm surprised by this:
info registers
lr 0xffff050c 4294903052
...
0x5cf50 <runtime.sigtramp+40>: ldr pc, [sp], #16
Program received signal SIGILL, Illegal instruction.
0xffff0514 in ?? ()
It looks to me like we very slightly stomped the saved LR somehow.
@RobHumphrey, thanks for all the debugging. If you can still reproduce this (sorry, I know it's been a while), could you redo your last experiment where you stopped right before the crash, but when you're on the ldr pc, [sp], #16
line, dump more of the stack frame with, e.g., x/24wx $sp
. In particular, this instruction means we're restoring PC from what's saved at SP+16 (rather than SP, which is what you examined). It's mighty suspicious that the expected LR is at SP+0 at that point, though.
@aclements, this isn't by chance related to your "runtime: save r11 in ARM addmoduledata" 87a51a0787df797660b6462f2debb7188f50f249 fix, is it?
@aclements, this isn't by chance related to your "runtime: save r11 in ARM addmoduledata" 87a51a0 fix, is it?
It's possible it's related, but it's certainly not the same bug since the code modified in 87a51a0 only runs in dynamically linked Go libraries. But maybe we failed to follow the C ABI in some signal-related code.
Timed out in state WaitingForInfo. Closing.
(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
I too have a problem running go compiled program running on a WD My Book World Edition (White Light) NAS box, I think my issue is identical. I realise this thread was closed - but was it ever resolved?
I can provide further details if requested.
@DrPeterVC, thanks for checking. Go ahead and open a new issue (with a link to this issue).
I've been trying to get Syncthing working on a WD My Book World Edition (White Light) NAS box, but encountered illegal instructions. Whilst investigating I found issue #15869, which seemed very similar, so hopefully I can continue to help resolve this issue.
What version of Go are you using (
go version
)?What operating system and processor architecture are you using (
go env
)?What did you do?
Set up a simple infinite loop like:
Compile with:
Transfer the output across to the NAS box and run it.
If I resize the terminal window then it crashes with
Illegal instruction
. Additionally, as reported in #15869, if I kill it with-SIGPIPE
then it also crashes withIllegal instruction
.cpuinfo of target machine:
Under gdb (I resized the window after starting it running):