Possible error in 'jal' insn implementation

sergeysmolov commented 6 years ago

The current implementation for 'jal' instruction shows strange behavior. When I emulate the following assembler code on QEMU:

.text .org 0x1000

.globl _start _start: nop nop

la s1, jal_label nop lui t3, 0x8 auipc t4, 0x104c nop jal t5, 0x4 nop

it ceases the execution with the following error:

Trying to execute code outside RAM or ROM at 0x0000000000000004

The emulation stops after 'jal' instruction execution. I've added some debug print into "gen_jal" function and it seems that ctx->pc field is 0 here. When I've substituted the following line:

gen_goto_tb(ctx, 0, ctx->pc + imm);

into

ctx->pc = ctx->next_pc;
gen_goto_tb(ctx, 0, ctx->pc);

the execution has continued. Probably, my change is locally-correct but not globally. I hope that the current implementation for 'jal' will be fixed properly.

bkoppelmann commented 6 years ago

Thanks for the bug report, however I cannot reproduce it here. How do you link your binary? And what is jal_label? Cheers, Bastian

michaeljclark commented 6 years ago

@sergeysmolov Can you disassemble your binary?

It appears from the asm and the message that the code is trying to jump to 0x4 and this would appear to be the behaviour because the riscv port doesn't implement the do_unassigned_access hook which is called by cpu_unassigned_access just before the error_report in ./accel/tcg/cputlb.c.

If we implement the do_unassigned_access hook we can raise a fault fetch. The issue is that if we implement riscv_cpu_unassigned_access, then we need to get access to the current program counter. We should be able to call GETPC() in riscv_cpu_unassigned_access but we need the PC to point to the JAL instruction.

We should raise another issue to implement riscv_cpu_unassigned_access

michaeljclark commented 6 years ago

We actually fixed some bugs which result in a changed behaviour. i.e. cpu_unassigned_access. 0x0 in earlier riscv-qemu versions was erroneously mapped as ROM on some of the boards, whereas it should be Debug. I'll have to sift through the changes to find the particular commit id...

michaeljclark commented 6 years ago

Ideally post an executation trace of your program with -d in_asm,op,op_opt,out_asm, then we can see exactly what is happening...

sergeysmolov commented 6 years ago

Here is the assembler program called "instruction.s" (it does not have any algorithm, just a sequence of commands):

.text .org 0x1000

.globl _start _start: nop nop

test: nop nop

la s1, jal_label nop lui t3, 0x8 auipc t4, 0x104c nop jal t5, 0x4 nop j j_label nop nop nop j_label: nop jal t5, jal_label nop jal_label: nop lui s0, 0x1

success: error: nop nop nop nop nop nop wfi

Here is the bootloader called "boot.s":

.text
.org 0x380
nop
csrr ra, uepc
addi ra, ra, 4
ret

Here is the linker script called "link.ld":

ENTRY(_start) SECTIONS { . = 0x0000; .text : { (.text) } . = 0x80000; .data : { (.data) } .bss : { *(.bss COMMON) } . = ALIGN(8); . = . + 0x10000; stack_top = .; }

The compilation is as follows: /opt/riscv/riscv64-unknown-linux-gnu/bin/as instruction.s -o instruction.o /opt/riscv/riscv64-unknown-linux-gnu/bin/as boot.s -o boot.o /opt/riscv/riscv64-unknown-linux-gnu/bin/ld instruction.o boot.o -T link.ld -o instruction.elf

I run the ELF file on QEMU: qemu-system-riscv64 -M spike_v1.10 -cpu any -d in_asm,op,op_opt,out_asm -nographic -singlestep -D log.txt -kernel instruction.elf

and I get the error above. I've attached the "log.txt" file. log.txt

sergeysmolov commented 6 years ago

The log file contains calls to my own "trace_reg_access" function. You may ignore them - it is no more than a debug print .

bkoppelmann commented 6 years ago

I think the problem is the asm here. You assume for jal t5, 0x4 that the immediate field is 0x4, such that the effective address will be pc + 0x4. However, the asm assumes that you want to jump to the effective address 0x4. Therefor the immediate of jal is -4132, which you can confirm in the IN: disasm of your logfile. Thus you end up at address 0x4 :)

sergeysmolov commented 6 years ago

However, the asm assumes that you want to jump to the effective address 0x4.

Why do you think so? It is not clear for me why does the "-4132" value appear here.

Moreover, if we look at "gen_jal" function (target/riscv/translate.c), we will see the following:

gen_goto_tb(ctx, 0, ctx->pc + imm);

When I run my program on QEMU, ctx->pc is equal to 0 (and imm is 0x4) here at the time when the "jal" instruction is simulated. How could this fact be explained?

bkoppelmann commented 6 years ago

Why do you think so? It is not clear for me why does the "-4132" value appear here.

-4132 is the offset from the pc encoded in the instruction. -4132 is 0xffffefdc and if you look into the instruction encoding of the offending jal, e.g. with riscv64-objdump, you'll see 0xfddfef6f where the most significant 20 bits form the offset. I see the an awful lot of 0xf's which cannot be 0x4. If you now take this offset 0xffffefdc and add the value of the PC of the offending jal instruction you'll get 0x4, i.e. 0xffffefdc + 0x1028 = 0x4. Therefor, if you specify jal t5, 0x4 the assembler generate a jump to the address 0x4. I think the correct solution would be to add another label, i.e.

jal t5, Lfoo Lfoo: nop j j_label nop nop

When I run my program on QEMU, ctx->pc is equal to 0 (and imm is 0x4) here at the time when the "jal" instruction is simulated

Huh? This is really strange. Can you send me the binary so that I can try it out in gdb?

michaeljclark commented 6 years ago

The # 0x4 in the disassembly is the resolved PC-relative address. The disassembler resolves the address relative to the program counter. 0x1028 (the PC) is 4136 and when we add -4132 we get 4:

IN: 
0x0000000000001028:  fddfef6f          jal             t5,-4132        # 0x4

QEMU is doing the right thing here. You might have an issue in your debug statement that prints the program counter because it can't be zero.

riscvarchive / riscv-qemu

Possible error in 'jal' insn implementation #128