mstatus.mie reset when hitting breakpoint

MarekPikula commented 5 years ago

I'm currently evaluating PULPissimo platform with Ibex core and I encountered some strange behaviour with interrupts. It seems that after hitting breakpoint mie in mstatus register is reset to 0.

Here is my setup:

PULPissimo with Ibex core and pulp-platform/riscv-dbg as debug core,
ZedBoard (Xilinx Zynq, master version of all cores) and Terasic DE10-Nano (Altera Cyclone-V, with modifications to make the code compatible with Quartus),
Digilent HS2 JTAG programmer,
OpenOCD compiled basing on instructions from PULPissimo,
pulp-platform/pulp-riscv-gnu-toolchain.

OpenOCD config and log as well as gdb log (including issued commands) can be found here.

Code:

#include <stdint.h>

int main() {
  register int mstatus_mie = (1<<3); // mstatus.mie
  asm volatile ("csrw 0x300, %0" :: "r" (mstatus_mie) );

  uint32_t mstatus[3];
  asm volatile ("csrr %0, 0x300" : "=r" (mstatus[0]));
  asm volatile ("nop");  // First breakpoint

  asm volatile ("csrr %0, 0x300" : "=r" (mstatus[1]));
  asm volatile ("nop");  // Second breakpoint

  asm volatile ("csrw 0x300, %0" :: "r" (mstatus_mie) );
  asm volatile ("csrr %0, 0x300" : "=r" (mstatus[2]));
  asm volatile ("nop");  // Third breakpoint

  return 0;
}

What the code does is set mie flag in mstatus register (0x300) and store it in mstatus[0], then after hitting breakpoint read it to mstatus[1] and to confirm, mstatus register is written again and result stored in mstatus[2].

Basically after sending x/3x mstatus gdb returns: 0x00001808 0x00001800 0x00001808, so in mstatus[0] and mstatus[2] mie flag is set, whereas in mstatus[1] it is not set, although, to my understanding, it should be recovered after going out of debug mode.

Is it expected behaviour? If not what can I do to further test it?

vogelpi commented 5 years ago

Thanks @MarekPikula for opening this issue. I will look into it later today.

vogelpi commented 5 years ago

Hi @MarekPikula ,

I had a look into your issue. mstatus.MIE is actually not modified when entering debug mode and thus also does not need to be recovered when leaving debug mode.

It is not clear to me, why in your case MIE is set to 0 anyway. In theory, this could be caused both by hardware or software. The hardware sets MIE to 0 either when

executing an exception. In this case, it also modifies mtval, mepc and mcause. But most importantly, it would set mstatus.MPIE to hold the old value of MIE. This is not the case here. The corresponding bit (bit 7 of mstatus) is always 0.
executing an mret instruction. In this case, MIE just gets overwritten by MPIE, which is 0 (depending on the version of the sources, we recently changed it). However, I don't know why the core should execute an mret instruction here.

Thus, I more suspect that once in debug mode, the core executes a CSR write instruction that clears MIE in software. Would that be possible?

Other possible but unlikely hardware causes:

dcsr.EBREAKM is not set in which case the controller would not enter debug mode but just trigger a breakpoint exception. This deasserts mstatus.MIE but would not make the core connect to OpenOCD. This would actually also change MPIE. But maybe you can check dcsr?
The core gets another exception after entering debug mode which deasserts mstatus.MIE but which is not properly handled (no MRET, i.e. mstatus.MIE is not restored). I don't think that would work either and you could get some info on this from mret, mtval etc.

Anyway, what would really help is if you could reproduce the problem in simulation and send me the waves as VCD. I also would need to know the exact commit of Ibex you are at. We recently had a couple of changes in the controller and CSR files and I would like to make sure that I am looking at the right sources.

Best, Pirmin

MarekPikula commented 5 years ago

Here is the debug_rom.S from PULPissimo for reference. It doesn't modify mstatus for sure.

Today I rechecked with master on ZedBoard, because I noticed that version I tested had some small modifications, but still it produces the same result.

Unfortunately I don't have a license for QuestaSim, which would allow me to make the entire PULPissimo simulation with OpenOCD connection, but I could send you waveforms from Quartus SignalTap. Let me know which signals should I tap into for your best convenience.

For Xilinx I'm at pulpissimo master, so I'm at 13313952cd50ff04489f6cf3dba9ba05c2011a8b. For Intel I'm based on 2a947c5e7fa79ea951dabcf68a566cd147d853a3 with changes from #336 and merged into pulpissimo branch.

vogelpi commented 5 years ago

Thanks @MarekPikula for getting back and providing more details. I was looking at the same sources. I guess we have to get into Quartus SignalTap and Vivado ILA then.

We basically need:

the instruction in the decoder
the state of the controller
the signals making the controller jump into FLUSH state
the signals making the controller do what it does in the FLUSH state
mstatus.MIE, mstatus.MPIE

MarekPikula commented 5 years ago

Today I made some investigation. I assembled SignalTap to your request and added some additional signals like different PCs. It seems that correctly MIE is moved to MPIE on debug request.

MPIE is being reset on illegal_insn fetched from 0x1A110300, which happens to be whereto in DM (here you can find address mapping within the DM). whereto is used by DM to point the core to the next address, so it's always jal instruction (source here). From what I can see it wants to jump to address 0x1A114360, which is within DM's address space (0x1A11????), but not pointing to anything, thus reading 0x0 instruction, which is indeed illegal instruction.

Tomorrow I'll investigate further with the debug core, but I presume it's a debug core issue rather than ibex'.

Thanks for pointing me in the right direction.

vogelpi commented 5 years ago

Thanks @MarekPikula for getting back and providing more insight. Really interesting! I am wondering, what does the illegal instruction handler do?

MarekPikula commented 5 years ago

Ok, more or less I've got it figured out. For the out-of-range 0x1A114360 jump address it was just my error due to some changes in riscv-dbg to make it work with Quartus. Either way it didn't really matter if it was 0x4360 or 0x0360, because either way DM takes only 12 lower bits from address for internal decoding.

So whereto was pointing to 0x1A110360, which happens to be Program Buffer. The instruction, which generates illegal_insn is fence.i, which according to code in _ibexdecoder.sv is not supported and correctly generates illegal instruction.

The problem is that this instruction is neither from debug_rom.S nor is generated anywhere in debug module and is sent to Program Buffer from computer (traced to dmi_req in _dmcsrs.sv). I'll make some further investigation into this what is the specific origin of this instruction later.

From what I can see implementing Zifencei extension for Ibex wouldn't be that problematic, but I don't have too much time for this right now, so maybe shall I or you open an issue with a request to implement it?

Either way it's definitely not nice of gdb to issue an instruction without knowing if it's supported, but from what I can see availability of Zifencei isn't communicated anywhere in CSRs.

asb commented 5 years ago

Thanks for the investigation Marek. I think openocd will generate fence.i in a number of cases, which is probably good motivation to at least handle fence.i as a nop.

MarekPikula commented 5 years ago

From what I understand Zifencei wouldn't be that hard to implement – it would be just flushing instruction prefetch buffer and pipeline, am I correct?

asb commented 5 years ago

That would be better. I wasn't thinking about the need to handle the prefetch buffer, should have given it more thought.

MarekPikula commented 5 years ago

Yeah, so that's exactly what I have seen in Program Buffer.

For reference:

0x1a110360: 0x0000100f  0x0000000f  0x00100073  0x00000000

In assembly:

fence.i
fence
ebreak

GregAC commented 5 years ago

Indeed we'd want the prefetch buffer flush. Plain FENCE is currently implemented as a nop (which I think is correct, but haven't given the matter much thought).

Should be fairly simple, already got other things that need to flush the buffer so not much to do.

GregAC commented 5 years ago

I'd be happy to add an implementation, I'll open an issue to track the work

MarekPikula commented 5 years ago

Here's an interesting point about fencing: riscv/riscv-openocd#315.

imphil commented 5 years ago

I need to double-check, but from what I remember when porting the debug system: OpenOCD will instructions which cause illegal instruction exceptions, as a way to feature-test. These need to be properly handled by the core and the debug system (all exceptions in debug mode go to the debug module). I'm pretty certain that worked last time I looked. I'll get back to you when I have time to have a closer look some time next week.

MarekPikula commented 5 years ago

Can you look at the issue I submitted on OpenOCD (riscv/riscv-openocd#415)? Supposedly:

The spec says in the Debug Mode section: "Exceptions don’t update any registers. That includes cause, epc, tval, dpc, and mstatus. They do end execution of the Program Buffer."

MarekPikula commented 5 years ago

@imphil it indeed works when hartid=0. Until now I used Ibex with hartid=0x20, as it is supposedly advised not to use hartids below 32 for current riscv-dbg. Now I started testing if it's really the case and indeed after connecting with OpenOCD I've got:

Error: Failed to execute fence.i

I'm guessing something is goofed in OpenOCD, possibly connected with riscv/riscv-openocd#359.

Either way, for now I'm making fence.i a nop waiting for #391.

MarekPikula commented 5 years ago

I just realised how much easier it would be if I had this message in the first place :man_facepalming:

GregAC commented 5 years ago

Ok to to summarize:

Something breaks when you use hartid=0x20?
OpenOCD does indeed test if fence.i is supported but if it isn't you get an error?

So we do want fence.i support and should also look into why 1. is hapenning?

MarekPikula commented 5 years ago

Nothing breaks if hartid=0x20 in Ibex. The only thing that is unreliable is OpenOCD, which doesn't report fence.i fail, although the core itself reports command error.
If it isn't supported I've got an error and debugging stack is pretty much unusable. I've been testing it under Eclipse and it was not possible to debug anything, as OpenOCD didn't want to cooperate after this error.

GregAC commented 5 years ago

I have discovered the ebreak instruction is also broken when using debug mode, given the PULP RISCV debug module uses this in various places I suspect this totally breaks any debug setup that uses it. I've opened an issue to track this here: https://github.com/lowRISC/ibex/issues/393 will hopefully have a PR with a fix Monday.

FENCE.I implementation may not be required, potentially still something that would be useful to add though.

MarekPikula commented 5 years ago

I forgot to mention that after merging #398 and making fence.i a nop the original issue is gone and the core is working on current master.

vogelpi commented 5 years ago

I think the original problem here is solved. There are some follow-up tasks tracked in separate issues:

Exceptions in debug mode should not have an effect on mstatus and some other CSRs: lowRISC/ibex#168
Add support for FENCE.I in Ibex: lowRISC/ibex#391
OpenOCD should test if a hard actually has support for FENCE.I an not issue it if not: riscv/riscv-openocd#415
OpenOCD does not like non-contiguous hartids and the first hartid shoud be 0: riscv/riscv-openocd#359. Because of this problem, the PULP team recommends to use a patched version of OpenOCD for PULPissimo, see here: https://github.com/pulp-platform/pulpissimo/blob/master/README.md#GDB-and-OpenOCD

I suggest to close this issue. WDYT?

MarekPikula commented 5 years ago

The patched version of OpenOCD does nothing more than rise maximum number of harts to 1024, so it doesn't really solve the issue. The patched version doesn't report missing fence.i for harts greater than 0.

lowRISC / ibex

mstatus.mie reset when hitting breakpoint #365