Simultaneous interrupts

jrrk commented 5 years ago

On Ariane, the CPU ceases to respond to interrupts after a while, and I am testing the hypothesis that it is the simultaneous arrival of multiple interrupts that is causing the problem. I attach a waveform captured from the hardware under this circumstance for further study. To save memory, only the portion of the waveform with req_i.valid high is captured. Let me know if an alternative arrangement would be more useful.

plic.zip

@zarubaf @eunchan @asb

eunchan commented 5 years ago

@jrrk the entire waveform shows that PLIC generates correct eip to the target from 0 to the end of the waveform.

You mentioned that the CPU stop responding to the interrupt (I assume it doesn't claim and complete). To help me understand clear, could you please share the interrupt handling code? I am worrying that it might happen to clear mie and never turn it on at the corner case.

jrrk commented 5 years ago

vmlinux.dis.zip

This file should have everything you need. The CPU sits in wait_for_interrupt() when idle, the handler begins in do_IRQ(). The source code is inline, but if needed, is available at linux-4.20-rc2, with driver patches here: https://github.com/lowRISC/ariane-sdk/blob/rng-tools/configs/0099-lowrisc-ethernet.patch

jrrk commented 5 years ago

I don't think the mie theory can be correct because the CPU continues to receive Ethernet interrupts after the MMC has locked up. Unbalanced claim and complete statements perhaps could cause the same issue.

eunchan commented 5 years ago

@zarubaf Could you please show the connection in Ariane? I am seeing bit 1 and bit 2 of irq_sources_i were set but don't know which bit is for which module (MMC, ethernet).

In the above waveform, bit 1 interrupt is asserted following bit 2. So irq_sources_i was 0x6. This is the first & last interrupt for bit 2 (ID 3h). It is claimed and completed correctly. And never new interrupt for bit2 arrived.

The behavior looks correct but I doubt that the core can complete interrupt routine that quickly. It took just a cycle to complete ISR. Is the waveform from dedicated simulation environment not using the linux image attached above?

jrrk commented 5 years ago

The waveform only shows the cycles where the PLIC is accessed. There can be a delay of thousands of cycles in between.

eunchan commented 5 years ago

I see. It was long time ago to seeing VCD. OK then, I will wait until Florian confirms the connection. Based on the source at Ariane ( https://github.com/pulp-platform/ariane/blob/master/tb/ariane_peripherals.sv ) It looks like bit 1 is for SPI and bit 2 is for Ethernet. So I cannot connect the relationship between the waveform and the interrupt sources.

zarubaf commented 5 years ago

I think the trace is from a version of @jrrk, not the upstream one. @jrrk can you please shine some light on what is connected where?

jrrk commented 5 years ago

My version here

eunchan commented 5 years ago

OK then bit1 is MMC. The waveform shows it kept receiving the interrupt from bit 1. last beat of the waveform shows it completes the interrupt. Sorry but I cannot find anything wrong inside PLIC.

jrrk commented 5 years ago

As you noticed bit 0 is uart, bit1 is SD and bit 2 is Ethernet. Since that waveform has been uploaded I have generated others (attached). The behaviour might be more interesting (i.e. faulty)

plic2.zip

jrrk commented 5 years ago

Screenshot from ilaplic3 vcd 2019-04-04 19-18-50

This may be observed by opening ilaplic3.vcd in gtkwave and the previously provided simult.gtkw and scrolling to the extreme right (6157 samples). Here we can see that Ethernet interrupts are being serviced but SD interrupts are not, even though SD is asking for service. Since you have a trace of what registers have been written I would hope we can establish if this is a software or hardware bug.

eunchan commented 5 years ago

I see the value of eip_targets_o, ip are changed to 0 at the end of ilaplic3.vcd. Meaning, it is sucessfully claimed but not completed. Indeed it looks like a software bug.

$$var reg 2 ') i_ariane_peripherals/i_plic/eip_targets_o [1:0] $end 

... SIM ongoing ...

#6157                                                                                                                                                                                            
b0 ?#
b0 2%
b0 /&
b0 O&
b0 ')

jrrk commented 5 years ago

@zarubaf OK, this seems to be race hazard in the specification of the PLiC itself. Because some bright spark made reading the claim register self-clearing, there is the possibility that an interrupt may be claimed and appear on the local register bus, before being accepted by the AXI interface. I'm speculating that this could happen if an instruction is issued but not committed due to an interrupt or other exception. This theory could be tested by making the PLIC claim routine a critical section if it isn't already. If somebody can tell me I am wrong about this, that would be a relief.

asb commented 5 years ago

I'm not sure I follow the theory. If an issued but uncommitted instruction generates read/writes on a memory mapped peripheral that would be a processor bug surely.

Could you elaborate on what you mean by the "local register bus". My understanding is that the software will:

Read the memory-mapped claim register. If non-zero, they successfully claimed an interrupt.
Service the interrupt
Signal completion by writing the claimed id back to the memory-mapped claim register

jrrk commented 5 years ago

By definition all the data for a load instruction must be available before it can be committed. Therefore the interrupt that has been claimed will be cancelled at the PLIC end before the instruction is committed. Therefore memory-mapped peripherals that have side-effects on read are deprecated in modern multiple-issue environments.

asb commented 5 years ago

I meant if the instruction is issued, uncommitted, and is never committed.

zarubaf commented 5 years ago

@jrrk Yes, your observation is absolutely correct. The problem is that the claim register is not idempotent but the core issues loads speculatively. Ariane does not implement PMAs which I think should govern that. Quickly discussed a fix with @msfschaffner, but this will probably take a bit longer than I can afford today.

zarubaf commented 5 years ago

Thanks a ton for all your hard debugging work!

asb commented 5 years ago

It's actually the idempotency PMA which Ariane needs to implement and respect (either set dynamically in some platform defined way, or fixed at design time). This is separate to the PMPs.

jrrk commented 5 years ago

Even if PMAs were implemented, loads are at the start of the pipeline and commits are at the end so there is a vulnerability period of several cycles every time you check for an interrupt. Any chance that using an atomic instruction could help ?

zarubaf commented 5 years ago

Yeah, I meant PMAs. Always confusing this with PMPs. I've changed it in the comment above. @jrrk Yeah it would mean some hacky way around and disabling all sources of speculation (shortly disabling the interrupts and draining the pipeline before issuing the load). I also thought about using atomics for this, but then that would mean that the plic needs to support atomic transactions (not super difficult to implement). Tbh, that is a rather messy situation the hardware has to deal with, I think the "RISC-V" way would be to support PMAs in the core.

asb commented 5 years ago

Given the prevalence of existing peripherals with side-effects on loads, and that whenever the Unix platform spec actually exists it's almost certain to require support for side-effecting MMIO loads via idempotency PMAs or by never issuing loads speculatively, I think supporting the PMAs is probably the sensible path forwards.

Shall we close this issue for now given that the current hypothesis is that this isn't actually a PLIC problem?

jrrk commented 5 years ago

I think we need to demonstrate that if we work around this issue by whatever means, then all is well and there are no other lurking reliability issues. Perhaps the previous PLIC failed for exactly the same reason.

jrrk commented 5 years ago

Would this idea work for Ariane: convert loads to atomic loads if an address below the cacheable threshold is detected, and upgrade peripherals to support the atomic extensions? I believe we should observe occasionally missed characters on the UART for the same reason. I have already converted it from VHDL to Verilog to make life easier for Verilator.

Sent from my iPhone

On 5 Apr 2019, at 15:49, Florian Zaruba notifications@github.com wrote:

Yeah, I meant PMAs. Always confusing this with PMPs. I've changed it in the comment above. @jrrk Yeah it would mean some hacky way around and disabling all sources of speculation (shortly disabling the interrupts and draining the pipeline before issuing the load). I also thought about using atomics for this, but then that would mean that the plic needs to support atomic transactions (not super difficult to implement). Tbh, that is a rather messy situation the hardware has to deal with, I think the "RISC-V" way would be to support PMAs in the core.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

jrrk commented 5 years ago

@zarubaf how about this for an idea to keep us going, generate a handshake pulse from the commit stage of Ariane for the specific case of a non-idempotent load (at first just recognising the address), and feed this via the user signal over AXI bus to the PLIC, which can then be modified to only mask the interrupt when it gets the acknowledgement.

zarubaf commented 5 years ago

I am very (very) much in favor of cleaning it up completely e.g. pretty much as Chris explained in his answer. It is really interesting that I never seem to saw a missed character though. I hope I can prioritize this. Unfortunately, we have a conference deadline this week.

jrrk commented 5 years ago

@zarubaf The Ethernet typically goes wrong with a half-life of about 10MBytes. We aren't seeing that degree of traffic on the UART, and you might not even notice at that level. I have seen the UART lockup completely by the same PLIC mechanism though.

jrrk commented 5 years ago

Another theory to be investigated, perhaps the claiming of the interrupt is directly related to timing of the next interrupt. If we add a deliberate delay in the PLIC before asserting interrupts again, it might improve safety.

jrrk commented 5 years ago

I notice that reads from the PLIC under Linux are qualified afterward with

fence i,r At the moment Ariane does not decode this combination and interprets it as a plain fence (flushing the D$). Can we use this to safely do a pipeline flush instead, and would that help ?

jrrk commented 5 years ago

I have implemented the fence i,r as an acknowledge to the PLIC. Of course this is a temporary hack. The Debian chroot works quite well now. The SD-Card interface still locks up after about 46000 interrupts, but I think this is a driver problem and nothing to do with the PLIC. Onward and upward.

zarubaf commented 5 years ago

I am glad to hear that it is working better for you now. Sorry for my radio silence. I've been implementing the non-speculative loads (here: https://github.com/pulp-platform/ariane/pull/213). I think it will still require some bug hunting as the change was unfortunately quite intrusive.

lowRISC / rv_plic

Simultaneous interrupts #2