openhwgroup / cvw

CORE-V Wally is a configurable RISC-V Processor associated with RISC-V System-on-Chip Design textbook. Contains a 5-stage pipeline, support for A, B, C, D, F, M and Q extensions, and optional caches, BP, FPU, VM/MMU, AHB, RAMs, and peripherals.

UART slows down #1170

Open splinedrive opened 1 day ago

splinedrive commented 1 day ago

After a large amount of output, the UART slows down, and commands like Ctrl-C or other controls stop working. Disconnecting the serial tool from the host triggers a reboot. You can trigger it by running xxd /dev/urandom and waiting. Screencast from 2024-12-02 17-17-13.webm

davidharrishmc commented 1 day ago

Thank you for the report! Do you have the technical capability to run this simulation in a Questa GUI (or possibly VCS or Verilator) and look for a root cause? The UART is in src/uncore/uartPC16550D. It responds normally for us during Linux boot in hardware and simulation. Perhaps you are using it in a different way that exercises a bug in the UART itself, or a bug related to a UART interrupt handler?

splinedrive commented 1 day ago

Hello Mr. Harris, I am a big fan of yours. I just managed to get it running on my Artix-7 FPGA, and this is what I noticed: the slowdown happens within 10 minutes. There are many possible root causes: the kernel driver (very unlikely) or the UART (makes sense). What I don't understand is why the system reboots when I terminate tio.

I don’t have professional tools; I am a hobbyist who learned the basics from your two edX courses and even developed a Linux/XV6 SoC with the knowledge from those courses. Thank you very much.

https://github.com/splinedrive/kianRiscV

rosethompson commented 1 day ago

This sounds like a tricky one to debug. I will try to reproduce on my end with the VCU108 board so we can have more debug signals. I'm very concerned about the reset.

rosethompson commented 1 day ago

Interesting, this isn't reproducing on the VCU108. I tried playing around with various baud rates. I'm trying the Arty A7 now. If I had to guess, you probably found a bug in a UART FIFO: it reports that the transmit FIFO is always full, so rather than writing multiple bytes per interrupt the driver writes 1 byte, hence the slowdown.

splinedrive commented 1 day ago

After 8 minutes, it slows down. I also tested it with an external power supply, but it didn't fix the issue.

rosethompson commented 1 day ago

I am able to reproduce on the Arty A7 but the VCU108 is not triggering the bug unfortunately. I'm working on an ILA script to debug right now.

davidharrishmc commented 1 day ago

That’s wacky that it depends on the FPGA.


rosethompson commented 1 day ago

It's not too unreasonable to think it could depend on the FPGA because the two FPGAs have different hardware configurations. The Arty A7 has a 20 MHz clock and 256 MiB of DDR3 memory, and the VCU108 has a 50 MHz clock and 2 GiB of DDR4 memory. This inherently makes the timing of interrupts different, so it's possible we just can't hit the bug on the VCU108.

It's also possible splinedrive's suggestion is correct and it's related to memory since the VCU108 has more.

splinedrive commented 1 day ago

@rosethompson

I deleted my comment about the memory, but that still seems the most plausible. Still, wait a while and try running find / for longer than 10 minutes. However, the xxd /dev/urandom command always reproduces the issue.

rosethompson commented 1 day ago

Interesting, the UART's INTR bit is always high. This causes the OS to trap back into the trap handler immediately on exiting, starving all other processes.

rosethompson commented 1 day ago

Even more interesting, the CPU is waiting on a wfi instruction while INTR is high.

splinedrive commented 1 day ago

But why does the system reboot when tio is closed?

rosethompson commented 1 day ago

That part I haven't reproduced. I've been using screen and it's not rebooting the CPU. What is the tio command you are using?

splinedrive commented 1 day ago

tio -m INLCRNL -o 1 /dev/serial/by-id/usb-Digilent_Digilent_USB_Device_210319AFED71-if01-port0 -b 115200

rosethompson commented 18 hours ago

I think the problem is with either

  1. How the driver is claiming the external interrupt. The PLIC's intInProgress bit 10 (UART interrupt) never goes low.
  2. Or the hardware has a bug which has caused the above condition to occur, and the hardware/software has no way to lower intInProgress.
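
The claim/complete handshake behind this can be sketched with a toy model. This is a minimal sketch of the behavior described in the RISC-V PLIC specification, not of Wally's PLIC implementation: the class `ToyPlic` and its methods are hypothetical names, priorities, enables and thresholds are ignored, and `in_progress` only stands in for the intInProgress bits mentioned above.

```python
class ToyPlic:
    """Toy model of PLIC claim/complete for a single hart context.

    Assumption: priorities, enables and thresholds are ignored, and the
    lowest pending source ID wins a claim.
    """

    def __init__(self):
        self.pending = set()      # sources waiting to be claimed
        self.in_progress = set()  # sources claimed but not yet completed

    def raise_irq(self, source):
        # Nothing new is forwarded for a source that is still in progress.
        if source not in self.in_progress:
            self.pending.add(source)

    def claim(self):
        # A claim read returns a source ID (0 if none) and clears its pending bit.
        if not self.pending:
            return 0
        source = min(self.pending)
        self.pending.remove(source)
        self.in_progress.add(source)
        return source

    def complete(self, source):
        # A completion write clears the in-progress state for that source.
        self.in_progress.discard(source)


UART_IRQ = 10  # source number taken from the comment above

plic = ToyPlic()

# Healthy path: claim, service, complete. Bit 10 goes low again.
plic.raise_irq(UART_IRQ)
claimed = plic.claim()
plic.complete(claimed)
healthy_state = set(plic.in_progress)

# Failure path: the completion write never reaches the PLIC.
plic.raise_irq(UART_IRQ)
plic.claim()
plic.raise_irq(UART_IRQ)  # later UART events are not forwarded
stuck_state = set(plic.in_progress)
stuck_pending = set(plic.pending)

print("healthy:", healthy_state, "stuck:", stuck_state, "pending:", stuck_pending)
```

In this model a single lost completion write is enough to leave source 10 in progress indefinitely, which is the condition both hypotheses above have in common.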

rosethompson commented 3 hours ago

Interesting. I've narrowed the failure down to this section of kernel code.

ffffffff801d3844 <plic_irq_eoi>:
ffffffff801d3844:   1141        addi    sp,sp,-16
ffffffff801d3846:   e422        sd      s0,8(sp)
ffffffff801d3848:   0800        addi    s0,sp,16
ffffffff801d384a:   0140000f    fence   w,o
ffffffff801d384e:   04cbd797    auipc   a5,0x4cbd
ffffffff801d3852:   5ca7b783    ld      a5,1482(a5) # ffffffff84e90e18 <plic_handlers+0x8>
ffffffff801d3856:   6518        ld      a4,8(a0)
ffffffff801d3858:   0791        addi    a5,a5,4
ffffffff801d385a:   c398        sw      a4,0(a5)
ffffffff801d385c:   6422        ld      s0,8(sp)
ffffffff801d385e:   0141        addi    sp,sp,16
ffffffff801d3860:   8082        ret

The sw normally clears the intInProgress bits, but for some reason this is not happening. I'm trying to isolate whether this is because the function is not being called at all or because the stack pointer is corrupted. There are at least two threads using this function, which complicates debugging. We can't just trigger on this instruction's address.
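
The addresses in the listing above can be checked by hand. The following is a minimal sketch, assuming standard RV64 semantics for auipc and the PLIC context layout in which the claim/complete register sits at offset 4; the helper names and the example base address are hypothetical and not taken from the kernel source.

```python
MASK64 = (1 << 64) - 1


def auipc(pc, imm20):
    """RV64 auipc: pc plus the 20-bit immediate shifted left by 12."""
    offset = (imm20 & 0xFFFFF) << 12
    if offset & 0x80000000:  # sign-extend the 32-bit offset to 64 bits
        offset -= 1 << 32
    return (pc + offset) & MASK64


AUIPC_PC = 0xFFFFFFFF801D384E  # address of the auipc in the listing
a5 = auipc(AUIPC_PC, 0x4CBD)
slot_addr = (a5 + 1482) & MASK64  # address read by the following ld

# Offset of the claim/complete register inside a PLIC context.
CONTEXT_CLAIM = 4


def eoi_store_address(loaded_base):
    """Address targeted by the sw, given the pointer loaded from slot_addr."""
    return (loaded_base + CONTEXT_CLAIM) & MASK64


print(hex(slot_addr))
# Hypothetical base value, only to show the +4 step performed by the addi.
print(hex(eoi_store_address(0x1000)))
```

The computed slot address agrees with the disassembler's annotation (plic_handlers+0x8). The addi of 4 followed by the sw is consistent with a completion write of the value loaded from 8(a0), so both loads have to return correct data for the completion to reach the PLIC.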

rosethompson commented 2 hours ago

A couple of interesting things to note. The two threads accessing the above function only experience the failure if the effective addresses of the ld at ffffffff801d3856 are a specific number of bytes apart. Sometimes the bug never occurs. For example, take the following runs of the same buildroot image, differing only by reboot:

  1. Thread 1 reads 0xffffaf800682a220 and Thread 2 reads 0xffffaf8006712e20, which are 0x117400 bytes apart, and this never crashes (at least after about an hour).

  2. Thread 1 reads 0xffffaf800681a220 and Thread 2 reads 0xffffaf80066fae20, which are 0x11F400 bytes apart, and this does crash in less than 10 minutes.
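
The two distances can be recomputed directly from the addresses quoted above. This is only a small arithmetic check; the run labels are descriptive and the variable names are made up for illustration.

```python
# (thread 1 address, thread 2 address) as quoted in the two runs above
runs = [
    ("run 1, no failure observed", 0xFFFFAF800682A220, 0xFFFFAF8006712E20),
    ("run 2, fails within 10 minutes", 0xFFFFAF800681A220, 0xFFFFAF80066FAE20),
]

deltas = [thread1 - thread2 for _, thread1, thread2 in runs]

for (label, _, _), delta in zip(runs, deltas):
    print(f"{label}: {delta:#x} bytes apart")

# How much the distance itself changes between the two runs.
gap_between_runs = deltas[1] - deltas[0]
print(f"difference between the two distances: {gap_between_runs:#x}")
```

Both distances match the figures quoted in the list, and the failing run's distance is 0x8000 bytes larger than the passing one.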

After the slowdown, the second interesting thing emerges: only one thread executes the plic_irq_eoi function. This explains why the VCU108 is never experiencing this bug.

rosethompson commented 2 hours ago

I have a hypothesis. I bet the hptw messes up during an interrupt (or similar) and the address translation for the claim data, which should be written to the PLIC, gets corrupted.