Open splinedrive opened 1 day ago
Thank you for the report! Do you have the technical capability to run this simulation in a Questa GUI (or possibly VCS or Verilator) and look for a root cause? The UART is in src/uncore/uartPC16550D. It responds normally for us during Linux boot in hardware and simulation. Perhaps you are using it in a different way that exercises a bug in the UART itself, or a bug related to a UART interrupt handler?
Hello Mr. Harris, I am a big fan of yours. I just managed to get it running on my Artix-7 FPGA, and this is what I noticed: it happens within 10 minutes. There are several possible root causes: the kernel driver (very unlikely) or the UART (makes sense). What I don't understand is why the system reboots when I terminate tio.
I don’t have professional tools; I am a hobbyist who learned the basics from your two edX courses and even developed a Linux/XV6 SoC with the knowledge from those courses. Thank you very much.
This sounds like a tricky one to debug. I will try to reproduce on my end with the VCU108 board so we can have more debug signals. I'm very concerned about the reset.
Interesting, this isn't reproducing on the VCU108. I tried playing around with various baud rates. I'm trying the Arty A7 now. If I had to guess, you probably found a bug in a UART FIFO: it's reporting that the transmit FIFO is always full, so rather than writing multiple bytes per interrupt the driver is writing 1 byte, hence the slowdown.
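To put a rough number on that guess, here is a minimal sketch of how many transmit interrupts a burst of output needs when the driver can refill the whole FIFO per interrupt versus only one byte. The 16-byte FIFO depth is an assumption based on the standard 16550 part, and `tx_interrupts` is a hypothetical helper, not kernel or CVW code.

```python
import math

# Assumption: a 16550-style UART with a 16-byte transmit FIFO.
FIFO_DEPTH = 16

def tx_interrupts(n_bytes, bytes_per_irq):
    """Number of transmit interrupts needed to send n_bytes
    when each interrupt lets the driver queue bytes_per_irq bytes."""
    return math.ceil(n_bytes / bytes_per_irq)

burst = 4096  # arbitrary example burst size in bytes
healthy = tx_interrupts(burst, FIFO_DEPTH)  # FIFO refilled on every interrupt
degraded = tx_interrupts(burst, 1)          # only one byte accepted per interrupt

print(f"healthy: {healthy} interrupts, degraded: {degraded} interrupts")
print(f"interrupt load grows by a factor of {degraded // healthy}")
```

Under these assumptions the interrupt count scales with the FIFO depth, which would be consistent with a visible slowdown, though it does not by itself explain the reboot.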
After 8 minutes, it slows down. I also tested it with an external power supply, but it didn't fix the issue.
I am able to reproduce on the Arty A7, but unfortunately the VCU108 is not triggering the bug. I'm working on an ILA script to debug right now.
That’s wacky that it depends on the FPGA.
It's not too unreasonable to think it could depend on the FPGA, because the two boards have different hardware configurations. The Arty A7 has a 20 MHz clock and 256 MiB of DDR3 memory, and the VCU108 has a 50 MHz clock and 2 GiB of DDR4 memory. This inherently makes the timing of interrupts different, so it's possible we just can't hit the bug on the VCU108.
It's also possible splinedrive's suggestion is correct and it's related to memory since the VCU108 has more.
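As a back-of-the-envelope illustration of the clock difference described above, the sketch below computes how many CPU cycles elapse per UART byte on each board. It assumes the 115200 baud rate from the tio command in this thread and 8N1 framing (10 bits on the wire per byte); the framing is an assumption, and other baud rates were also tried.

```python
BAUD = 115_200       # from the tio command line in this thread
BITS_PER_BYTE = 10   # assumption: 8N1 framing (start + 8 data + stop)

CLOCKS_HZ = {
    "Arty A7": 20_000_000,
    "VCU108": 50_000_000,
}

def cycles_per_byte(clock_hz, baud=BAUD, bits=BITS_PER_BYTE):
    """CPU clock cycles that pass while one byte is shifted out."""
    return clock_hz * bits / baud

for board, hz in CLOCKS_HZ.items():
    print(f"{board}: about {cycles_per_byte(hz):.0f} cycles per UART byte")
```

The VCU108 gets 2.5 times as many cycles between bytes, so the interleaving of interrupt handling with other kernel activity can plausibly differ between the two boards.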
@rosethompson
I deleted my comment about the memory, but that still seems the most plausible. Still, wait a while and try running find / for longer than 10 minutes. However, the xxd /dev/urandom command always reproduces the issue.
Interesting: the UART's INTR bit is always high. This causes the OS to trap back into the trap handler immediately after exiting it, starving all other processes.
Even more interesting, the CPU is waiting on a wfi instruction while INTR is high.
But why does the system reboot when tio is closed?
That part I haven't reproduced. I've been using screen and it's not rebooting the CPU. What is the tio command you are using?
tio -m INLCRNL -o 1 /dev/serial/by-id/usb-Digilent_Digilent_USB_Device_210319AFED71-if01-port0 -b 115200
I think the problem is with either
Interesting. I've narrowed the failure down to this section of kernel code.
ffffffff801d3844 <plic_irq_eoi>:
ffffffff801d3844: 1141 addi sp,sp,-16
ffffffff801d3846: e422 sd s0,8(sp)
ffffffff801d3848: 0800 addi s0,sp,16
ffffffff801d384a: 0140000f fence w,o
ffffffff801d384e: 04cbd797 auipc a5,0x4cbd
ffffffff801d3852: 5ca7b783 ld a5,1482(a5) # ffffffff84e90e18 <plic_handlers+0x8>
ffffffff801d3856: 6518 ld a4,8(a0)
ffffffff801d3858: 0791 addi a5,a5,4
ffffffff801d385a: c398 sw a4,0(a5)
ffffffff801d385c: 6422 ld s0,8(sp)
ffffffff801d385e: 0141 addi sp,sp,16
ffffffff801d3860: 8082 ret
The sw normally clears the intInProgress bits, but for some reason this is not happening. I'm trying to isolate whether this is because the function is not being called at all or because the stack pointer is corrupted. There are at least two threads using this function, which complicates debugging. We can't just trigger on this instruction's address.
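For readers less familiar with the PLIC, the sw in the listing above is the completion write: the handler writes the claimed interrupt ID back to the claim/complete register. The toy model below is a sketch only. It ignores priorities and enables, `ToyPlicContext` and `UART_IRQ` are hypothetical names, and it is not the CVW RTL or the kernel driver. It shows why a missing or wrong completion write leaves a source stuck in progress.

```python
class ToyPlicContext:
    """Toy model of one PLIC context's claim/complete register (illustrative only)."""

    def __init__(self):
        self.pending = set()      # sources waiting to be claimed
        self.in_progress = set()  # sources claimed but not yet completed

    def raise_irq(self, source_id):
        # A source that is still in progress is not forwarded again.
        if source_id not in self.in_progress:
            self.pending.add(source_id)

    def claim(self):
        # Reading claim/complete returns a pending ID, or 0 if none.
        # (This toy picks the lowest ID and ignores priorities.)
        if not self.pending:
            return 0
        source_id = min(self.pending)
        self.pending.remove(source_id)
        self.in_progress.add(source_id)
        return source_id

    def complete(self, source_id):
        # Writing claim/complete; an ID that is not in progress is ignored.
        self.in_progress.discard(source_id)


UART_IRQ = 10  # hypothetical source number, chosen for illustration

good = ToyPlicContext()
good.raise_irq(UART_IRQ)
good.complete(good.claim())      # normal claim followed by completion

bad = ToyPlicContext()
bad.raise_irq(UART_IRQ)
claimed = bad.claim()
bad.complete(claimed ^ 0x1)      # stand-in for a wrong value reaching the sw

print("good in_progress:", good.in_progress)
print("bad in_progress:", bad.in_progress)
```

In the `bad` case the source never leaves the in-progress set, so later interrupts from it are never delivered in this model. Whether the real failure is a skipped write or a write carrying bad data is exactly what remains to be isolated.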
A couple of interesting things to note. First, the two threads calling the above function only experience the failure when the effective addresses of the ld at ffffffff801d3856 are a specific number of bytes apart. Sometimes the bug never occurs. For example, the following are runs of the same buildroot image, differing only by a reboot:
Thread 1 reads 0xffffaf800682a220 and Thread 2 reads 0xffffaf8006712e20, which are 0x117400 bytes apart, and this never crashes (at least after about an hour).
Thread 1 reads 0xffffaf800681a220 and Thread 2 reads 0xffffaf80066fae20, which are 0x11F400 bytes apart, and this does crash in less than 10 minutes.
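The spacing figures can be checked with a few lines of arithmetic. This sketch only recomputes the distances from the addresses quoted above; the labels are mine, and it makes no claim about why one spacing fails.

```python
# Effective addresses of the ld at ffffffff801d3856, copied from the two runs above.
runs = {
    "no failure": (0xFFFFAF800682A220, 0xFFFFAF8006712E20),
    "failure":    (0xFFFFAF800681A220, 0xFFFFAF80066FAE20),
}

def delta(pair):
    """Byte distance between the two threads' load addresses."""
    thread1, thread2 = pair
    return thread1 - thread2

deltas = {name: delta(pair) for name, pair in runs.items()}
for name, d in deltas.items():
    print(f"{name}: {d:#x} bytes apart")

# How much the spacing differs between the passing and the failing run.
spacing_gap = deltas["failure"] - deltas["no failure"]
print(f"difference between runs: {spacing_gap:#x}")
```

The two spacings differ by 0x8000 bytes, which may be a useful number to keep in mind when looking at page-table or cache behaviour.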
Second, after the slowdown only one thread executes the plic_irq_eoi function. This explains why the VCU108 is never experiencing this bug.
I have a hypothesis. I bet the HPTW messes up during an interrupt (or similar) and the address translation for the claim data, which should be written to the PLIC, gets corrupted.
After a large amount of output, the UART slows down, and commands like Ctrl-C or other controls stop working. Disconnecting the serial tool from the host triggers a reboot. You can trigger it by running xxd /dev/urandom and waiting. (Attachment: Screencast from 2024-12-02 17-17-13.webm)