opencomputeproject / Time-Appliance-Project

Develop an end-to-end hypothetical reference model, network architectures, precision time tools, performance objectives and the methods to distribute, operate, monitor time synchronization within data center and much more...
MIT License
1.32k stars 101 forks source link

Loading driver cause system crash on version of Beta-V1 #95

Closed Satoshier closed 10 months ago

Satoshier commented 1 year ago

I am using the PCB version is Beta -V1.

When I reflash Factory_TimeCardOS.bin file from \Time-Appliance-Project-master\Time-Card\FPGA\Open-Source\Implementation\Xilinx\TimeCard\Binaries and executing modprobe, then system gets crashed, only reboot. (BTW, I don't make any changes on Vivado project)

Then I reflash Factory_TimeCard.bin file from \SOM\FPGA\Binaries and it works well.

It looks very wired, that is why? image

If I only reflash from SOM folder, I can not make any changes on Vivado, it will be lost funny.

thschaub commented 1 year ago

I can't reproduce your reported issue with the latest open-source version. Do you restart the PC after programming the SPI flash (cold start/complete power down)?

chaloz commented 11 months ago

I can't reproduce your reported issue with the latest open-source version. Do you restart the PC after programming the SPI flash (cold start/complete power down)?

I could reproduce this with CentOS, kernel 6.4, and Vivado 2023.1 (Vivado v2023.1 (64-bit) SW Build: 3865809 on Sun May 7 15:04:56 MDT 2023 IP Build: 3864474 on Sun May 7 20:36:21 MDT 2023 SharedData Build: 3865790 on Sun May 07 13:33:03 MDT 2023). SOM image works, image built with Open-Source implementation does not.

Similar with Ubuntu (Linux ocp 6.5.3 #1 SMP PREEMPT_DYNAMIC Mon Sep 18 10:41:37 IST 2023 x86_64 x86_64 x86_64 GNU/Linux) SOM image works, but Open-Source image built with Vivado 2023.1 does not. Kernel does not crash though, and I can see the following with lspcie:

18:00.0 Memory controller: Facebook, Inc. Device 0401                                                                                          
        Subsystem: Xilinx Corporation Device 0007                                                                                              
        Physical Slot: 1-1                                             
        Flags: bus master, fast devsel, latency 0                                                                                              
        Memory at f0000000 (32-bit, non-prefetchable) [size=32M]                                                                               
        Capabilities: [40] Power Management version 3                                                                                          
        Capabilities: [48] MSI: Enable- Count=1/32 Maskable- 64bit+                                                                            
        Capabilities: [60] Express Endpoint, MSI 00                                                                                            
        Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00

Note that I have changed Device ID to 0x0401 to be sure my image is loaded. With SOM image the lspcie shows:

        18:00.0 Memory controller: Facebook, Inc. Device 0400
        Subsystem: Xilinx Corporation Device 0007
        Physical Slot: 1-1
        Flags: bus master, fast devsel, latency 0, IRQ 32
        Memory at f0000000 (32-bit, non-prefetchable) [size=32M]
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable+ Count=1/32 Maskable- 64bit+
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00
        Kernel driver in use: ptp_ocp
        Kernel modules: ptp_ocp"

Clearly MSI is somehow disabled for Open-Source version.

Can someone please explain what is the main difference between SOM version and the Open-Source version? Mainly any difference in the PCIe MSI settings?

thschaub commented 11 months ago

Yeah, well the with the Device ID 0x0401 the ptp_ocp driver is not loaded and the interrupts are not enabled.

Why do you have 0x0401 as device id? Did you changed it manually?

In the TCL script which creates the PCIe config it's not like that: https://github.com/opencomputeproject/Time-Appliance-Project/blob/b22096f427e6ec01ff5e1c719547a2e447580aaf/Time-Card/FPGA/Open-Source/Implementation/Xilinx/TimeCard/Bd/TimeCardBd.tcl#L613-L625

I guess the crash has maybe something to do with the driver and already pending interrupts or similar. Do you see any interrupts on the SOM version?

There are some small differences with the available modules and possible interrupt sources in the SOM and Open-Source version. PCIe config is the same.

chaloz commented 11 months ago

Thanks @thschaub for such a quick reply. Yeah, I've changed the Device ID to 0x0401 just to see that I have successfully loaded the new bit file. Thanks for pointing out that driver needs to have the ID changed accordingly. Now the card is visible with lspcie.

Another issue, same to #73 is that the driver only gets allocated 1 MSI IRQ. @thschaub would you be able to tell what might be the issue and/or where to look? I have this behaviour with SOM and also clean Open-Source compiled bit stream.

lspcie shows that 32 MSIs are enabled (see below), but driver only gets one IRQ after calling pci_alloc_irq_vectors

Capabilities: [48] MSI: Enable+ Count=1/32 Maskable- 64bit+
JohnHay commented 11 months ago

I'm not sure if I should start a different issue.

I have a v8 card and if I boot with a usb flash with ubuntu-23.04-live-server-amd64.iso with the closed TimeCard.bin firmware, it will boot ok, but if I load the TimeCardOS.bin firmware, the kernel will panic.

In both cases, I did a cold boot after changing the firmware. I have not tried other linux versions. Here is a screenshot of the panic. I tried a few times and it paniced every time.

image

I only have a single GNSS installed, if that makes a difference.

I'm actually trying to write a FreeBSD driver for the card, but am a bit stuck with the MSI interrupts causing panics there, so I thought to check how linux was doing. I started with the OS version, which paniced, and I thought I had a smoking gun, but then linux did not panic with the SOM version. My FreeBSD code panic with both, but it is probably something I do wrong. :-)

Any suggestions on debugging MSI interrupts on the TimeCard would be useful.

@chaloz while searching I saw that for more than one MSI interrupt on linux, you need a kernel option CONFIG_IRQ_REMAP.

chaloz commented 11 months ago

@JohnHay Thanks for the tip, I think I have this option enabled: image

@JohnHay I am no expert on kernel drivers, but what I would suggest is to start with only one interrupt being enabled by the driver, and go on from there. If it works with a single interrupt, enable two and so on.

The line where you allocate interrupts is here: https://github.com/opencomputeproject/Time-Appliance-Project/blob/b22096f427e6ec01ff5e1c719547a2e447580aaf/Time-Card/DRV/ptp_ocp.c#L4868 So force err = 1; after the line above.

JohnHay commented 11 months ago

@thschaub are the MsiIrq Ips the same for the SOM and Open-Source version?

I had a look at the Open-Source MsiIrq.vhd and noticed that MsiVectorWidth_DatIn is not being used, so I was wondering how the interaction with the PCIe would be in the case where the OS sets only 1 interrupt. Like we see in @chaloz example and I have also seen when booting with ubuntu and the SOM firmware.

In the AXI PCIe documentation I found (DS820 April 24, 2012), in the MSI Interrupt section there is a sentence: The bridge ignores any bits set on the MSI_Vector_Num input signal, if they are not allocated in the Message Control Register.

I'm not sure if that refers to the capability part of that register or the part where the OS sets the number of vectors it allows/allocated. Would that mean if MsiIrq still sends all the vectors, just their lowest bit will be used? So will all the even vectors cause the same interrupt? What will then happen to the odd numbered ones?

Is there a way for the driver to determine if there are pending interrupts before enabling it, other than going through each function on the card?

JohnHay commented 11 months ago

I made some progress, but still have some questions. I followed @chaloz's suggestion and made a stripped down driver that only allocated a single msi interrupt with a handler that does nothing. That did not panic, but I found that it was getting 1.5M interrupts per second. That was with a cold booted system and the driver did nothing on the card. So I wrote a program that read and displayed the interrupt enable registers of all the functions on the card. Nothing was enabled. I tried it with both the SOM and Open-Source firmware and they behaved the same.

Here is the output of the program with the Open-Source firmware. It used the the CoreList to get a list of all the functions and only looked at the ones with an irq that is not 0xffffffff. The number after the / is the interrupt sensitivity.

TC Sig Generator 1, (irq 11/2), CNTRL 0, IRQ 0, IRQM 0
TC Sig Generator 2, (irq 12/2), CNTRL 0, IRQ 0, IRQM 0
TC Sig Generator 3, (irq 13/2), CNTRL 0, IRQ 0, IRQM 0
TC Sig Generator 4, (irq 14/2), CNTRL 0, IRQ 0, IRQM 0
TC Sig Timestamper GNSS PPS, (irq 1/2), CNTRL 0, IRQ 0, IRQM 0
TC Sig Timestamper 1, (irq 2/2), CNTRL 0, IRQ 0, IRQM 0
TC Sig Timestamper 2, (irq 6/2), CNTRL 0, IRQ 0, IRQM 0
TC Sig Timestamper FPGA PPS, (irq 0/2), CNTRL 0, IRQ 0, IRQM 0
TC Sig Timestamper 3, (irq 15/2), CNTRL 0, IRQ 0, IRQM 0
TC Sig Timestamper 4, (irq 16/2), CNTRL 0, IRQ 0, IRQM 0
Xilinx AXI IIC, (irq 7/3), GIE 0
Xilinx AXI IIC CLOCK, (irq 5/3), GIE 0
Xilinx AXI UART GNSS 1, (irq 3/3), IER 0
Xilinx AXI UART GNSS 2, (irq 4/3), IER 0
Xilinx AXI UART MAC, (irq 5/3), IER 0
Xilinx AXI UART reserved, (irq 10/3), IER 0
Xilinx AXI UART ext, (irq 19/3), IER 0
Xilinx AXI HWICAP, (irq 8/3), GIER 0
Xilinx AXI Quad SPI flash, (irq 9/2), DGIER 0

So I changed the driver to allocate 2 msi interrupts and a handler for each. That did not crash either, but I saw that both interrupts now received around 900k interrupts per second. So I increased it in powers of 2 until it allocated 32 msi interrupts and it still did not panic. I think it is because I installed a handler for each interrupt. In my original driver I only installed a handler for the functions was busy with.

From where the driver allocated 4 msi interrupts, I could see that the interrupts came from irq 1 and 2, "timestamper gnss pps" and "timestamper 1".

irq38: timecard0                       0          0
irq39: timecard0              2029101185     842591
irq40: timecard0              2105163016     874176
irq41: timecard0                       0          0
irq42: timecard0                       0          0
...
irq69: timecard0                       0          0

@thschaub do you have an idea why it would happen? I started to look through the firmware on git, but have not found anything yet.

chaloz commented 11 months ago

@JohnHay, great to see you're making progress. Without seeing your code anything I recommend is just a shot in the dark, but here we go: are you clearing the interrupts in the interrupt handler?

Like so: https://github.com/opencomputeproject/Time-Appliance-Project/blob/b22096f427e6ec01ff5e1c719547a2e447580aaf/Time-Card/DRV/ptp_ocp.c#L2045

JohnHay commented 11 months ago

@chaloz, I made a lot of progress. I found the source of the interrupts. There are 2 unused interrupts (17 and 18) on the board. In TimeCardBd.tcl they are connected to xlconstant_5_dout, but xlconstant_5 is defined without giving it a value, so its default is 1. If they were edge interrupts, that would only cause 2 interrupts, but also in TimeCardBd.tcl, they are defined as level interrupts with this bitmask: CONFIG.LevelInterrupt_Gen {0x000E05B8}

@thschaub can xlconstant_5 not just be removed and TC_MsiIrq_0/IrqIn17_DatIn and TC_MsiIrq_0/IrqIn18_DatIn be left open? Looking at the MsiIrq code, it will use a 0 for IrqIn pins that are not connected.

Also should the level mask not be changed to 0x000805B8, ie. clearing bits 17 and 18? That seems safer, except if you already know what irq 17 and 18 will be used for in the future, or is there another reason to keep them as level interrupts?

The last part of the puzzle (for now) was the panics in FreeBSD. I found that there is a mistake in the function native_apic_alloc_vectors() that did not align it on a 32 boundary, but only on a 16 boundary. So msi allocations up to 16 were fine. If you allocate 32 msi vectors and they are not aligned, the axi pcie blob will just ignore the lower 5 bits and write over it. In my case the vector received from the OS was 0x50, so the axi pcie would ignore bit 4 and create vectors from 0x40 to 0x5f. The interrupts outside of the range caused the panics. The axi pcie implementation is correct though, because the pci spec says that the vector has to be aligned to the number of bits that are requested.

So now I can return to the driver I'm busy with. :-)

thschaub commented 11 months ago

@JohnHay

Good catch, the xlconstant_5 should be definitly '0'. The reason why we have not left them open and we have a defined mask is because on a special HW revision (production version) where this IRQs are used: https://github.com/opencomputeproject/Time-Appliance-Project/tree/master/Time-Card/FPGA/Open-Source/Implementation/Xilinx/TimeCard_Production

Left open is also somehow undefined, but driving it with a constant allows to optimize the logic. So left something open is usually not the proper way in VHDL.

I will inform you once this bug is fixed.

thschaub commented 10 months ago

@JohnHay

Constant value correctly defined in the updated BD.tcl scripts. You can find the update in following commit: https://github.com/opencomputeproject/Time-Appliance-Project/commit/a330ec4c33a1a760dd3cbf221079e561027fcce5