open-power / snap

CAPI SNAP Framework Hardware and Software
Apache License 2.0

bus error on deployment of an action that passes pslse #882

Open diamantopoulos opened 5 years ago

diamantopoulos commented 5 years ago

Dear team,

I'm facing the following problem: I'm developing an action (GEMM, not one of the example actions) that passes RTL simulation with pslse, but when I deploy it on the 9V3 card I get a bus error. The output of dmesg is appended below.

Some images have been successfully tested on the card, but others fail with this "bus error". While I'm experimenting and debugging, I've added this issue here in case there is a known way to debug further. Since pslse is OK, it's hard to debug on the card.

Please note that when I get a "bus error", the card usually switches to the factory image and the system is not affected. But sometimes it causes the system to reboot. (All images are within -200 ps WNS.)

On the terminal:

[did@zhcc067 /u/did]$ sudo /dataL/did/snap/actions/hls_gemm/sw/snap_gemm -n 512 -k 64 -m 512
INFO: AXI/Cache lines for a  : 2048/1024
INFO: AXI/Cache lines for b  : 2048/1024
INFO: AXI/Cache lines for IN : 4096/2048
INFO: AXI/Cache lines for OUT: 16384/8192
INFO: size_n=512, size_k=64, size_m=512
INFO: printArray 512x64=32768
INFO: printArray 64x512=32768
PARAMETERS:
  input:       none
  output:      none
  type_in:     0 HOST_DRAM
  addr_in:     00007fff83de0000
  type_out:    0 HOST_DRAM
  addr_out:    00007fff83cc0000
  size_in/out: 0000000c
  prepare gemm job of 48 bytes size
Bus error
[did@zhcc067 /u/did]$ 

dmesg output:

[21171.613018] Harmless Hypervisor Maintenance interrupt [Recovered]
[21171.613482]  Error detail: CAPP recovery process is in progress
[21171.613700]  HMER: 8040000000000000
[21172.032538] Harmless Hypervisor Maintenance interrupt [Recovered]
[21172.038746]  Error detail: CAPP recovery process is in progress
[21172.041086] EEH: Fenced PHB#0 detected, location: N/A
[21172.048256] EEH: This PCI device has failed 1 times in the last hour
[21172.048257] EEH: Notify device drivers to shutdown
[21172.048265] cxl-pci 0000:01:00.0: reflashing, so opting out of EEH!
[21172.048302] EEH: Collect temporary log
[21172.048304] PHB4 PHB#0 Diag-data (Version: 1)
[21172.048305] brdgCtl:    00000002
[21172.048306] RootSts:    00000040 00402000 e1010008 00100107 00000000
[21172.048308] nFir:       0000008000000000 0030001c00000000 0000008000000000
[21172.048309] PhbSts:     0000001800000000 0000001800000000
[21172.048310] Lem:        0000000100000100 0000000000000000 0000000000000100
[21172.048312] PhbErr:     0000048000000000 0000040000000000 2148000098000240 a008400000000000
[21172.048314] RxeMrgErr:  0000000000000001 0000000000000001 0000000000000000 0000000000000000
[21172.048315] RegbErr:    0050000000000000 0010000000000000 8800003c00000000 0000000000000000
[21172.048319] EEH: Reset with hotplug activity
[21172.048435] pci_bus 0008:00: busn_res: [bus 00] is released
[21172.048650] cxl afu0.0: Deactivating AFU directed mode
[21172.084226] cxl afu0.0: PSL Purge called with link down, ignoring
[21172.084648] iommu: Removing device 0000:01:00.0 from group 0
[21172.084874] pci_bus 0000:01: busn_res: [bus 01] is released
[21172.086292]  HMER: 8040000000000000
[21175.588118] EEH: Sleep 5s ahead of complete hotplug
[21180.628196] pci 0000:00:00.0: [1014:04c1] type 01 class 0x060400
[21180.628270] pci 0000:00:00.0: PME# supported from D0 D3hot D3cold
[21180.628435] pci 0000:01:00.0: [1014:0477] type 00 class 0x1200ff
[21180.628463] pci 0000:01:00.0: reg 0x10: [mem 0x6000000000000-0x600000fffffff 64bit pref]
[21180.628474] pci 0000:01:00.0: reg 0x18: [mem 0x6000010000000-0x600001001ffff 64bit pref]
[21180.628486] pci 0000:01:00.0: reg 0x20: [mem 0x00000000-0x3fffffffff 64bit pref]
[21180.628634] pci 0000:00:00.0: PCI bridge to [bus 01]
[21180.628797] pci 0000:00:00.0:   bridge window [io  0x0000-0x0fff]
[21180.628822] pci 0000:01:00.0: disabling BAR 4: [mem size 0x4000000000 64bit pref] (bad alignment 0x4000000000)
[21180.629022] pci 0000:00:00.0: BAR 15: assigned [mem 0x6000000000000-0x600001fffffff 64bit pref]
[21180.629194] pci 0000:01:00.0: BAR 0: assigned [mem 0x6000000000000-0x600000fffffff 64bit pref]
[21180.629366] pci 0000:01:00.0: BAR 2: assigned [mem 0x6000010000000-0x600001001ffff 64bit pref]
[21180.629555] pci 0000:00     : [PE# 1fe] Secondary bus 0 associated with PE#1fe
[21180.629743] pci 0000:01     : [PE# 00] Secondary bus 1 associated with PE#0
[21180.629923] pci 0000:01     : [PE# 00] Setting up 32-bit TCE table at 0..80000000
[21180.632570] pci 0000:01     : [PE# 00] Setting up window#0 0..7fffffff pg=1000
[21180.632706] pci 0000:01     : [PE# 00] Enabling 64-bit DMA bypass
[21180.632830] iommu: Adding device 0000:01:00.0 to group 12, default domain type -1
[21180.632996] pci 0000:00:00.0: PCI bridge to [bus 01]
[21180.633094] pci 0000:00:00.0:   bridge window [mem 0x6000000000000-0x6003fbfffffff 64bit pref]
[21180.633770] pcieport 0000:00:00.0: enabling device (0105 -> 0107)
[21180.634050] cxl-pci 0000:01:00.0: Device uses a PSL9
[21180.634161] cxl-pci 0000:01:00.0: enabling device (0140 -> 0142)
[21180.635062] pci 0000:01     : [PE# 00] Switching PHB to CXL
[21180.635452] pci 0000:01     : [PE# 00] Switching PHB to CXL
[21180.635757] cxl-pci 0000:01:00.0: PCI host bridge to bus 0008:00
[21180.635875] pci_bus 0008:00: root bus resource [bus 00]
[21180.635979] pci_bus 0008:00: busn_res: [bus 00] end is updated to ff
[21180.635987] pci 0008:00:00.0: [1014:0632] type 00 class 0x120000
[21180.636058] pci_bus 0008:00: busn_res: [bus 00-ff] end is updated to 00
[21180.636072] cxl afu0.0: Activating AFU directed mode
[21180.636724] EEH: Notify device driver to resume
bmesnet commented 5 years ago

Hi @diamantopoulos, interesting case, similar to some very rare cases we have seen. This would be a missing response to a request sent by the PHB to the PCIe side. We are trying to understand this issue, but it is very difficult to reproduce. Did this happen only once with the image you created? Are you using the latest release of the master branch of snap? I would be interested to know whether you face this issue again when rebuilding this image under the same conditions. Thanks

luyong6 commented 5 years ago

Would you carefully check whether your AFU has accessed a memory address that exceeds the memory buffer allocated in software?

Alignment is another possible cause. Check where you call malloc(): has the buffer been aligned to a cacheline (128 B), or even to the page size?

luyong6 commented 5 years ago

But sometimes, it causes the system to reboot.

This is because there is a threshold limiting how many PCIe faults are allowed: sudo cat /sys/kernel/debug/powerpc/eeh_max_freezes. The default, as I remember, is 5. That means that after 5 errors the system decides something is wrong and either reboots or removes the card.

diamantopoulos commented 5 years ago

Would you carefully check whether your AFU has accessed a memory address that exceeds the memory buffer allocated in software?

Alignment is another possible cause. Check where you call malloc(): has the buffer been aligned to a cacheline (128 B), or even to the page size?

Thanks @luyong6 for the hints. It shouldn't be either of the two cases above.

Indeed, the value in /sys/kernel/debug/powerpc/eeh_max_freezes is 0x5. I understand that since it is a PCIe fault, the kernel is responsible for taking action. However, from an application developer's perspective, it is uncomfortable to have the system reboot because of a faulty bitstream (one that still passes pslse and has WNS < 200 ps); in our case the system is part of a cloud, and many users are affected by suddenly losing a node. Removing the card would be the preferred option, but it does not happen often (most of the time it is a reboot).

@bmesnet thanks for the hints. It's good to know it's a rare but known issue (so it is not necessarily a problem with our witherspoon setup).

Did this happen only once on the image you have created?

No, it happens with several images in which I combine different architectural parameters, e.g. unrolling/pipelining depth, memory prefetching, etc. However, these parameters do not affect the action's I/O protocol towards CAPI, or vice versa (apart from the in/out sizes). My only concern is that I have a dataflow architecture that reads from AXI as long as the accelerator consumes data, and the FIFO depth may not be enough to store the burst reads. This is why I always test with pslse to verify the functionality. I'm in the process of testing different combinations to understand when the problem arises.

Are you using the latest release of master branch of snap?

I'm using 4defea27569231f4e579c36bf3d0eac842081025. I'll test with the latest and report back.