Open diamantopoulos opened 5 years ago
Hi @diamantopoulos, this is an interesting case, similar to some very rare cases we have seen. It would be a missing response to a request sent by the PHB to the PCIe. We are trying to understand this issue, but it is very difficult to reproduce. Did this happen only once on the image you have created? Are you using the latest release of master branch of snap? I would be interested to know whether you face this issue again when rebuilding this image under the same conditions. Thanks
Would you carefully check whether your AFU has accessed a memory address beyond the memory buffer that was allocated in software?
It could also be caused by alignment. Check where you call malloc(): is the buffer aligned to a cacheline (128B), or even to the page size?
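The first check suggested above can be sketched on the host side as a range test on whatever address and length are handed to the action. This is only an illustration: `range_within` is a hypothetical helper, not part of SNAP or libcxl, and it can only validate values the software knows about, not what the AFU actually does on the bus.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a SNAP API): returns 1 if the access
 * [addr, addr + len) lies entirely inside the buffer [base, base + size),
 * and 0 otherwise. The comparisons are arranged so that no sum can overflow.
 */
int range_within(uint64_t base, uint64_t size, uint64_t addr, uint64_t len)
{
    if (addr < base)
        return 0;                 /* starts before the buffer */

    uint64_t offset = addr - base;
    if (offset > size)
        return 0;                 /* starts after the end of the buffer */

    return len <= size - offset;  /* the remaining space must cover len */
}
```

A call such as `range_within((uint64_t)(uintptr_t)buf, buf_size, job_addr, job_len)` before starting the action would flag a job whose source or destination range spills past the allocation; `buf`, `buf_size`, `job_addr` and `job_len` are placeholder names here.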
But sometimes, it causes the system to reboot.
This is because there is a threshold that limits how many PCIe faults are allowed.
sudo cat /sys/kernel/debug/powerpc/eeh_max_freezes
As far as I remember, the default is 5.
That means that once 5 errors have happened, the system concludes that something is wrong and either reboots or removes the card.
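For monitoring scripts or test harnesses, the same threshold can be read programmatically. The sketch below assumes only the debugfs path quoted above and that the file holds a single integer (the thread reports it as 0x5); `read_sysfs_long` is a hypothetical helper name, and reading the file normally requires root.

```c
#include <stdio.h>

/* Path as quoted in the thread; needs root and a mounted debugfs. */
#define EEH_MAX_FREEZES_PATH "/sys/kernel/debug/powerpc/eeh_max_freezes"

/*
 * Hypothetical helper: reads one integer from a file.
 * "%li" accepts decimal as well as 0x-prefixed hexadecimal input.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
int read_sysfs_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int parsed = (fscanf(f, "%li", out) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}
```

Calling `read_sysfs_long(EEH_MAX_FREEZES_PATH, &value)` before a test run makes it possible to log the threshold next to each bitstream result.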
Would you carefully check whether your AFU has accessed a memory address beyond the memory buffer that was allocated in software?
It could also be caused by alignment. Check where you call malloc(): is the buffer aligned to a cacheline (128B), or even to the page size?
Thanks @luyong6 for the hints. It shouldn't be either of the two cases above:
I have checked that the AFU does not access any unallocated space: apart from inspecting the code, I ran a memory debugger (valgrind --tool=memcheck) on both the sw action and the hw action (no memory leaks, errors, or warnings).
I also use the snap_memcopy wrapper to make sure the chunk is aligned:
static inline void *snap_malloc(size_t size)
{
	unsigned int page_size = sysconf(_SC_PAGESIZE);
	return memalign(page_size, SNAP_ROUND_UP(size, SNAP_MEMBUS_WIDTH));
}
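To confirm at run time that a buffer really meets both alignment requirements discussed in this thread, a check along these lines could be added next to the allocation. It is a standalone sketch, not SNAP code: `is_aligned`, `round_up` and `alloc_page_aligned` are illustrative names, `posix_memalign` stands in for `memalign`, and the 128-byte cacheline value is taken from the comment above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign() and sysconf() */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define CACHELINE_BYTES 128       /* cacheline size mentioned in the thread */

/* Returns 1 if ptr is a multiple of alignment (a power of two), else 0. */
int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Rounds size up to the next multiple of a power-of-two alignment. */
size_t round_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}

/*
 * Page-aligned allocation whose length is padded to whole cachelines.
 * Returns NULL on failure; release the buffer with free().
 */
void *alloc_page_aligned(size_t size)
{
    void *buf = NULL;
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size <= 0)
        return NULL;
    if (posix_memalign(&buf, (size_t)page_size,
                       round_up(size, CACHELINE_BYTES)) != 0)
        return NULL;
    return buf;
}
```

Asserting `is_aligned(buf, CACHELINE_BYTES)` on every pointer passed to the action, including pointers computed as an offset into a larger buffer, catches the case where the base allocation is aligned but the address actually handed to the hardware is not.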
Indeed, the number in /sys/kernel/debug/powerpc/eeh_max_freezes is 0x5. I understand that since it is a PCIe fault, the kernel is responsible for taking an action. However, from an application developer's perspective, it's uncomfortable to have the system reboot because of a faulty bitstream (one that still passes pslse and has WNS<200). In our case, the system is part of a cloud, and many users are affected by suddenly losing a node. Removing the card is the preferred option, but it does not happen often (most of the time it is a reboot).
@bmesnet thanks for the hints - it's good to know it's a rare but known issue (so it is not necessarily an issue with our Witherspoon setup)
Did this happen only once on the image you have created?
No, it happens in several images in which I combine different architectural parameters, e.g. unrolling/pipelining depth, memory prefetching etc. However, the parameters do not affect the I/O protocol of the action to capi and vice-versa (apart from the size in/out). My only concern is that I have a dataflow architecture which reads from AXI as long as the accelerator consumes data and maybe the fifo depth is not enough to store the burst reads. However, this is why I always test with pslse to verify the functionality. I'm in the process of testing different combinations to understand when the problem arises.
Are you using the latest release of master branch of snap?
I'm using 4defea27569231f4e579c36bf3d0eac842081025 . I'll test with the latest and report.
Dear team,
I'm facing the following problem: I'm developing an action (GEMM, not one of the example actions) that passes RTL simulation with pslse, but when I deploy the action on the 9V3 card I get a bus error. The output of dmesg is appended below.
Some images have been successfully tested on the card but some fail with this "bus error" output. While I'm experimenting and debugging, I've added this issue here, in case there is a "known" way to debug more. Since pslse is OK, it's hard to debug on the card.
Please note that when I get a "bus error", the card usually switches to the factory image and the system is not affected. But sometimes, it causes the system to reboot. (All images are within -200 ps WNS.)
On the terminal:
dmesg output: