Open A-Kibats opened 12 months ago
Additionally, during 16384 runs I'm now getting soft lock-up warnings on the CPU when it reaches kernel execution:
Message from syslogd@nextgenio-amd01 at Dec 1 11:37:38 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [RunHardware.exe:85487]
Message from syslogd@nextgenio-amd01 at Dec 1 11:37:38 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [RunHardware.exe:85489]
Message from syslogd@nextgenio-amd01 at Dec 1 11:38:06 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [RunHardware.exe:85487]
Message from syslogd@nextgenio-amd01 at Dec 1 11:38:06 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [RunHardware.exe:85489]
Again this does not occur on smaller matrix sizes.
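To make it easier to see which threads are stalling, the watchdog lines above can be grouped by CPU and PID. This is only a minimal sketch: the regular expression is written against the four lines quoted in this comment, and the function name `summarize_lockups` is hypothetical, not part of any tool mentioned in this thread.

```python
import re
from collections import defaultdict

# Matches the watchdog lines exactly as they appear in the log quoted above.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<proc>[^:\]]+):(?P<pid>\d+)\]"
)

def summarize_lockups(lines):
    """Map (cpu, process, pid) to the list of reported stall durations in seconds."""
    stalls = defaultdict(list)
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            key = (int(match["cpu"]), match["proc"], int(match["pid"]))
            stalls[key].append(int(match["secs"]))
    return dict(stalls)

# The four kernel messages from the log above.
log = [
    "kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [RunHardware.exe:85487]",
    "kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [RunHardware.exe:85489]",
    "kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [RunHardware.exe:85487]",
    "kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [RunHardware.exe:85489]",
]

for key, seconds in summarize_lockups(log).items():
    print(key, seconds)
```

Grouped this way, the log shows the same two RunHardware.exe PIDs (85487 on CPU 5 and 85489 on CPU 6) reported at both timestamps, rather than different threads each time.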
Hi again,
We have made further progress on the issue. After building SGEMM for a U250 card, we encounter the same XRT error when running 16k matrices.
Here is the system configuration as given by xbutil examine:
System Configuration
OS Name : Linux
Release : 3.10.0-1160.99.1.el7.x86_64
Version : #1 SMP Wed Sep 13 14:19:20 UTC 2023
Machine : x86_64
CPU Cores : 128
Memory : 257749 MB
Distribution : CentOS Linux 7 (Core)
GLIBC : 2.17
Model : ProLiant DL385 Gen10 Plus
XRT
Version : 2.11.634
Branch : 2021.1
Hash : 5ad5998d67080f00bca5bf15b3838cf35e0a7b26
Hash Date : 2021-06-09 05:08:58
XOCL : 2.11.634, 5ad5998d67080f00bca5bf15b3838cf35e0a7b26
XCLMGMT : 2.11.634, 5ad5998d67080f00bca5bf15b3838cf35e0a7b26
Devices present
[0000:c3:00.1] : xilinx_u280_xdma_201920_3
We noticed that both our U250 and U280 cards fail test 7 when using xbutil validate:
Test 7 [0000:c3:00.1] : Bandwidth kernel
Error(s) : terminate called after throwing an instance of 'std::runtime_error'
what(): Multiple instances of XRT core shim library detected, only one can be loaded at any given time. Please check if application is explicity linked with XRT core library (xrt_core, xrt_hwemu, or xrt_swemu) and remove this linking. Use XCL_EMULATION_MODE set to either hw_emu or sw_emu if running in emulation mode.
Test Status : [FAILED]
Could this possibly be the source of the issue?
Hey! Since this only occurs with large matrix sizes and throws an I/O error, it could be related to the size of the memory transfer. If my math is right, transferring 3x 16384x16384 matrices amounts to 6.4 GB, which I suppose could be an issue for the virtual HBM channels on the U280 (I believe the individual virtual channels have smaller capacity than this), but should work fine in DDR 🤔
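The arithmetic behind that estimate can be checked with a short sketch. It assumes square n x n matrices, three of them (A, B and C), at 4 bytes per element for SGEMM and 8 bytes for DGEMM. The 256 MiB pseudo-channel size is an assumption taken from the commonly quoted U280 HBM layout (8 GB split over 32 pseudo-channels), not something stated in this thread, and `gemm_footprint_bytes` is a hypothetical helper name.

```python
def gemm_footprint_bytes(n, bytes_per_element, num_matrices=3):
    """Bytes needed to hold num_matrices square n x n matrices."""
    return num_matrices * n * n * bytes_per_element

# Assumption (not from the thread): one U280 HBM pseudo-channel holds 256 MiB.
HBM_PSEUDO_CHANNEL_BYTES = 256 * 1024**2

# Matrix sizes from the test range reported in this issue.
for n in (4096, 8192, 12288, 16384):
    for name, width in (("SGEMM", 4), ("DGEMM", 8)):
        total = gemm_footprint_bytes(n, width)
        per_matrix = total // 3
        channels = per_matrix / HBM_PSEUDO_CHANNEL_BYTES
        print(f"{name} n={n}: {total / 1e9:.2f} GB total, "
              f"{channels:.2f} pseudo-channels per matrix")
```

Under these assumptions the 6.4 GB figure corresponds to double precision (DGEMM). The same matrices in single precision (SGEMM) take half of that, roughly 3.2 GB.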
Are you completely sure the issue you see is identical between the U280 and the U250, or is there any chance that they are separate issues?
Hi, thanks for the reply.
AMD/Xilinx also suggested that this is a memory issue, and we're in the process of checking the memory usage.
The issue is not completely identical: SGEMM works on the U280 but fails on the U250, where it hits the same issue that DGEMM has on the U280. I've checked the Config.h in both build directories, and SGEMM was built with the same parameters for both cards, so why the U250 gives the same XRT issue is a mystery at the moment.
Any news @A-Kibats?
Hi,
Running a matrix with the size of 16384 on a DGEMM build returns the following errors:
This seems to occur only when the card already has the bitstream loaded: after resetting the card with xbutil reset, the first run does not give the same error. Smaller matrices seem to work fine, with 12288 being the largest size that reliably worked (the test range was 4k, 8k, 12k, 16k).
CMakeLists.txt was kept close to the default settings, with two exceptions: the card string was changed to "xilinx_u280_xdma_201920_3", and the build was modified to produce DGEMM, based on the README.md found within gemm_hls:
Here is the system configuration as given by xbutil examine:
Any help would be much appreciated. Cheers, Andrew.