spcl / gemm_hls

Scalable systolic array-based matrix-matrix multiplication implemented in Vivado HLS for Xilinx FPGAs.
BSD 3-Clause "New" or "Revised" License

[XRT] ERROR trying to run DGEMM build on Xilinx U280. #32

Open A-Kibats opened 12 months ago

A-Kibats commented 12 months ago

Hi,

Running a DGEMM build with a matrix size of 16384 produces the following errors:

[XRT] ERROR: unable to sync BO: Input/output error

[XRT] ERROR: Profiling info not available, make sure profiling is enabled

[XRT] ERROR: Profiling info not available, make sure profiling is enabled

[Kernel executed in 1.84466e+10 seconds, corresponding to a performance of 4.76841e-07 GOp/s.

[XRT] ERROR: unable to sync BO: Input/output error

terminate called after throwing an instance of 'xrt_xocl::error'

  what():  event 0 never submitted

This seems to occur only when the card already has the bitstream loaded: after resetting the card with xbutil reset, the first run does not produce the error.

Smaller matrices seem to work fine, with 12288 being the largest size that worked reliably (sizes tested: 4k, 8k, 12k, and 16k).
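As a sanity check on the log above, the reported rate looks like it is simply derived from the bogus kernel time. A minimal sketch, assuming the benchmark counts 2*N^3 operations for a square N x N multiplication (the helper name gops is hypothetical, not taken from gemm_hls):

```python
def gops(n: int, seconds: float) -> float:
    """Rate in GOp/s, ASSUMING 2*n^3 operations for an n x n x n GEMM."""
    return 2 * n**3 / seconds / 1e9


# Kernel time as printed in the log for the failing 16384 run.
reported_seconds = 1.84466e10

print(f"{gops(16384, reported_seconds):.5e} GOp/s")
```

Under that assumption the result agrees with the 4.76841e-07 GOp/s in the log, so the performance figure carries no information beyond the invalid timing value.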

CMakeLists.txt was kept at close to default settings, except that the card string was changed to "xilinx_u280_xdma_201920_3". The build was configured for DGEMM following the README.md in gemm_hls:

cmake ../ -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DMM_DATA_TYPE=double -DMM_PARALLELISM_N=32 -DMM_PARALLELISM_M=4 -DMM_MEMORY_TILE_SIZE_N=512 -DMM_MEMORY_TILE_SIZE_M=512

Here is the system configuration as given by xbutil examine:

System Configuration
  OS Name              : Linux
  Release              : 3.10.0-1160.99.1.el7.x86_64
  Version              : #1 SMP Wed Sep 13 14:19:20 UTC 2023
  Machine              : x86_64
  CPU Cores            : 128
  Memory               : 257749 MB
  Distribution         : CentOS Linux 7 (Core)
  GLIBC                : 2.17
  Model                : ProLiant DL385 Gen10 Plus

XRT
  Version              : 2.11.634
  Branch               : 2021.1
  Hash                 : 5ad5998d67080f00bca5bf15b3838cf35e0a7b26
  Hash Date            : 2021-06-09 05:08:58
  XOCL                 : 2.11.634, 5ad5998d67080f00bca5bf15b3838cf35e0a7b26
  XCLMGMT              : 2.11.634, 5ad5998d67080f00bca5bf15b3838cf35e0a7b26

Devices present
  [0000:c3:00.1] : xilinx_u280_xdma_201920_3 

Any help would be much appreciated. Cheers, Andrew.

A-Kibats commented 12 months ago

Additionally, during 16384 runs I'm now getting soft lockup warnings on the CPU when it reaches kernel execution:

Message from syslogd@nextgenio-amd01 at Dec  1 11:37:38 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [RunHardware.exe:85487]

Message from syslogd@nextgenio-amd01 at Dec  1 11:37:38 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [RunHardware.exe:85489]

Message from syslogd@nextgenio-amd01 at Dec  1 11:38:06 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [RunHardware.exe:85487]

Message from syslogd@nextgenio-amd01 at Dec  1 11:38:06 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [RunHardware.exe:85489]

Again this does not occur on smaller matrix sizes.

A-Kibats commented 12 months ago

Hi again,

We have made further progress on the issue. With an SGEMM build on a U250 card, we encounter the same XRT error when running 16k matrices.

Here is the system configuration as given by xbutil examine:

System Configuration
  OS Name              : Linux
  Release              : 3.10.0-1160.99.1.el7.x86_64
  Version              : #1 SMP Wed Sep 13 14:19:20 UTC 2023
  Machine              : x86_64
  CPU Cores            : 128
  Memory               : 257749 MB
  Distribution         : CentOS Linux 7 (Core)
  GLIBC                : 2.17
  Model                : ProLiant DL385 Gen10 Plus

XRT
  Version              : 2.11.634
  Branch               : 2021.1
  Hash                 : 5ad5998d67080f00bca5bf15b3838cf35e0a7b26
  Hash Date            : 2021-06-09 05:08:58
  XOCL                 : 2.11.634, 5ad5998d67080f00bca5bf15b3838cf35e0a7b26
  XCLMGMT              : 2.11.634, 5ad5998d67080f00bca5bf15b3838cf35e0a7b26

Devices present
  [0000:c3:00.1] : xilinx_u280_xdma_201920_3 

We noticed that both our U250 and U280 cards fail test 7 when using xbutil validate:

Test 7 [0000:c3:00.1]     : Bandwidth kernel 
    Error(s)              : 
                            terminate called after throwing an instance of
                            'std::runtime_error'
                              what():  Multiple instances of XRT core shim library
                            detected, only one
                            can be loaded at any given time.  Please check if
                            application is
                            explicity linked with XRT core library (xrt_core,
                            xrt_hwemu, or
                            xrt_swemu) and remove this linking. Use XCL_EMULATION_MODE
                            set to
                            either hw_emu or sw_emu if running in emulation mode.
    Test Status           : [FAILED]

Could this possibly be the source of the issue?

definelicht commented 12 months ago

Hey! Since this only occurs with large matrix sizes and throws an I/O error, it could be related to the size of the memory transfer. If my math is right, transferring three 16384x16384 double-precision matrices amounts to about 6.4 GB, which I suppose could be an issue for the virtual HBM channels on the U280 (I believe the individual virtual channels have smaller capacity than this), but should work fine in DDR 🤔
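The arithmetic behind that estimate can be sketched as follows. This is only a rough check, assuming three square matrices (A, B, C) of the same size and element type; the 256 MiB per-pseudo-channel figure for the U280 HBM is an assumption for illustration, and the helper names are hypothetical:

```python
GIB = 1024**3
MIB = 1024**2

# ASSUMPTION: capacity of a single U280 HBM pseudo-channel.
HBM_PSEUDO_CHANNEL_BYTES = 256 * MIB


def matrix_bytes(n: int, elem_bytes: int) -> int:
    """Size in bytes of one dense n x n matrix."""
    return n * n * elem_bytes


def footprint(n: int, elem_bytes: int, num_matrices: int = 3) -> int:
    """Total bytes for A, B and C, assuming all three are n x n."""
    return num_matrices * matrix_bytes(n, elem_bytes)


for name, elem_bytes in (("SGEMM (float)", 4), ("DGEMM (double)", 8)):
    for n in (4096, 8192, 12288, 16384):
        one = matrix_bytes(n, elem_bytes)
        total = footprint(n, elem_bytes)
        fits = one <= HBM_PSEUDO_CHANNEL_BYTES
        print(f"{name:15s} N={n:5d}: one matrix {one / GIB:6.3f} GiB, "
              f"all three {total / 1e9:6.2f} GB, "
              f"one matrix fits in a pseudo-channel: {fits}")
```

For DGEMM at N=16384 this gives 2 GiB per matrix and about 6.44 GB in total, which matches the estimate above. It does not by itself explain the SGEMM behaviour, since float matrices of the same size need half that.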

Are you completely sure the issue you see is identical between the U280 and the U250, or is there any chance that they are separate issues?

A-Kibats commented 11 months ago

Hi, thanks for the reply.

AMD/Xilinx also suggested that this is a memory issue, and we're in the process of checking the usage.

The issue is not completely identical: SGEMM works on the U280 but fails on the U250, with the same error that DGEMM produces on the U280. I've checked Config.h in the build directories, and SGEMM was built with the same parameters for both cards, so why the U250 gives the same XRT error is a mystery at the moment.

definelicht commented 11 months ago

Any news @A-Kibats?