Thanks for this @robert-mijakovic. I cannot reproduce this locally, but have some ideas on how to do so. Just confirming, you ran into this with all reference designs? Or just GZIP?
Yes, I ran into this with all reference designs.
Darn. Sorry about that. This seems to be related to the s10mx BSP: I cannot recreate this with other BSPs locally. Do you have access to any other BSP?
No, not really. We only have the BSP for the card that we have, the Bittware 520N-MX. Aside from the 8 reference designs, I have another example from Bittware, a 2D FFT, but it doesn't exhibit this issue.
Okay, no worries. I will try reproducing locally and look for a workaround for you!
Thank you.
@yuguen-intel FYI.
Hi everyone,
has there been any progress on this one? We would like to give it a try.
Best regards, Robert
Hey - unfortunately no progress yet. As the problem seems to be coming from the BSP, it looks like there won't be a straightforward solution. We have someone looking at what can be done, and I'll report back here if there is any progress.
Cheers
Hey @robert-mijakovic, we were not able to reproduce this behavior on the BSPs that we have. You mentioned that this issue occurs not only on the GZIP reference design, but on all reference designs. Would you be able to run a few experiments so that we can narrow down what the issue is? These two reference designs would help us identify where the issue is coming from.
Hi @yuguen-intel,
sure, I can run them right away. Let me know what I need to do with them. I have them compiled already, but I can recompile if necessary.
Best regards, Robert
Can you just execute both of them on the FPGA? If I remember correctly, neither of the two requires additional parameters: ./qri.fpga and ./crr.fpga. What throughput do you get for each of these? What is the Quartus achieved fmax reported in the reports? And do you get the warning about not using DMA to transfer the data? Thanks!
Running on device: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
Device name: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1920 bytes from host to device because of lack of alignment
** host ptr (0xc045270) and/or dev offset (0x400) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 7864320 bytes from host to device because of lack of alignment
** host ptr (0x14d62e591010) and/or dev offset (0xc00) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1920 bytes from device to host because of lack of alignment
** host ptr (0xc046190) and/or dev offset (0x780c00) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1920 bytes from host to device because of lack of alignment
** host ptr (0xc045270) and/or dev offset (0x400) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 7864320 bytes from host to device because of lack of alignment
** host ptr (0x14d62e591010) and/or dev offset (0xc00) is not aligned to 64 bytes
============= Correctness Test =============
Running analytical correctness checks...
CPU-FPGA Equivalence: PASS
============= Throughput Test =============
Avg throughput: 2.7 assets/s
mel3002: execution time - OneAPI fpga : 00:00:46
Running on device: Intel(R) FPGA Emulation Device
Device name: Intel(R) FPGA Emulation Device
============= Correctness Test =============
Running analytical correctness checks...
CPU-FPGA Equivalence: PASS
============= Throughput Test =============
Avg throughput: 3.4 assets/s
mel3002: execution time - OneAPI fpga_emu : 00:00:11
Device name: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
Generating 8 random real matrices of size 32x32
Running QR inversion of 8 matrices 6553600 times
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 32768 bytes from host to device because of lack of alignment
** host ptr (0xa354530) and/or dev offset (0x400) is not aligned to 64 bytes
Total duration: 314.772 s
Throughput: 166.561k matrices/s
Verifying results on matrix 0
1
2
3
4
5
6
7
PASSED
mel3001: execution time - OneAPI fpga : 00:05:16
Device name: Intel(R) FPGA Emulation Device
Generating 8 random real matrices of size 32x32
Running QR inversion of 8 matrices 16 times
Total duration: 4300.38 s
Throughput: 2.97648e-05k matrices/s
Verifying results on matrix 0
1
2
3
4
5
6
7
PASSED
mel3001: execution time - OneAPI fpga_emu : 01:11:49
Many thanks for these experiments, I'll discuss that with the team and let you know!
Hi Robert,
I posted this feedback on your ticket in our system, but I thought I'd include it here as well:
All the Intel OneAPI examples will suffer from this problem. This is because, as the warning message states, the data has not been byte-aligned. The host-to-card transfer is inferred by the SYCL code; however, it still uses the underlying BittWare DMA engine, which requires buffers to be aligned to 64 bytes.
The FFT example handles this by using the "aligned_alloc" function, e.g.
HBM[0] = (float*)aligned_alloc(byte_alignment, sizeof(float) * 4 * (HBM_FFT_SIZE * HBM_FFT_SIZE / 16));
This array is then used later when creating the shared host/accelerator buffer.
buffer<float4, 1> buffer_in0((float4*)HBM[0], num_items);
Using “aligned_alloc” ensures the data is byte-aligned and the full host-to-card bandwidth is achieved.
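For reference, here is a minimal self-contained sketch of the same pattern. Only the 64-byte alignment requirement and the byte_alignment/buffer idea come from the discussion above; the buffer size, kernel, and variable names are illustrative assumptions, not code from the reference designs:

```cpp
#include <sycl/sycl.hpp>
#include <cstdlib>  // std::aligned_alloc, std::free

int main() {
  constexpr size_t byte_alignment = 64;  // alignment required by the BSP's DMA engine
  constexpr size_t num_items = 1024;     // illustrative size; 1024*4 bytes is a multiple of 64

  // Allocate the host-side storage 64-byte aligned so the runtime can use DMA.
  // Note: std::aligned_alloc requires the size to be a multiple of the alignment.
  float *host_data = static_cast<float *>(
      std::aligned_alloc(byte_alignment, num_items * sizeof(float)));
  for (size_t i = 0; i < num_items; ++i) host_data[i] = static_cast<float>(i);

  {
    // The buffer wraps the aligned host pointer, so host<->device copies stay aligned.
    sycl::buffer<float, 1> buffer_in0(host_data, sycl::range<1>(num_items));

    sycl::queue q;
    q.submit([&](sycl::handler &h) {
      auto acc = buffer_in0.get_access<sycl::access::mode::read_write>(h);
      h.parallel_for(sycl::range<1>(num_items), [=](sycl::id<1> i) { acc[i] *= 2.0f; });
    });
  }  // buffer destruction copies the data back into host_data

  std::free(host_data);
  return 0;
}
```

With an unaligned host pointer (e.g. a plain new/malloc allocation), this BSP falls back to non-DMA transfers and prints the "NOT using DMA" warning shown earlier in this thread.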
Thanks @andrewchorlian for coming forward and documenting this here.
@robert-mijakovic I was going to give you the same kind of conclusion. This is a limitation of the BSP that you are using. The documentation we provide for creating a BSP does not state that addresses will be aligned to 64 bytes, and, as you can see, the SYCL frontend does not align them unless it is specifically told to. BittWare implemented this BSP with this limitation, so users of this BSP have to ensure correct alignment themselves, according to the BSP's specification.
I'm therefore going to close this case, as this is not a code samples issue but rather a peculiarity of this BSP.
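For completeness, here is a hedged sketch of an alternative way to request the alignment explicitly, using the SYCL 2020 USM host allocation API. Whether this particular BSP/runtime supports USM host allocations is not confirmed in this thread, so treat it as an assumption:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t num_items = 1024;  // illustrative size only

  // sycl::aligned_alloc_host lets the caller request the 64-byte alignment
  // that this BSP's DMA engine needs.
  float *host_data = sycl::aligned_alloc_host<float>(64, num_items, q);
  for (size_t i = 0; i < num_items; ++i) host_data[i] = static_cast<float>(i);

  // ... use host_data as the source of explicit copies or as buffer backing store ...

  sycl::free(host_data, q);
  return 0;
}
```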
Yohann
Summary
The DPC++FPGA Reference Designs do not use DMA to transfer data between host and device because the host pointer and/or device offset is not aligned to 64 bytes.
Version
Tip of the repository and earlier releases.
Environment
Steps to reproduce
Build and execute the Reference Designs from the repository.
Observed behavior
The Reference Designs run slowly and show that DMA transfers are not used:
For instance, compressing a 100M file with gzip takes more than 3 hours.
Expected behavior
The designs execute without performance issues so that I can compare the CPU and FPGA implementations.