Thanks for this @robert-mijakovic. I cannot reproduce this locally, but have some ideas on how to do so. Just confirming, you ran into this with all reference designs? Or just GZIP?
Yes, I ran into this with all reference designs.
Darn. Sorry about that. This seems to be related to the s10mx BSP: I cannot recreate this with other BSPs locally. Do you have access to any other BSP?
No, not really. We only have the BSP for the card that we have, the Bittware 520N-MX. Aside from the 8 reference designs, I have another example from Bittware, a 2D FFT, but it doesn't exhibit this issue.
Okay, no worries. I will try reproducing locally and look for a workaround for you!
Thank you.
@yuguen-intel FYI.
Hi everyone,
has there been any progress on this one? We would like to give it a try.
Best regards, Robert
Hey - unfortunately no progress yet. As the problem seems to be coming from the BSP, it looks like there won't be a straightforward solution. We have someone looking at what can be done, and I'll report back here if there is any progress.
Cheers
Hey @robert-mijakovic, we were not able to reproduce this behavior on the BSPs that we have. You mentioned that this issue occurs not only on the GZIP reference design, but on all reference designs. Would you be able to run a few experiments so that we can narrow down what the issue is? These two reference designs would help us identify where the issue is coming from.
Hi @yuguen-intel,
sure, I can run them right away. Let me know what I need to do with them. I have them compiled already, but I can recompile if necessary.
Best regards, Robert
Can you just execute both of them on the FPGA? If I remember correctly, neither of the two requires additional parameters: ./qri.fpga and ./crr.fpga. What throughput do you get for each of these? What is the Quartus achieved fmax reported in the reports? And do you get the warning about not using DMA to transfer the data? Thanks!
Running on device: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
Device name: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1920 bytes from host to device because of lack of alignment
** host ptr (0xc045270) and/or dev offset (0x400) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 7864320 bytes from host to device because of lack of alignment
** host ptr (0x14d62e591010) and/or dev offset (0xc00) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1920 bytes from device to host because of lack of alignment
** host ptr (0xc046190) and/or dev offset (0x780c00) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1920 bytes from host to device because of lack of alignment
** host ptr (0xc045270) and/or dev offset (0x400) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 7864320 bytes from host to device because of lack of alignment
** host ptr (0x14d62e591010) and/or dev offset (0xc00) is not aligned to 64 bytes
============= Correctness Test =============
Running analytical correctness checks...
CPU-FPGA Equivalence: PASS
============= Throughput Test =============
Avg throughput: 2.7 assets/s
mel3002: execution time - OneAPI fpga : 00:00:46
Running on device: Intel(R) FPGA Emulation Device
Device name: Intel(R) FPGA Emulation Device
============= Correctness Test =============
Running analytical correctness checks...
CPU-FPGA Equivalence: PASS
============= Throughput Test =============
Avg throughput: 3.4 assets/s
mel3002: execution time - OneAPI fpga_emu : 00:00:11
Device name: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
Generating 8 random real matrices of size 32x32
Running QR inversion of 8 matrices 6553600 times
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 32768 bytes from host to device because of lack of alignment
** host ptr (0xa354530) and/or dev offset (0x400) is not aligned to 64 bytes
Total duration: 314.772 s
Throughput: 166.561k matrices/s
Verifying results on matrix 0
1
2
3
4
5
6
7
PASSED
mel3001: execution time - OneAPI fpga : 00:05:16
Device name: Intel(R) FPGA Emulation Device
Generating 8 random real matrices of size 32x32
Running QR inversion of 8 matrices 16 times
Total duration: 4300.38 s
Throughput: 2.97648e-05k matrices/s
Verifying results on matrix 0
1
2
3
4
5
6
7
PASSED
mel3001: execution time - OneAPI fpga_emu : 01:11:49
Many thanks for these experiments, I'll discuss that with the team and let you know!
Hi Robert,
I posted this feedback on your ticket in our system, but I thought I'd include it here as well:
All the Intel OneAPI examples will suffer from this problem. This is because, as the warning message states, the data has not been byte-aligned. The host-to-card transfer is inferred by the SYCL code; however, it still uses the underlying BittWare DMA engine, which requires buffers to be aligned to 64 bytes.
The FFT example handles this by using the "aligned_alloc" function, e.g.
HBM[0] = (float*)aligned_alloc(byte_alignment, sizeof(float) * 4 * (HBM_FFT_SIZE * HBM_FFT_SIZE / 16));
This array is then used later when creating the shared host/accelerator buffer.
buffer<float4, 1> buffer_in0((float4*)HBM[0], num_items);
Using “aligned_alloc” ensures the data is byte-aligned and the full host-to-card bandwidth is achieved.
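For reference, here is a minimal self-contained sketch of the same pattern. Only the 64-byte alignment requirement and the byte_alignment/buffer idea come from the discussion above; the buffer size, kernel, and variable names are illustrative assumptions, not code from the reference designs:

```cpp
#include <sycl/sycl.hpp>
#include <cstdlib>  // std::aligned_alloc, std::free

int main() {
  constexpr size_t byte_alignment = 64;  // alignment required by the BSP's DMA engine
  constexpr size_t num_items = 1024;     // illustrative size; 1024*4 bytes is a multiple of 64

  // Allocate the host-side storage 64-byte aligned so the runtime can use DMA.
  // Note: std::aligned_alloc requires the size to be a multiple of the alignment.
  float *host_data = static_cast<float *>(
      std::aligned_alloc(byte_alignment, num_items * sizeof(float)));
  for (size_t i = 0; i < num_items; ++i) host_data[i] = static_cast<float>(i);

  {
    // The buffer wraps the aligned host pointer, so host<->device copies stay aligned.
    sycl::buffer<float, 1> buffer_in0(host_data, sycl::range<1>(num_items));

    sycl::queue q;
    q.submit([&](sycl::handler &h) {
      auto acc = buffer_in0.get_access<sycl::access::mode::read_write>(h);
      h.parallel_for(sycl::range<1>(num_items), [=](sycl::id<1> i) { acc[i] *= 2.0f; });
    });
  }  // buffer destruction copies the data back into host_data

  std::free(host_data);
  return 0;
}
```

With an unaligned host pointer (e.g. a plain new/malloc allocation), this BSP falls back to non-DMA transfers and prints the "NOT using DMA" warning shown earlier in this thread.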
Thanks @andrewchorlian for coming forward and documenting this here.
@robert-mijakovic I was going to give you the same kind of conclusion. This is a limitation of the BSP that you are using. The documentation we provide for creating a BSP does not state that addresses will be aligned to 64 bytes, and, as you can see, the SYCL frontend does not align them unless it is specifically told to. BittWare implemented this BSP with this limitation, so users of this BSP have to ensure correct alignment themselves, according to the BSP's specification.
I'm therefore going to close this case, as this is not a code samples issue but rather a peculiarity of this BSP.
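For completeness, here is a hedged sketch of an alternative way to request the alignment explicitly, using the SYCL 2020 USM host allocation API. Whether this particular BSP/runtime supports USM host allocations is not confirmed in this thread, so treat it as an assumption:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t num_items = 1024;  // illustrative size only

  // sycl::aligned_alloc_host lets the caller request the 64-byte alignment
  // that this BSP's DMA engine needs.
  float *host_data = sycl::aligned_alloc_host<float>(64, num_items, q);
  for (size_t i = 0; i < num_items; ++i) host_data[i] = static_cast<float>(i);

  // ... use host_data as the source of explicit copies or as buffer backing store ...

  sycl::free(host_data, q);
  return 0;
}
```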
Yohann
Summary
The DPC++FPGA Reference Designs do not use DMA to transfer data between host and device because the host pointer and/or device offset is not aligned to 64 bytes.
Version
Tip of the repository and earlier releases.
Environment
Steps to reproduce
Build and execute the Reference Designs from the repository.
Observed behavior
The Reference Designs run slowly and show that DMA transfers are not used:
For instance, compressing a 100M file with gzip takes more than 3 hours.
Expected behavior
The designs execute without performance issues so that I can compare the CPU and FPGA implementations.