Dear @zyt1024,
Thank you,
Tensor reshaping requires no data manipulation on the host side. However, since the device-side tensors are padded along the least significant axis (for a tensor of rank 4, that is axis=3, i.e. axis=-1), any change to the size of the last dimension also requires the device-side data to be modified and re-arranged.
For example, on the host side, reshaping a tensor of shape (2, 7) into (14) requires no data modification; the array holding the tensor data is not touched in any way. On the device side, however, the shape (2, 7) is internally padded to (2, 16), and the target shape (14) would be padded to (16). As you can guess, (2, 16) is not compatible with (16), so this particular reshape requires data re-arrangement.
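To make that concrete, here is a minimal sketch (not the repository's actual code) that counts how many words a shape occupies once its last axis is padded to a multiple of an assumed word-group size of 16:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

constexpr std::size_t kVectorWords = 16; // assumed device word-group size

// Round the last dimension up to the next multiple of kVectorWords.
std::size_t PadLastAxis(std::size_t lastDim) {
    return ((lastDim + kVectorWords - 1) / kVectorWords) * kVectorWords;
}

// Number of words a tensor occupies in device memory once its last axis
// is padded; all other axes keep their original sizes.
std::size_t PaddedVolume(const std::vector<std::size_t>& shape) {
    std::size_t vol = 1;
    for (std::size_t i = 0; i + 1 < shape.size(); ++i) vol *= shape[i];
    return vol * PadLastAxis(shape.back());
}

int main() {
    // Host side: (2, 7) -> (14) is a pure metadata change, 14 elements
    // either way. Device side: the padded layouts disagree, so the data
    // would have to be re-arranged.
    std::cout << PaddedVolume({2, 7}) << '\n'; // 32 words, stored as (2, 16)
    std::cout << PaddedVolume({14}) << '\n';   // 16 words, stored as (16)
}
```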
In short, we pad the tensors on the host side once, transfer them to the device, run one or more kernels on the padded layouts, and finally transfer the padded result(s) back to host memory and un-pad them, again only once.
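For illustration, a hedged sketch of that host-side pad/un-pad round trip, assuming row-major float data and the same fixed word-group size of 16 (the function names are hypothetical, not the project's API):

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kVectorWords = 16; // assumed device word-group size

// Copy host data into a buffer whose last axis is zero-padded to a
// multiple of kVectorWords; done once, before the transfer to the device.
std::vector<float> PadLastAxis(const std::vector<float>& host,
                               std::size_t rows, std::size_t lastDim) {
    const std::size_t stride =
        ((lastDim + kVectorWords - 1) / kVectorWords) * kVectorWords;
    std::vector<float> out(rows * stride, 0.0f);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < lastDim; ++c)
            out[r * stride + c] = host[r * lastDim + c];
    return out;
}

// Strip the padding from a device result; done once, after the final
// transfer back to the host.
std::vector<float> UnpadLastAxis(const std::vector<float>& padded,
                                 std::size_t rows, std::size_t lastDim) {
    const std::size_t stride =
        ((lastDim + kVectorWords - 1) / kVectorWords) * kVectorWords;
    std::vector<float> out(rows * lastDim);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < lastDim; ++c)
            out[r * lastDim + c] = padded[r * stride + c];
    return out;
}

int main() {
    // Round-trip a (2, 7) tensor: pad to (2, 16), then un-pad back.
    std::vector<float> t(2 * 7, 1.0f);
    auto dev = PadLastAxis(t, 2, 7);      // 32 words sent to the device
    auto back = UnpadLastAxis(dev, 2, 7); // 14 words recovered
    return back.size() == t.size() ? 0 : 1;
}
```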
Dear @salehjg
Thank you, I see.
After reading the performance indicators in your paper, I have a question. You report running 8x faster than on a CPU. I'd like to know what you are comparing against: running the model with `tensorflow` or `pytorch`, running it with `onnxruntime`, or the CPU implementation you wrote yourselves?
Sure, the comparisons are between the whole design fitted on a single FPGA (with two or more single-channel DDR4s working at their original bit-widths) and a naive single-thread CPU implementation, with and without `-O3` compiler optimizations. There are many kernel-specific configurations in the config repository that directly affect the parallelism, and we had to set them to the bare minimum values to be able to fit the design on a single FPGA.
Thank you. Can this FPGA implementation run faster than `onnxruntime`?
We have not tried `onnxruntime`.
I am closing this issue; please feel free to reopen it or post a new one.
The project is nice! I have one question: in DGCNN, there are many reshape operations used in `edge_feature`, but I didn't see any related kernel implementation. How is this handled?