salehjg / DeepPoint-V2-FPGA

The code repository of DGCNN on FPGA: Acceleration of The Point Cloud Classifier Using FPGAs
GNU General Public License v3.0

[Question] how to process the reshape? #3

Closed zyt1024 closed 11 months ago

zyt1024 commented 1 year ago

The project is nice! I have one question: in DGCNN, there are many reshape operations used in edge_feature, but I didn't see any related kernel implementation. How is this handled?

salehjg commented 1 year ago

Dear @zyt1024,

Thank you,

Tensor reshaping requires no data manipulation on the host side. But since the device-side tensors are padded along the least significant axis (for example, axis=3 or axis=-1 for a rank-4 tensor), any change to the size of the last dimension also requires the device-side data to be re-arranged.

For example, on the host side, reshaping a tensor of shape (2,7) into (14) requires no data modification: the array holding the tensor data is not modified or accessed in any way. On the device side, the shape (2,7) is internally padded to (2,16), and the target shape (14) is padded to (16). As you can guess, (2,16) is not compatible with (16), so this particular reshape requires data re-arrangement.

In short, we try to pad the tensors on the host side, transfer them to the device, run one or more kernels, and finally transfer the padded result(s) back to host memory and un-pad them only once.
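To make the incompatibility concrete, here is a minimal host-side sketch of the padding arithmetic described above. The helper name `padded_len` and the pad width of 16 words are illustrative assumptions, not the repository's actual API or constants.

```cpp
// Minimal sketch of the last-axis padding arithmetic described above.
// NOTE: kPad and padded_len() are illustrative assumptions, not the
// actual constants/API of this repository.
#include <cstddef>
#include <cstdio>

constexpr std::size_t kPad = 16;  // assumed padded chunk size of the last axis

// Round a last-axis length up to the next multiple of kPad.
std::size_t padded_len(std::size_t last_axis) {
    return ((last_axis + kPad - 1) / kPad) * kPad;
}

int main() {
    // Host side: reshaping (2,7) -> (14) reuses the same 14 contiguous
    // values; no copy or re-arrangement is needed.
    // Device side: (2,7) is stored padded as (2,16), while the target
    // shape (14) would be stored padded as (16).
    std::size_t src_words = 2 * padded_len(7);  // 2 * 16 = 32 device words
    std::size_t dst_words = padded_len(14);     // 16 device words
    std::printf("device words: src=%zu, dst=%zu\n", src_words, dst_words);
    // 32 != 16, and the per-row padding gaps sit in different places,
    // so the device buffer must be re-arranged for this reshape.
    return 0;
}
```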

zyt1024 commented 1 year ago

Dear @salehjg, thank you, I see. After reading the performance figures in your paper, I have a question. You run 8x faster than on a CPU. I'd like to know what you compared against: running the model with TensorFlow or PyTorch? Running it with onnxruntime? Or a CPU implementation of your own?

salehjg commented 1 year ago

Sure, the comparisons are between the whole design fitted on a single FPGA (with two or more single-channel DDR4s working at their original bit-widths) and a naive single-thread CPU implementation, with and without -O3 compiler optimizations.

There are many kernel-specific configurations in the config repository that directly affect the parallelism, and we had to set them to the bare-minimum values to be able to fit the design on a single FPGA.
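As an illustration of the kind of knob involved, here is a hypothetical HLS kernel with a loop-unroll factor; the kernel itself and the factor value are assumptions for illustration, not taken from the config repository.

```cpp
// Hypothetical example of a kernel-specific parallelism knob; the kernel
// and the unroll factor are illustrative, not from the config repository.
extern "C" void vec_scale(const float *in, float *out, float alpha, int n) {
    for (int i = 0; i < n; i++) {
        // A bare-minimum unroll factor keeps resource usage low enough
        // for the whole design to fit on a single FPGA; raising it
        // increases parallelism (and resource usage) accordingly.
        #pragma HLS unroll factor = 2
        out[i] = alpha * in[i];
    }
}
```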

zyt1024 commented 1 year ago

> Sure, the comparisons are between the whole design fitted on a single FPGA (with two or more single-channel DDR4s working at their original bit-widths) and a naive single-thread CPU implementation, with and without -O3 compiler optimizations.
>
> There are many kernel-specific configurations in the config repository that directly affect the parallelism, and we had to set them to the bare-minimum values to be able to fit the design on a single FPGA.

Thank you. Can this FPGA implementation run faster than onnxruntime?

salehjg commented 1 year ago

> > Sure, the comparisons are between the whole design fitted on a single FPGA (with two or more single-channel DDR4s working at their original bit-widths) and a naive single-thread CPU implementation, with and without -O3 compiler optimizations. There are many kernel-specific configurations in the config repository that directly affect the parallelism, and we had to set them to the bare-minimum values to be able to fit the design on a single FPGA.
>
> Thank you. Can this FPGA implementation run faster than onnxruntime?

We have not tried onnxruntime.

salehjg commented 11 months ago

I am closing this issue; please feel free to reopen it or post a new one.