Help on Performance Details

sfox14 / pynq-ekf

A multi-board Extended Kalman Filter (EKF)

BSD 3-Clause "New" or "Revised" License

Help on Performance Details #10

Open vjvdn opened 5 years ago

vjvdn commented 5 years ago

Hi,

Thank you for posting this design. Could you help me understand the performance numbers that are quoted? On the Pynq board, the quoted performance was:

Naive SW (Python): 64.1 ms
HW-SW (Python): 83.3 ms
HW-only (Python): 1.8 ms

1. Is the Naive SW code running on the ARM of the Pynq device?
2. For the HW-SW case: is the functionality partitioning between HW and SW the same as shown in https://github.com/sfox14/pynq-ekf/blob/master/utils/images/hwsw.png?
3. The HW-SW case takes more time than the SW case. Does that mean the HW is not accelerating the functionality?
4. HW-only: is all computation done in HW (NOT as partitioned in https://github.com/sfox14/pynq-ekf/blob/master/utils/images/hwsw.png)?
5. Are these execution times for one iteration or for all 25 iterations of your input?

Thanks & Regards

sfox14 commented 5 years ago

Sure.

1. Yes, it's running on the ARM. The source is in utils/python.
2. Yes, f(x), h(x), F and H are computed in Python on the ARM.
3. True. Moving f, h, H and F between PS and PL every iteration is costly and dominates the execution time. On larger problems (like 50 hidden states), where the amount of computation is greater, the FPGA offers a speedup.
4. Yes.
5. I believe the times are based on 5000 iterations. The dataset is 50 observations, so it's just run 100 times.

vjvdn commented 5 years ago

Thank you for the reply.

Could you share the resource utilization for the PL part in the HW-SW case?
Could you share the resource utilization for the HW-only case?
We could find only 25 observations in the csv file, so you would have run them 200 times. That means the performance per iteration is:

Naive SW (Python): 13 us
HW-SW (Python): 17 us
HW-only (Python): 0.4 us

In the HW-only case, we would have 36 cycles of a 100 MHz clock to complete one iteration, which includes a matrix inverse, matrix multiplication and other operations. That means FPGA resource utilization might be high. Could you share the HW utilization details? Thanks again.
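The per-iteration figures and the 36-cycle estimate follow from dividing the quoted totals by the iteration count and multiplying by the clock frequency. A minimal sketch of that arithmetic, assuming the 5000-iteration count and the 100 MHz clock mentioned in this thread (neither is confirmed by a measurement here; the function and constant names are illustrative only):

```python
# Totals quoted in this thread, in milliseconds.
TOTAL_MS = {"Naive SW": 64.1, "HW-SW": 83.3, "HW-only": 1.8}

ITERATIONS = 5000  # assumption: the author's recollection of the run length
CLOCK_MHZ = 100    # assumption: PL clock frequency used in the estimate


def per_iteration_us(total_ms, iterations=ITERATIONS):
    """Convert a total run time in ms to microseconds per iteration."""
    return total_ms * 1000.0 / iterations


def cycles_per_iteration(total_ms, iterations=ITERATIONS, clock_mhz=CLOCK_MHZ):
    """Clock cycles available per iteration (microseconds * MHz = cycles)."""
    return per_iteration_us(total_ms, iterations) * clock_mhz


if __name__ == "__main__":
    for name, total_ms in TOTAL_MS.items():
        us = per_iteration_us(total_ms)
        cycles = cycles_per_iteration(total_ms)
        print(f"{name}: {us:.2f} us/iteration, about {cycles:.0f} cycles")
```

If the totals actually cover a different number of iterations, changing `ITERATIONS` rescales every figure proportionally, which is why the iteration count matters for judging whether the HW-only number is plausible.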

sfox14 commented 5 years ago

I agree, 36 cycles is far too small. I will try rebuilding since I don't actually have the Vivado project files anymore. This may take a day or two.

vjvdn commented 5 years ago

Thanks a lot for your quick reply. Will wait for your response on the HW resource utilization. Thanks again.

sfox14 commented 5 years ago

The utilisation of hw-only (i.e. gps) is: LUTs: 48%, FF: 33%, DSP: 96%, BRAM: 13%

The utilisation of hw-sw (n8m4) is: LUTs: 50%, FF: 33%, DSP: 97%, BRAM: 24%

Here are the HLS reports for more details: gps_top_ekf_csynth.txt n8m4_top_ekf_csynth.txt

This obviously doesn't account for every cycle of latency. There's still the software overhead. You can get some numbers for this by rebuilding in SDx.

vjvdn commented 5 years ago

Thanks a lot. Could you comment on the performance numbers? Is it possible to publish performance numbers for one iteration? Thank you & Regards
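One generic way to obtain a per-iteration figure is to time a known number of calls and divide by that count. The sketch below is not the repository's benchmark; `dummy_step` is a hypothetical stand-in for a single EKF predict/update call, and `time_per_iteration` is an illustrative helper name.

```python
import time


def time_per_iteration(step, n_iterations, repeats=5):
    """Return the best observed mean time (in seconds) per call of step()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(n_iterations):
            step()
        elapsed = time.perf_counter() - start
        # Keep the fastest repeat to reduce noise from other processes.
        best = min(best, elapsed / n_iterations)
    return best


def dummy_step():
    # Hypothetical placeholder for one filter iteration; not the repo's API.
    sum(i * i for i in range(100))


if __name__ == "__main__":
    seconds = time_per_iteration(dummy_step, n_iterations=1000)
    print(f"{seconds * 1e6:.2f} us per iteration")
```

Reporting the iteration count alongside the total avoids the ambiguity discussed in this thread, since the per-iteration number can then be recomputed by anyone.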