Closed: masadcv closed this issue 2 years ago
As noted in #21, the proposed improvements reduce the number of kernel invocations to just the number of passes (8, down from 1336, for the 2D example and 12, down from 1108, for the 3D example).
This leads to the following improvements in nvprof:
After merging PR:

```
$ nvprof ./samples/simpledemo2d_profile.py
                Type  Time(%)      Time  Calls       Avg       Min       Max  Name
     GPU activities:   89.30%  664.51us      8  83.063us  67.552us  98.240us  geodesic_updown_single_row_pass_ptr_kernel(float*, float*, float, float, int, int, int, int)
```

```
$ nvprof ./samples/simpledemo3d_profile.py
                Type  Time(%)      Time  Calls       Avg       Min       Max  Name
     GPU activities:   78.00%  2.1346ms     12  177.89us  68.896us  255.93us  geodesic_frontback_single_plane_pass_ptr_kernel(float*, float*, float, float, int, int, int, int, int)
```
Before PR:

```
$ nvprof ./samples/simpledemo2d_profile.py
                Type  Time(%)      Time  Calls       Avg       Min       Max  Name
     GPU activities:   98.46%  5.0414ms   1336  3.7730us  3.0070us  4.1600us  void geodesic_updown_single_row_pass_kernel<float>(at::GenericPackedTensorAccessor<float, unsigned long=4, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, unsigned long=4, at::RestrictPtrTraits, int>, float, float, int, int)
```

```
$ nvprof ./samples/simpledemo3d_profile.py
                Type  Time(%)      Time  Calls       Avg       Min       Max  Name
     GPU activities:   90.99%  6.0111ms   1108  5.4250us  4.6080us  8.0000us  void geodesic_frontback_single_plane_pass_kernel<float>(at::GenericPackedTensorAccessor<float, unsigned long=5, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, unsigned long=5, at::RestrictPtrTraits, int>, float, float, int, int)
```
As reported in https://github.com/openjournals/joss-reviews/issues/4532, in the comment https://github.com/openjournals/joss-reviews/issues/4532#issuecomment-1192949580:

The CUDA kernels are called many times (invocations == number of rows/planes). As the reviewer indicated, this may incur additional costs for setting up and launching the kernel anew for every row.
The expected behaviour here is to set up and call the kernel once per pass. This removes the overhead of the many repeated launches and should improve the overall execution efficiency of the current algorithms.
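The one-launch-per-pass pattern can be sketched as below. This is a simplified, hypothetical example, not the actual FastGeodis kernel: it assumes a plain top-down pass with vertical connectivity only, so each thread can sweep its own column without inter-thread synchronization. The real geodesic pass also considers diagonal neighbours, which requires additional care (e.g. synchronizing between rows), but the launch-count reduction is the same idea.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical sketch: one kernel launch covers a whole pass.
// Each thread owns one column and sweeps all rows inside the kernel,
// instead of the host launching one kernel per row.
__global__ void single_pass_kernel(float *dist, const float *img,
                                   float lambda, int height, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;

    // Top-to-bottom sweep: row r only depends on row r-1 of the same
    // column, which this thread wrote itself, so no sync is needed in
    // this simplified vertical-connectivity case.
    for (int row = 1; row < height; ++row) {
        float up   = dist[(row - 1) * width + col];
        float cur  = dist[row * width + col];
        float cost = fabsf(img[row * width + col] -
                           img[(row - 1) * width + col]);
        dist[row * width + col] = fminf(cur, up + lambda * cost);
    }
}

// Host side: a single launch per pass replaces `height` launches,
// cutting per-launch setup overhead to once per pass.
void run_pass(float *d_dist, const float *d_img, float lambda,
              int height, int width)
{
    int threads = 256;
    int blocks = (width + threads - 1) / threads;
    single_pass_kernel<<<blocks, threads>>>(d_dist, d_img, lambda,
                                            height, width);
}
```

With this structure, the 2D example needs only one launch per up/down pass (hence 8 invocations in the profile above) rather than one launch per row.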