masadcv / FastGeodis

Fast Implementation of Generalised Geodesic Distance Transform for CPU (OpenMP) and GPU (CUDA)
https://fastgeodis.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[BUG] Improve CUDA kernels to reduce number of invocations #20

Closed masadcv closed 2 years ago

masadcv commented 2 years ago

As reported in https://github.com/openjournals/joss-reviews/issues/4532 in the comment https://github.com/openjournals/joss-reviews/issues/4532#issuecomment-1192949580

The CUDA kernels are called many times (number of invocations == number of rows/planes). As the reviewer points out, this may incur additional overhead from setting up and launching a kernel for every new row or plane.

The expected behaviour would be to set up and call the kernel once per pass. This would remove the overhead of the many separate calls and could improve the overall execution efficiency of the current algorithms.
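The difference between the two launch patterns can be sketched with a CPU-side model in plain Python. This is only an illustration, not the FastGeodis CUDA code: `make_grid`, `row_update`, `pass_with_launch_per_row` and `pass_with_single_launch` are hypothetical names, the unit step cost is an assumption, and a function call stands in for a kernel launch.

```python
# CPU-side model of the two launch strategies (plain Python, no CUDA).
# All names are hypothetical and not taken from FastGeodis.

INF = float("inf")


def make_grid(height, width):
    """Distance grid with a single seed (distance 0) at the top-left pixel."""
    grid = [[INF] * width for _ in range(height)]
    grid[0][0] = 0.0
    return grid


def row_update(dist, row):
    """Relax row `row` from row `row - 1`, assuming a unit step cost."""
    width = len(dist[row])
    for col in range(width):
        for dc in (-1, 0, 1):
            src = col + dc
            if 0 <= src < width:
                candidate = dist[row - 1][src] + 1.0
                if candidate < dist[row][col]:
                    dist[row][col] = candidate


def pass_with_launch_per_row(dist):
    """Reported pattern: the host loops over rows, one "launch" per row."""
    launches = 0
    for row in range(1, len(dist)):
        row_update(dist, row)  # stands in for one kernel launch
        launches += 1
    return launches


def pass_with_single_launch(dist):
    """Requested pattern: one "launch"; the row loop lives inside it."""
    def kernel(d):
        # Each row depends on the previous one, so a real GPU kernel would
        # need some form of synchronisation between rows at this point.
        for row in range(1, len(d)):
            row_update(d, row)

    kernel(dist)
    return 1


if __name__ == "__main__":
    a, b = make_grid(5, 4), make_grid(5, 4)
    print("launches per pass (per-row):", pass_with_launch_per_row(a))
    print("launches per pass (single): ", pass_with_single_launch(b))
    print("same result:", a == b)
```

Both variants do the same per-row work and produce the same distances; only the number of launches per pass changes, which is the overhead this issue is about.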

masadcv commented 2 years ago

As noted in #21, the improvements proposed there reduce the number of kernel invocations to the number of passes: from 1336 to 8 for the 2D example and from 1108 to 12 for the 3D example.

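This leads to the following improvements in the nvprof profiles: total time in the main kernel falls from 5.0414 ms to 664.51 us for the 2D example and from 6.0111 ms to 2.1346 ms for the 3D example.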

After merging PR

 $ nvprof ./samples/simpledemo2d_profile.py
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   89.30%  664.51us         8  83.063us  67.552us  98.240us  geodesic_updown_single_row_pass_ptr_kernel(float*, float*, float, float, int, int, int, int)

 $ nvprof ./samples/simpledemo3d_profile.py
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   78.00%  2.1346ms        12  177.89us  68.896us  255.93us  geodesic_frontback_single_plane_pass_ptr_kernel(float*, float*, float, float, int, int, int, int, int)

Before PR

 $ nvprof ./samples/simpledemo2d_profile.py
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
  GPU activities:   98.46%  5.0414ms      1336  3.7730us  3.0070us  4.1600us  void geodesic_updown_single_row_pass_kernel<float>(at::GenericPackedTensorAccessor<float, unsigned long=4, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, unsigned long=4, at::RestrictPtrTraits, int>, float, float, int, int)

 $ nvprof ./samples/simpledemo3d_profile.py
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
  GPU activities:   90.99%  6.0111ms      1108  5.4250us  4.6080us  8.0000us  void geodesic_frontback_single_plane_pass_kernel<float>(at::GenericPackedTensorAccessor<float, unsigned long=5, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, unsigned long=5, at::RestrictPtrTraits, int>, float, float, int, int)
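
To put the listings above side by side, the short script below computes the reduction in kernel time and in launch count from the nvprof figures. It is a sketch: the numbers are copied from the Time and Calls columns, and they cover only the single kernel shown in each listing, not the whole run.

```python
# Kernel totals copied from the nvprof listings above, in microseconds.
# Only the main geodesic kernel of each profile is included.
profiles = {
    "2D": {"before_us": 5041.4, "before_calls": 1336,
           "after_us": 664.51, "after_calls": 8},
    "3D": {"before_us": 6011.1, "before_calls": 1108,
           "after_us": 2134.6, "after_calls": 12},
}


def summarise(profile):
    """Return the kernel-time speedup and the launch-count reduction factor."""
    return {
        "speedup": profile["before_us"] / profile["after_us"],
        "call_reduction": profile["before_calls"] / profile["after_calls"],
    }


if __name__ == "__main__":
    for name, profile in profiles.items():
        s = summarise(profile)
        print(f"{name}: kernel time {s['speedup']:.1f}x lower, "
              f"{s['call_reduction']:.0f}x fewer launches")
```

On these figures the kernel time drops by roughly 7.6x for the 2D example and 2.8x for the 3D example. The average time per call goes up after the change because each call now covers a whole pass instead of a single row or plane.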