Closed masadcv closed 2 years ago
Proposal to reduce the number of CUDA kernel invocations, so that the overhead of setting up (launching) kernels is minimised
This addresses #20
The following steps are implemented:
samples\*_profile.py
This PR leads to the following improvements:
After merging PR
$ nvprof ./samples/simpledemo2d_profile.py
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  89.30%   664.51us  8      83.063us  67.552us  98.240us  geodesic_updown_single_row_pass_ptr_kernel(float*, float*, float, float, int, int, int, int)

$ nvprof ./samples/simpledemo3d_profile.py
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  78.00%   2.1346ms  12     177.89us  68.896us  255.93us  geodesic_frontback_single_plane_pass_ptr_kernel(float*, float*, float, float, int, int, int, int, int)
Before PR
$ nvprof ./samples/simpledemo2d_profile.py
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  98.46%   5.0414ms  1336   3.7730us  3.0070us  4.1600us  void geodesic_updown_single_row_pass_kernel<float>(at::GenericPackedTensorAccessor<float, unsigned long=4, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, unsigned long=4, at::RestrictPtrTraits, int>, float, float, int, int)

$ nvprof ./samples/simpledemo3d_profile.py
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  90.99%   6.0111ms  1108   5.4250us  4.6080us  8.0000us  void geodesic_frontback_single_plane_pass_kernel<float>(at::GenericPackedTensorAccessor<float, unsigned long=5, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, unsigned long=5, at::RestrictPtrTraits, int>, float, float, int, int)
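To make the before/after comparison easier to read, here is a small sketch that computes the ratios from the nvprof figures quoted in this PR. The numbers are copied from the output above; the dictionary keys and the `summarise` helper are illustrative names, not part of the repository. It assumes the `_ptr_kernel` variants are the direct replacements for the original templated kernels, and it compares only the time spent in those kernels, not end-to-end runtime.

```python
# Figures copied from the nvprof output in this PR.
# Each entry is (total kernel time in microseconds, number of kernel calls).
results = {
    "2d updown row pass": {"before": (5041.4, 1336), "after": (664.51, 8)},
    "3d frontback plane pass": {"before": (6011.1, 1108), "after": (2134.6, 12)},
}


def summarise(before, after):
    """Return how much kernel time and how many launches were saved (hypothetical helper)."""
    (t_before, n_before), (t_after, n_after) = before, after
    return {
        "time_speedup": t_before / t_after,
        "call_reduction": n_before / n_after,
    }


summary = {name: summarise(r["before"], r["after"]) for name, r in results.items()}

for name, s in summary.items():
    print(
        f"{name}: {s['time_speedup']:.1f}x less kernel time, "
        f"{s['call_reduction']:.0f}x fewer launches"
    )
```

On these figures the 2D kernel goes from 1336 launches to 8 (about 7.6x less time in that kernel), and the 3D kernel from 1108 launches to 12 (about 2.8x less time). Each remaining launch takes longer on average, which is expected when more work is done per invocation.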