hughperkins / cltorch

An OpenCL backend for torch.

Is there a profiler in cltorch to capture performance figures #73

Closed (streamhsa closed this issue 8 years ago)

streamhsa commented 8 years ago

Hi,

I'm looking for a benchmark/profiler for cltorch to capture performance figures. Is one already available, and if yes, can you tell me the procedure for capturing the numbers?

Thanks, Subhani

hughperkins commented 8 years ago

Hi Subhani,

Sure! Both wall-clock timings and kernel execution timings are available. For example:

require 'cltorch'

a = torch.ClTensor(1000,1000):uniform()
cltorch.setEnableTiming(1)
cltorch.setProfiling(1)
a:add(1)
cltorch.synchronize()

print('timings:')
cltorch.dumpTimings()
print('')

print('profiling:')
cltorch.dumpProfiling()
print('')

Output:

Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 940M
Timing activated
Profiling activated
statefultimer v0.7
timings:
dump enabled=1
StatefulTimer readings:
   Apply END Apply_1t_1s_0pt_-2_*out += val1: 0.179932ms count=1
   Apply compiled: 2.26807ms count=1
   Apply getname: 0.0290527ms count=1
   Apply got kernel: 0.0700684ms count=1
   Apply gotname: 0.0349121ms count=1
   THClTEnsor_pointwiseApply END: 0.00512695ms count=1
   THClTEnsor_pointwiseApply START: 0.0090332ms count=1
   before dump: 0.536865ms count=1

profiling:
Apply_1t_1s_0pt_-2_*out += val1.THClTensor_pointwiseApplyD 0.53648ms
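If you prefer a single per-call wall-clock number rather than reading the dumps, you can also bracket the work with cltorch.synchronize() and a plain torch7 torch.Timer. A minimal sketch, using only the calls shown above plus torch.Timer; timeIt is a hypothetical helper, not part of cltorch:

require 'cltorch'

-- timeIt is a hypothetical helper, not part of cltorch: it runs fn() nIters
-- times and reports the average wall-clock time per call, draining the
-- OpenCL queue before starting and before stopping the timer.
local function timeIt(name, nIters, fn)
  cltorch.synchronize()            -- finish any previously queued work
  local timer = torch.Timer()      -- torch7 wall-clock timer
  for i = 1, nIters do
    fn()
  end
  cltorch.synchronize()            -- wait for the queued kernels to finish
  print(string.format('%s: %.3f ms/iter',
    name, timer:time().real * 1000 / nIters))
end

local a = torch.ClTensor(1000, 1000):uniform()
timeIt('add', 100, function() a:add(1) end)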
hughperkins commented 8 years ago

Also, if you want to call clFinish() after every kernel launch (which makes things slower, but makes the wall-clock timings more representative), you can call cltorch.setAddFinish(1), e.g.:

require 'cltorch'

a = torch.ClTensor(1000,1000):uniform()
cltorch.setEnableTiming(1)
cltorch.setProfiling(1)
cltorch.setAddFinish(1)
for i=1,10 do
  a:add(1)
end
cltorch.synchronize()

print('timings:')
cltorch.dumpTimings()
print('')

print('profiling:')
cltorch.dumpProfiling()
print('')

Output:

Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 940M
Timing activated
Profiling activated
AddFinish activated
statefultimer v0.7
timings:
dump enabled=1
StatefulTimer readings:
   Apply END Apply_1t_1s_0pt_-2_*out += val1: 5.59424ms count=10
   Apply compiled: 2.51904ms count=1
   Apply getname: 0.048584ms count=10
   Apply got kernel: 0.0800781ms count=10
   Apply gotname: 0.0476074ms count=10
   THClTEnsor_pointwiseApply END: 0.00927734ms count=10
   THClTEnsor_pointwiseApply START: 0.0461426ms count=10
   before dump: 0.0168457ms count=1

profiling:
Apply_1t_1s_0pt_-2_*out += val1.THClTensor_pointwiseApplyD 5.25517ms
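For a per-call figure from the dump above: the Apply END line covers count=10 launches totalling about 5.59 ms, i.e. roughly 0.56 ms of wall-clock per add, and the kernel-level profiling total of about 5.26 ms works out to roughly 0.53 ms per launch.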
streamhsa commented 8 years ago

Thanks hughperkins, that is very useful.

Like the unit tests, does this package already include any benchmark tests?

hughperkins commented 8 years ago

Like the unit tests, does this package already include any benchmark tests?

Yes, sure! They are in https://github.com/hughperkins/clnn . Assuming you have installed via https://github.com/hughperkins/distro-cl into ~/torch-cl, you can run them like this:

cd ~/torch-cl/opencl/clnn
luajit test/test-perf.lua

These are the same scripts as used in soumith's convnet-benchmarks.
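If you just want a quick standalone number without the full convnet-benchmarks scripts, a minimal sketch along these lines should also work. The layer, sizes and iteration count here are purely illustrative, and it assumes clnn's :cl() module conversion (analogous to cunn's :cuda()):

require 'cltorch'
require 'nn'
require 'clnn'

-- Illustrative standalone benchmark, not one of the test-perf.lua scripts:
-- times the forward pass of a small convolution on the OpenCL device.
local net = nn.SpatialConvolutionMM(3, 16, 5, 5):cl()   -- move module to OpenCL
local input = torch.ClTensor(32, 3, 128, 128):uniform()

net:forward(input)               -- warm-up pass: compiles the kernels
cltorch.synchronize()

local timer = torch.Timer()
local nIters = 10
for i = 1, nIters do
  net:forward(input)
end
cltorch.synchronize()
print(string.format('forward: %.2f ms/iter',
  timer:time().real * 1000 / nIters))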

streamhsa commented 8 years ago

That's great. Thank you so much .