streamhsa closed this issue 8 years ago
Hi Subhani,
Sure! Both wallclock timings and kernel execution timings are available. For example:
require 'cltorch'
a = torch.ClTensor(1000,1000):uniform()
cltorch.setEnableTiming(1)
cltorch.setProfiling(1)
a:add(1)
cltorch.synchronize()
print('timings:')
cltorch.dumpTimings()
print('')
print('profiling:')
cltorch.dumpProfiling()
print('')
Output:
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 940M
Timing activated
Profiling activated
statefultimer v0.7
timings:
dump enabled=1
StatefulTimer readings:
Apply END Apply_1t_1s_0pt_-2_*out += val1: 0.179932ms count=1
Apply compiled: 2.26807ms count=1
Apply getname: 0.0290527ms count=1
Apply got kernel: 0.0700684ms count=1
Apply gotname: 0.0349121ms count=1
THClTEnsor_pointwiseApply END: 0.00512695ms count=1
THClTEnsor_pointwiseApply START: 0.0090332ms count=1
before dump: 0.536865ms count=1
profiling:
Apply_1t_1s_0pt_-2_*out += val1.THClTensor_pointwiseApplyD 0.53648ms
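A side note on reading the dump above: the first run includes one-time kernel compilation ("Apply compiled"), which dwarfs the kernel execution itself. A quick illustrative calculation in plain Python (numbers copied from the dump above, not part of cltorch):

```python
# Numbers copied from the StatefulTimer dump above (first run).
compile_ms = 2.26807    # "Apply compiled" - one-time kernel compilation
kernel_ms = 0.179932    # "Apply END ... *out += val1" - the add kernel itself

# On the first call, compilation dominates by roughly an order of magnitude.
ratio = compile_ms / kernel_ms
print(f"compile/kernel ratio: {ratio:.1f}x")
```

So for meaningful benchmarks, run the operation in a loop so the one-time compilation cost is amortized, as in the second example below.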
(Also, if you want to call clFinish() after every kernel launch, which makes things slower but makes the wallclock timings more representative, you can call cltorch.setAddFinish(1).) E.g.:
require 'cltorch'
a = torch.ClTensor(1000,1000):uniform()
cltorch.setEnableTiming(1)
cltorch.setProfiling(1)
cltorch.setAddFinish(1)
for i=1,10 do
a:add(1)
end
cltorch.synchronize()
print('timings:')
cltorch.dumpTimings()
print('')
print('profiling:')
cltorch.dumpProfiling()
print('')
Output:
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 940M
Timing activated
Profiling activated
AddFinish activated
statefultimer v0.7
timings:
dump enabled=1
StatefulTimer readings:
Apply END Apply_1t_1s_0pt_-2_*out += val1: 5.59424ms count=10
Apply compiled: 2.51904ms count=1
Apply getname: 0.048584ms count=10
Apply got kernel: 0.0800781ms count=10
Apply gotname: 0.0476074ms count=10
THClTEnsor_pointwiseApply END: 0.00927734ms count=10
THClTEnsor_pointwiseApply START: 0.0461426ms count=10
before dump: 0.0168457ms count=1
profiling:
Apply_1t_1s_0pt_-2_*out += val1.THClTensor_pointwiseApplyD 5.25517ms
Thanks hughperkins. That is very useful.
Like the unit tests, does this package already have any benchmark tests available?
Yes, sure! In https://github.com/hughperkins/clnn. Assuming you have installed using https://github.com/hughperkins/distro-cl into ~/torch-cl, you can run:
cd ~/torch-cl/opencl/clnn
luajit test/test-perf.lua
These are the same scripts as used in soumith's convnet-benchmarks.
That's great. Thank you so much.
Hi,
I am looking for a benchmark/profiler for cltorch to capture scores. Is it already there? If yes, can you tell me the procedure for capturing them?
Thanks, Subhani