zvookin opened this issue 8 years ago
For the generated OpenCL code, how can I measure the execution time of a kernel? Can I get the generated OpenCL source?
I would like to benchmark Halide execution on the CPU without including memory allocation time, and on the GPU without including memory allocations or the initial and final host <-> device copies. Is there a recommended way to do that?
For the GPU, just use nvprof to get the individual kernel timings. If the allocation cache is working correctly and you're not doing anything to trigger copies yourself, Halide should leave the inputs and outputs in GPU memory anyway. Still, nvprof is the most reliable way to get just the kernel timings.
For the CPU we don't really have a way to preallocate all the memory (something we've talked about a lot), but normally little time is spent inside the allocator. You could override halide_malloc with a custom caching allocator. You could also use perf to check where the time is going and see whether a significant fraction of it is actually inside the allocator.
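To illustrate the caching-allocator idea, here is a minimal, Halide-independent sketch of a size-bucketed cache over plain malloc. How you actually hook it in (e.g. replacing the weakly-linked halide_malloc/halide_free in an AOT runtime) depends on your build setup and is an assumption here, not part of this sketch:

```cpp
#include <cstdlib>
#include <map>
#include <vector>

// Size-bucketed caching allocator sketch. "release" parks a block for
// reuse instead of freeing it, so steady-state iterations of a pipeline
// pay no allocator cost after the first run.
class CachingAllocator {
    std::map<size_t, std::vector<void *>> free_lists_;

public:
    void *allocate(size_t bytes) {
        auto it = free_lists_.find(bytes);
        if (it != free_lists_.end() && !it->second.empty()) {
            void *p = it->second.back();  // cache hit: reuse a parked block
            it->second.pop_back();
            return p;
        }
        return std::malloc(bytes);  // cache miss: fall through to malloc
    }

    void release(size_t bytes, void *p) {
        free_lists_[bytes].push_back(p);  // park the block for reuse
    }

    ~CachingAllocator() {
        // Actually free everything that is still parked.
        for (auto &kv : free_lists_)
            for (void *p : kv.second)
                std::free(p);
    }
};
```

A real version would also want thread safety and a cap on cached bytes; this only shows the reuse mechanism you'd benchmark against.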
So in short, use sampling profilers to see where the time is going.
This question comes up fairly often. I hope we can build something into Generator to help with benchmarking, but a top-level document on doing this correctly for JIT, AOT, and GPU code would also be helpful.