halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

We should have an example or document to cover timing Halide execution. #1547

Open zvookin opened 7 years ago

zvookin commented 7 years ago

This question comes up relatively often. I hope we can build something into Generator to help with benchmarking, but a top-level document on doing this correctly for JIT, AOT, and GPU execution would be helpful.

xingjinglu commented 7 years ago

For the generated OpenCL code, how can I measure the execution time of a kernel? And can I get the generated OpenCL source?

Bastacyclop commented 4 years ago

I would like to benchmark Halide execution on CPU without including memory allocation time, and on GPU without including memory allocations or the initial and final host <-> device copies. Is there a recommended way to do that?

abadams commented 4 years ago

For GPU, just use nvprof to get the individual kernel timings. But if the allocation cache is working right, and you're not doing anything to trigger copies yourself, then Halide should just leave the inputs/outputs in GPU memory. Still, nvprof is the most reliable way to just get the kernel timings.
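On the command line, that suggestion looks roughly like the following. The binary name is a placeholder for your AOT-compiled Halide program; note that on recent CUDA toolkits nvprof has been superseded by the Nsight tools (nsys/ncu).

```shell
# One timing line per kernel launch and memcpy. Without flags, nvprof
# instead prints an aggregate per-kernel summary, which is often enough.
nvprof --print-gpu-trace ./my_pipeline
```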

For CPU we don't really have a way to preallocate all the memory (this is something we've talked about a lot), but normally not a lot of time is spent inside the allocator. You could override halide_malloc with a custom caching allocator. You could also use perf to check where the time is spent and see whether a significant fraction of it is inside the allocator.
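A caching override can be sketched as below. The function names and `(user_context, size)` shape match Halide's runtime hooks, which have weak linkage in the AOT runtime, so defining them in your application replaces the defaults; for JIT there is a corresponding custom-allocator hook on the pipeline. Everything else here is illustrative: the 128-byte alignment is an assumption (check HalideRuntime.h for the real contract), and a production version would need thread safety and a bounded cache.

```cpp
// Sketch of a size-caching allocator shaped like Halide's override hooks.
// NOT thread-safe; alignment value is an assumption, not Halide's spec.
#include <cstdint>
#include <cstdlib>
#include <map>
#include <vector>

namespace {
constexpr size_t kAlign = 128;                     // assumed alignment
std::map<size_t, std::vector<void *>> free_lists;  // size -> cached blocks
std::map<void *, size_t> live_sizes;               // payload -> size
}

extern "C" void *halide_malloc(void *user_context, size_t size) {
    auto &fl = free_lists[size];
    if (!fl.empty()) {  // reuse a previously freed block of the same size
        void *p = fl.back();
        fl.pop_back();
        return p;
    }
    // Over-allocate so we can align, stashing the raw pointer just
    // before the aligned payload so a real free could recover it.
    void *raw = malloc(size + kAlign + sizeof(void *));
    if (!raw) return nullptr;
    uintptr_t payload =
        (reinterpret_cast<uintptr_t>(raw) + sizeof(void *) + kAlign - 1) &
        ~(uintptr_t)(kAlign - 1);
    reinterpret_cast<void **>(payload)[-1] = raw;
    void *p = reinterpret_cast<void *>(payload);
    live_sizes[p] = size;
    return p;
}

extern "C" void halide_free(void *user_context, void *ptr) {
    if (!ptr) return;
    // Cache the block for reuse instead of returning it to the system.
    free_lists[live_sizes[ptr]].push_back(ptr);
}
```

With this linked into an AOT binary, the second and later realizations of a pipeline with stable buffer sizes should hit the cache instead of the system allocator, which is exactly the cost the benchmark wants to exclude.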

So in short, use sampling profilers to see where the time is going.