krrishnarraj / clpeak

A tool which profiles OpenCL devices to find their peak capacities
Apache License 2.0
386 stars 109 forks

[src] use CL_PROFILING_COMMAND_END as latency time #67

Open alohali opened 4 years ago

alohali commented 4 years ago

CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED is the real kernel latency.

alohali commented 4 years ago

Is it more accurate to measure kernel latency as CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED while running an extremely small kernel? I see a difference of more than 20 us on several ARM Mali GPU devices.

krrishnarraj commented 4 years ago

Thanks. I agree with the small kernel part. I am seeing more latency for cpu platforms like pocl. How can 'CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED' give better accuracy wrt CL_PROFILING_COMMAND_START?

alohali commented 4 years ago

> Thanks. I agree with the small kernel part. I am seeing more latency for cpu platforms like pocl. How can 'CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED' give better accuracy wrt CL_PROFILING_COMMAND_START?

Because kernel launch latency consists of pre-launch latency, post-launch latency, and other execution latency. CL_PROFILING_COMMAND_START - CL_PROFILING_COMMAND_QUEUED only covers the pre-launch part, not the post-launch part. CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED includes both. With an extremely small kernel, the real kernel execution time is almost zero.
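
To make the comparison concrete, here is a minimal sketch of the two intervals being discussed. It assumes the four timestamps have already been read with clGetEventProfilingInfo (which reports them in nanoseconds). The `EventTimes` struct and the helper names are hypothetical and are not part of clpeak.

```cpp
#include <cstdint>

// Hypothetical container for the values returned by clGetEventProfilingInfo
// for CL_PROFILING_COMMAND_QUEUED / _SUBMIT / _START / _END (nanoseconds).
struct EventTimes {
    std::uint64_t queued;
    std::uint64_t submit;
    std::uint64_t start;
    std::uint64_t end;
};

// Signed difference converted to microseconds, so an out-of-order pair
// shows up as a negative value instead of wrapping around.
inline double elapsed_us(std::uint64_t from, std::uint64_t to) {
    return static_cast<double>(static_cast<std::int64_t>(to - from)) / 1000.0;
}

// START - QUEUED: only the part before the kernel begins executing.
inline double pre_launch_us(const EventTimes& t) {
    return elapsed_us(t.queued, t.start);
}

// END - START: close to zero of real work for a near-empty kernel,
// so what remains is overhead.
inline double start_to_end_us(const EventTimes& t) {
    return elapsed_us(t.start, t.end);
}

// END - QUEUED: the interval proposed in this PR; it is the sum of the
// two intervals above.
inline double queued_to_end_us(const EventTimes& t) {
    return elapsed_us(t.queued, t.end);
}
```

Reporting `pre_launch_us` and `queued_to_end_us` side by side for the same event would show how much the START-based figure leaves out on a given device.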

nchristensen commented 1 year ago

From https://stackoverflow.com/questions/39924433/opencl-events-ambiguity it seems to me that CL_PROFILING_COMMAND_START - CL_PROFILING_COMMAND_SUBMIT is the pre-execution latency. CL_PROFILING_COMMAND_COMPLETE was added in OpenCL 2.0. I'm guessing CL_PROFILING_COMMAND_COMPLETE - CL_PROFILING_COMMAND_END is the post-execution latency.

There may also be a lower bound on CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_START, which might be another form of latency.

So CL_PROFILING_COMMAND_COMPLETE - CL_PROFILING_COMMAND_SUBMIT on a very small kernel may be a way to measure the latency.
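
A rough sketch of that breakdown, assuming an OpenCL 2.0+ device (needed for CL_PROFILING_COMMAND_COMPLETE) and timestamps already collected with clGetEventProfilingInfo over several launches of a near-empty kernel. Taking the minimum over runs as the estimate of the lower bound is my own assumption, and the `Sample` and `LatencyFloor` types are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-launch record: SUBMIT, START, END, COMPLETE in nanoseconds.
struct Sample {
    std::uint64_t submit;
    std::uint64_t start;
    std::uint64_t end;
    std::uint64_t complete;
};

// Smallest value observed for each interval, in microseconds.
struct LatencyFloor {
    double pre_us;    // START - SUBMIT (pre-execution)
    double exec_us;   // END - START (possible lower bound on execution)
    double post_us;   // COMPLETE - END (guessed post-execution)
    double total_us;  // COMPLETE - SUBMIT
};

inline double us(std::uint64_t from, std::uint64_t to) {
    return static_cast<double>(static_cast<std::int64_t>(to - from)) / 1000.0;
}

// Returns the per-interval minimum over all runs; all zeros if runs is empty.
inline LatencyFloor latency_floor(const std::vector<Sample>& runs) {
    LatencyFloor best{0.0, 0.0, 0.0, 0.0};
    bool first = true;
    for (const Sample& r : runs) {
        const LatencyFloor cur{us(r.submit, r.start), us(r.start, r.end),
                               us(r.end, r.complete), us(r.submit, r.complete)};
        if (first) {
            best = cur;
            first = false;
            continue;
        }
        best.pre_us = std::min(best.pre_us, cur.pre_us);
        best.exec_us = std::min(best.exec_us, cur.exec_us);
        best.post_us = std::min(best.post_us, cur.post_us);
        best.total_us = std::min(best.total_us, cur.total_us);
    }
    return best;
}
```

Each minimum is taken independently, so the three parts need not come from the same launch and may sum to less than `total_us`.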