Thanks, Mike, for reporting this. Are you using CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE? It influences some driver scheduling behaviors. I assume this is on Linux. FWIW, we also have slightly different behavior on Windows and Linux, caused by high submission overhead on Windows.
We do have a limited precision timer for profiling that can wrap between START/END. We should be handling the wrap case as long as the kernel duration doesn't cause it to wrap more than once.
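For reference, handling a single wrap of a limited-precision timer is just modular arithmetic; here is a minimal sketch, not the driver's actual code, and the bit width is an assumption for illustration only:

#include <cstdint>

// Minimal sketch: if the profiling timer only has kTimestampBits valid bits,
// masking the difference recovers the correct delta as long as the timer
// wrapped at most once between START and END.
constexpr unsigned kTimestampBits = 36;                       // assumed width
constexpr uint64_t kTimestampMask = (1ull << kTimestampBits) - 1;

uint64_t gpuTickDelta(uint64_t startTicks, uint64_t endTicks) {
    // If endTicks < startTicks (one wrap), the mask turns the negative
    // two's-complement result into the correct positive distance.
    return (endTicks - startTicks) & kTimestampMask;
}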
It's Khronos's CTS (conformance test suite); to be honest, I'm not sure what it uses internally, I didn't look into the code. And unfortunately it's not Linux: I'm currently in the process of porting the whole OpenCL stack for Intel GPUs to QNX. In general everything works pretty well; it has been tested on Apollo Lake and Skylake so far, and Kaby Lake is on the way.
I'm trying to separate issues that could have been introduced by the QNX port from generic ones. I took a quick look at the compute-runtime source code: the timestamp queries in the code are POSIX calls, so these issues were probably not introduced by the porting process.
I have one more CTS test failure, but I'm not yet sure whether it is a GPU issue or an OpenCL driver issue; it's related to floating-point rounding.
Hi, regarding the profiling tests: the test calls https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#clGetEventProfilingInfo. It looks like the timestamps returned for particular cl_profiling_info values are incorrect on your setup. Do you have a chance to trace this path in your code and verify that the module responsible for returning timestamps does it correctly? Our internal tests on SKL/APL show correct results.
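For context, the query path in question boils down to calls like the following (a self-contained sketch, not the CTS code; the event is assumed to come from a queue created with CL_QUEUE_PROFILING_ENABLE and to have completed):

#include <CL/cl.h>
#include <cstdio>

// Read back the four profiling timestamps for a completed event.
void printProfilingTimestamps(cl_event event) {
    const cl_profiling_info infos[] = {
        CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT,
        CL_PROFILING_COMMAND_START, CL_PROFILING_COMMAND_END};
    for (cl_profiling_info info : infos) {
        cl_ulong timestampNs = 0;
        if (clGetEventProfilingInfo(event, info, sizeof(timestampNs),
                                    &timestampNs, nullptr) == CL_SUCCESS) {
            printf("0x%04x: %llu ns\n", (unsigned)info,
                   (unsigned long long)timestampNs);
        }
    }
}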
Regarding the test_events waitlists tests: we observe the same warnings as you described in our internal results. It looks like the test starts the commands here, then only checks whether the command is queued or submitted here, then whether it is running here, and finally whether it is completed here. This is the place where the warning is printed. After that line, the code waits on event status[2] in case it has not completed yet.
I think this warning is fine and the test works correctly on your setup.
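For reference, the status-checking pattern described above amounts to something like this (a sketch under my reading of the test, not the CTS source):

#include <CL/cl.h>

// Query the current execution status of an event (CL_QUEUED, CL_SUBMITTED,
// CL_RUNNING or CL_COMPLETE).
cl_int queryEventStatus(cl_event event, cl_int *statusOut) {
    return clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                          sizeof(cl_int), statusOut, nullptr);
}

// After printing the warning, wait for the event only if it has not
// completed yet.
cl_int waitIfNotComplete(cl_event event) {
    cl_int status = CL_QUEUED;
    cl_int err = queryEventStatus(event, &status);
    if (err != CL_SUCCESS)
        return err;
    return (status == CL_COMPLETE) ? CL_SUCCESS : clWaitForEvents(1, &event);
}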
Closing, as an answer has been provided. If you have follow-up questions, please re-open or start a new issue.
Hi, I'd like to reopen this issue. I tracked the timestamp usage in the code, and it seems that GPU timestamps are used for COMMAND_START and COMMAND_END, while OS time is used for the other timestamps. The algorithm that adjusts CPU time using GPU deltas doesn't work properly: some adjusted GPU timestamps end up in the past.
Thank you for re-opening it. This time I came with a solution. The main issue is in the function calculateProfilingDataInternal() in /opencl/source/event/event.cpp: it performs delta calculations using two different timers, CPU and GPU, which is logically incorrect.
For example, the CPU timer resolution is 1 ms (1,000,000 ns), while the GPU timer resolution is 83.33 ns on Skylake. Whenever the two timers are used in one delta calculation, we produce an unknown gap of 1,000,000 +/- 83.33 ns. We don't know where exactly the CPU timer is within its tick: it may have just switched to a new millisecond, be in the middle of one, or be right about to switch to the next. So a delta between CPU and GPU timestamps is meaningless most of the time.
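To make the gap concrete, here is a small worked illustration (the numbers are made up for the example, they are not driver values):

#include <cstdint>
#include <cstdio>

// A 1 ms CPU clock reports the same value for every moment inside a
// millisecond, so any delta that mixes it with an ~83 ns GPU-derived
// timestamp carries an uncertainty of up to one full CPU tick.
int main() {
    const uint64_t cpuTickNs = 1000000;        // 1 ms CPU resolution
    uint64_t realSubmitNs = 2999900;           // real submission moment
    uint64_t reportedSubmitNs =
        (realSubmitNs / cpuTickNs) * cpuTickNs;  // CPU clock reports 2,000,000
    uint64_t gpuDerivedStartNs = 3000050;      // START converted from GPU ticks

    printf("real submit-to-start:     %llu ns\n",
           (unsigned long long)(gpuDerivedStartNs - realSubmitNs));      // 150 ns
    printf("reported submit-to-start: %llu ns\n",
           (unsigned long long)(gpuDerivedStartNs - reportedSubmitNs));  // 1,000,050 ns
    // With the calibration error in the opposite direction, START lands
    // before SUBMIT and the unsigned subtraction underflows, which is what
    // the CTS failure shows.
    return 0;
}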
I modified the code to calculate proper GPU deltas and apply them to the CPU timestamps, which have the coarser resolution. This fixed all the problems I described in the initial post and always produces robust results.
CL_PROFILING_COMMAND_QUEUED: 3725440318000557
CL_PROFILING_COMMAND_SUBMIT: 3725440319000557
CL_PROFILING_COMMAND_START: 3725440319535140
CL_PROFILING_COMMAND_END: 3725440319548973
CL_DEVICE_PROFILING_TIMER_RESOLUTION: 83
But the QUEUED and SUBMIT timestamps are still fully CPU-based, which is incorrect, because CL_DEVICE_PROFILING_TIMER_RESOLUTION reports the GPU resolution, not the CPU's. And we can clearly see that the difference between SUBMIT and QUEUED is still exactly 1 ms, which is a very imprecise timing.
I would consider removing the use of CPU timestamps for profiling altogether and using GPU timestamps everywhere.
Thanks. Here is my proposed patch to fix the current issue where the START timestamp comes before the SUBMIT and QUEUED timestamps.
I reposted the proposed patch to Pastebin, because I can't post it here without the board engine mangling it: https://pastebin.com/dMuVhqET
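Since the patch itself is only reachable through the Pastebin link, here is a rough, hypothetical sketch of the idea as I described it above, not the actual patch; field and variable names are mine, and which CPU/GPU reference pair is actually available depends on the driver:

#include <cstdint>

// Keep QUEUED/SUBMIT on the CPU clock, but derive START and END by adding
// GPU-tick deltas (converted to nanoseconds) on top of a CPU timestamp that
// was captured together with a GPU reference timestamp. That way START can
// never be reported as earlier than SUBMIT.
struct ProfilingTimestamps {
    uint64_t queuedNs;   // CPU clock at the clEnqueue* call
    uint64_t submitNs;   // CPU clock at submission
    uint64_t startNs;    // derived below
    uint64_t endNs;      // derived below
};

void deriveStartEnd(ProfilingTimestamps &ts,
                    uint64_t referenceGpuTicks,  // GPU ticks paired with submitNs
                    uint64_t startGpuTicks, uint64_t endGpuTicks,
                    double nsPerGpuTick) {
    // All deltas are computed purely in the GPU-tick domain and only then
    // converted to nanoseconds and anchored to the CPU-based timestamp.
    uint64_t submitToStartNs = static_cast<uint64_t>(
        (startGpuTicks - referenceGpuTicks) * nsPerGpuTick);
    uint64_t startToEndNs = static_cast<uint64_t>(
        (endGpuTicks - startGpuTicks) * nsPerGpuTick);
    ts.startNs = ts.submitNs + submitToStartNs;
    ts.endNs = ts.startNs + startToEndNs;
}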
This is interesting; the CPU should have nanosecond resolution, not millisecond. Why do you observe a CPU resolution of 1 ms?
Can you try changing the clock type used here:
bool OSTimeLinux::getCpuTime(uint64_t *timestamp) {
    struct timespec ts;
    if (getTimeFunc(CLOCK_MONOTONIC_RAW, &ts)) {
        return false;
    }
and here:
double OSTimeLinux::getHostTimerResolution() const {
    struct timespec ts;
    if (resolutionFunc(CLOCK_MONOTONIC_RAW, &ts)) {
        return 0;
    }
to one of the following clock types (see https://linux.die.net/man/3/clock_gettime):
CLOCK_REALTIME
System-wide realtime clock. Setting this clock requires appropriate privileges.
CLOCK_PROCESS_CPUTIME_ID
High-resolution per-process timer from the CPU.
CLOCK_THREAD_CPUTIME_ID
Thread-specific CPU-time clock.
And could you report whether any of these has better resolution on your setup than the default CLOCK_MONOTONIC_RAW? CLOCK_PROCESS_CPUTIME_ID looks like it may be a better candidate than CLOCK_MONOTONIC_RAW.
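A tiny standalone check like the one below can report the resolution of each candidate clock on a given system (this is just a diagnostic sketch, not driver code; the #ifdef is there because CLOCK_MONOTONIC_RAW is absent on QNX):

#include <stdio.h>
#include <time.h>

// Print the resolution reported by clock_getres() for each candidate clock.
int main() {
    struct NamedClock { clockid_t id; const char *name; };
    const NamedClock clocks[] = {
        {CLOCK_REALTIME, "CLOCK_REALTIME"},
        {CLOCK_MONOTONIC, "CLOCK_MONOTONIC"},
#ifdef CLOCK_MONOTONIC_RAW
        {CLOCK_MONOTONIC_RAW, "CLOCK_MONOTONIC_RAW"},
#endif
        {CLOCK_PROCESS_CPUTIME_ID, "CLOCK_PROCESS_CPUTIME_ID"},
        {CLOCK_THREAD_CPUTIME_ID, "CLOCK_THREAD_CPUTIME_ID"},
    };
    for (const NamedClock &c : clocks) {
        struct timespec res;
        if (clock_getres(c.id, &res) == 0)
            printf("%-26s resolution: %ld ns\n", c.name,
                   (long)(res.tv_sec * 1000000000L + res.tv_nsec));
        else
            printf("%-26s not supported\n", c.name);
    }
    return 0;
}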
I'm working on the QNX port of the whole compute-runtime stack, so it's not Linux. Overall the port is in good shape, but this issue was the last showstopper to passing all CTS tests.
QNX doesn't have CLOCK_MONOTONIC_RAW, only CLOCK_MONOTONIC, and it is millisecond-based, not nanosecond-based: https://www.qnx.com/developers/docs/7.0.0/#com.qnx.doc.neutrino.prog/topic/timing.html
The main idea behind that design is to avoid trips into the kernel (system calls) to request CPU ticks, since the tick counter is updated by the OS scheduler.
In general I could move everything to fractions of a nanosecond by using the CPU's TSC, which is synchronized across cores, but GPU ticks look preferable to me for this purpose.
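For completeness, the TSC route mentioned above would look roughly like this (a hypothetical sketch, nothing here is driver code; it assumes an invariant TSC synchronized across cores, and the calibration factor has to be measured once against a known reference clock):

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// Hypothetical TSC-based timestamp source. gTscTicksPerNs must be calibrated
// once (e.g. by sampling __rdtsc() against a reference clock over a known
// interval) before readTscNs() is used.
static double gTscTicksPerNs = 0.0;

uint64_t readTscNs() {
    return static_cast<uint64_t>(static_cast<double>(__rdtsc()) / gTscTicksPerNs);
}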
Obtaining GPU timestamps for QUEUED/SUBMIT is an expensive process: we need to go to the KMD, which reads GPU registers, and that takes a significant amount of time, making profiling less accurate and adding a lot of overhead. Also, we need correlation with CPU timestamps, so those GPU timestamps would have to be moved into the CPU domain anyway. So if we want to make any driver changes, I would rather use the most accurate CPU clock available on your system than move to GPU timestamps for everything.
The proposed patch doesn't make any new trips to the KMD/kernel; it just reuses data that is obtained anyway. And if the trip is that expensive under Linux, then every existing request for a GPU timestamp already stretches the CPU timestamps by the time needed to reach the kernel, because the compute runtime reads those values regardless.
We do not query the GPU timestamp for SUBMIT; we only query it for QUEUED, so submitTimeStamp.GPUTimeStamp is not available (it is actually 0). This was changed recently in order to optimize profiling.
I think all the questions have been resolved. If follow-up is needed, please reopen.
I tested this using the master branch on Apollo Lake and Skylake GPUs; the results are identical on both platforms. Here are more details:
test_events complains in the waitlists subtests about overly asynchronous behavior: no SUBMITTED state is observed, just COMPLETE:
-- Testing NDRangeKernel (waiting on 1 event)...
Executing reference event 0, then test event 2 with reference event 0 in its waitlist.
Event status after starting reference events: reference event 0: CL_SUBMITTED, reference event 1: N/A, test event 2: N/A.
Event status after starting test event: reference event 0: CL_SUBMITTED, reference event 1: N/A, test event 2: CL_SUBMITTED.
Event status after waiting for reference event 0: reference event 0: CL_COMPLETE, reference event 1: N/A, test event 2: CL_COMPLETE.
WARNING: Test event already completed. Assumed valid.
test_profiling fails many subtests in a seemingly random manner; here is an example:
CL_PROFILING_COMMAND_QUEUED: 9230171918039386
CL_PROFILING_COMMAND_SUBMIT: 9230171919039386
CL_PROFILING_COMMAND_START: 9230171918702511
CL_PROFILING_COMMAND_END: 9230171918713761
CL_DEVICE_PROFILING_TIMER_RESOLUTION: 52
Profiling info:
Time from queue to submit : 1.000000ms
Time from submit to start : 18446744073709.218750ms
Time from start to end: 0.011250ms
CL_PROFILING_COMMAND_SUBMIT > CL_PROFILING_COMMAND_START. Time between submit and start is too big: 18446744073.709217s, test limit: 30.000000s.
It seems that the causal ordering of the events is heavily violated: START is reported before SUBMIT, and the huge submit-to-start time is roughly 2^64 ns, i.e. an unsigned underflow of the negative delta.
Thanks!