intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.
Other
1.22k stars 732 forks source link

HIP profiling submission time query returns weird values #12904

Open steffenlarsen opened 7 months ago

steffenlarsen commented 7 months ago

Describe the bug

In https://github.com/intel/llvm/pull/12838 it seems like the submission time on HIP is giving weird values. I did a bit of digging and it seems to me like HIP is a little different from CUDA when checking timing-differences between events. Of particular interest here is the following line for hipEventElapsedTime():

Events which are recorded in a NULL stream will block until all commands on all other streams complete execution, and then record the timestamp.

While what we expect here is to get an event with the current time, hence using an otherwise unused stream. Fixing it might be outside the scope of this PR, but a possible solution could be to lazily have a stream specifically for recording submission time of events, tied to the context. Similar could be used in the CUDA backend to avoid the assumption noted above.

_Originally posted by @steffenlarsen in https://github.com/oneapi-src/unified-runtime/pull/1400#discussion_r1508693119_

To reproduce

No response

Environment

No response

Additional context

This affects https://github.com/intel/llvm/pull/12838, but should be reproducible on normal profiling queues. When this is fixed, sycl/test-e2e/ProfilingTag/ tests should be enabled for HIP.

konradkusiak97 commented 6 months ago

Hi @steffenlarsen, I've been trying to reproduce this issue and have been playing with the fix for it but now I concluded that we're already doing what you've suggested. I think the relevant line in the hip adapter is here: https://github.com/oneapi-src/unified-runtime/blob/a504ead8b9fa3b70ca90d40e21bd417eee2f204b/source/adapters/hip/event.cpp#L64

We're calling hipEventRecord with Queue->get() which invokes Queue->getNextComputeStream and that provides a new stream for the recording. However, we do get reports about bad event recordings in hip so I'm still questioning if there is more to it. Do you have any reproducible for your issue that I could try out?

So far I've been trying to check if the following two asynchronous events would give the same timings, even if we change their order:

  event e = q.submit([&](handler &h) {
    h.parallel_for(array_size_small, [=](id<1> i) {
      sum[i] = a[i] + b[i]; 
   });
  });

  event e2 = q.submit([&](handler &h) {
    h.parallel_for(array_size_big, [=](id<1> i) {
      sumTwo[i] = c[i] + d[i];
    });
  });

Assuming that the second event runs much longer than the first one if we're using the NULL stream, the first event will wait until the second is finished (according to what you've found above). But that's not what I've seen, the kernel with array_size_small gives the same timing results no matter if it runs in the first or the second event. So the root cause might be somewhat more complicated.

steffenlarsen commented 5 months ago

@konradkusiak97 - I do not know for certain if the exact issue is as I described, but I would say the quote is a glaring issue compared to the semantics of the corresponding CUDA interfaces. The intention of evQueued is to record the "device event" representing the time of the call to the given command. In CUDA that is okay to do with the NULL-stream, assuming it is unused (which itself can be a pretty volatile assumption) as it should just record the "now". However, based on the quote above from the HIP documentation, events recorded to the NULL-stream in HIP does not work that way and so even if there are other issues, this is definitely an issue as well.

Arguably evBase could also be an issue (https://github.com/oneapi-src/unified-runtime/blob/a504ead8b9fa3b70ca90d40e21bd417eee2f204b/source/adapters/hip/context.cpp#L58) if there are non-SYCL related work on HIP in the program, but I suspect that's much less likely.

For tests showing weird behavior, you could try enabling sycl/test-e2e/ProfilingTag tests for HIP.

konradkusiak97 commented 4 months ago

You're right, for EvQueued it definitely sounds like it shouldn't be on the NULL stream for HIP. I`ve finally got what you've meant :slightly_smiling_face: I made the changes to it here: https://github.com/oneapi-src/unified-runtime/pull/1634

steffenlarsen commented 3 months ago

Thank you, @konradkusiak97! Would you mind checking if we can reenable the profiling tag tests on HIP?

konradkusiak97 commented 3 months ago

Checked it a while ago actually, they all seem to pass on gfx90a :+1: