intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

Host task lacks proper synchronization capabilities when used for interoperability #11284

Open densamoilov opened 1 year ago

densamoilov commented 1 year ago

From SYCL specification: 4.10. Host tasks:

A host task can optionally be used to interoperate with the native backend objects associated with the queue executing the host task, the context that the queue is associated with, the device that the queue is associated with and the accessors that have been captured in the callable, via an optional interop_handle parameter.

This allows host tasks to be used for two purposes: either as a task which can perform arbitrary C++ code within the scheduling of the SYCL runtime or as a task which can perform interoperability at a point within the scheduling of the SYCL runtime.

So one of the main purposes of the host task is to provide users with the capabilities to embed a 3rd party library call into the SYCL programming model. The problem is that the embedding doesn't work properly.

Example of the problem:

    // CUDA backend.
    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(...);
        });
    });
    e.wait(); // the host task is guaranteed to be completed but `cudnnAddTensor` is not.

The out-of-order queue for the CUDA and HIP backends is implemented using multiple CUDA/HIP streams inside a single SYCL queue. In the very beginning the SYCL queue always contained a single CUDA/HIP stream, so the lack of synchronization was not a big problem at that point: operations were always executed in order.

After out-of-order execution was implemented via multiple streams, the problem became severe and requires a proper solution; otherwise the host task becomes nearly unusable for one of its main purposes: interoperability. On top of that, the profiling capabilities do not work either. As an option, for CUDA (the same can be used for HIP as well), the problem could be solved using cudaEvent_t. While submitting the host task, the SYCL RT could use cudaEventRecord to capture the state of the stream after the host task executes, so that when the operation submitted within the host task completes, the cudaEvent_t notifies us about it. The cudaEvent_t can then be wrapped in a SYCL event and returned to the user. I understand that it is probably hard for the SYCL RT to identify automatically when it should call cudaEventRecord, so as an option SYCL could provide an API letting users tell the SYCL RT about it.

JackAKirk commented 1 year ago

Yeah, this is an issue we've come across in a few libraries that assumed in-order queue properties when using a default (out-of-order) queue. Generally the solution is to use an in-order queue so that we know there is a single stream that needs to be sync'd.

I think the smart question to ask for a given application is firstly whether it gets any advantage from using an out-of-order queue. If not, then it is no bother for that application to switch to in-order. If there is no advantage for the vast majority of applications, then I'd question the value of a lot of work implementing the enhancement you suggest: you can end up spending a lot of time building a triangular house and then later realize that wasn't what the users needed at all. A bit more detail:

We found that in the majority of HeCbench tests an out-of-order queue made no difference to performance. HeCbench benchmarks are quite often kernel-style benchmarks that, despite being relevant for real-world applications, often form only one part of a real-world application. The HeCbench benchmarks that we identified as leading to a speedup with OOO queues mostly related to things like oneDNN, where the solution will be via the Graphs API anyway. This obviously raises the question of the place of out-of-order queues. But a real answer to whether OOO queues are valuable requires a benchmark suite that is representative of the (apparently quite countable: it needs to be roughly SIMD, computationally complex, and something worth computing in our world) set of application domains used by these GPU backends. As I understand it, this is the purpose of Velocity-bench. Hopefully as that gets more fleshed out it will help establish more clearly whether things like OOO queues are really useful to people, as opposed to the new graph API or just in-order queues.

Our initial benchmarking suggested that it isn't a big priority.

mehdi-goli commented 1 year ago
densamoilov commented 1 year ago

@JackAKirk, I think there are two things to consider:

While performance issues can be prioritized based on demand, correctness issues should always be the top priority: a slow workflow is not critical, while a broken workflow usually is. This issue is about correctness rather than performance.

Typically it's up to the users to figure out whether their workloads can benefit from using an out-of-order queue. As a library, it is our responsibility to provide users with correctly working functionality.

Even if we say that we (as a library) do not support out-of-order queues for CUDA and HIP backends it won't solve the problem because of the following:

Since the goal is to provide CUDA users with an alternative, we have to support what CUDA users expect us to support (at least the 2 things above are missing now). So given the goal, I would suggest prioritizing fixing the issues.

JackAKirk commented 1 year ago

I believe the issue is that cuDNN calls are often asynchronous. If you use an in-order queue, only a single stream is used, which makes the execution synchronous automatically. If you use multiple CUDA streams, multiple in-order SYCL queues, or an out-of-order SYCL queue, in tandem with asynchronous cuDNN calls, and expect the execution of follow-on kernels to be synchronous wrt these cuDNN calls, then I believe you need to do one of two things, and you can choose based on your use case:

I think the above is the expected way of dealing with interoperability, and I don't think any of this contradicts the SYCL spec you quoted. However, I don't know what you mean by event profiling being broken. If you could expand on that, I think we can see if there is something we can fix in the runtime.

densamoilov commented 1 year ago

Using cuStreamSynchronize would work for the CUDA and probably the HIP backend. However, it would not seem to work for OpenCL/Level Zero or any other backend that supports out-of-order streams out of the box, because such an approach requires the native stream to always be in order.

However I don't know what you mean by event profiling being broken.

That was based on the assumption that the host task execution time is not equal to the cuDNN operation execution time, which makes the information queried from events useless. Let me try the fix and check whether the profiling information is correct.

UPDATE: The event timings are incorrect even with the fix because the event doesn't know when the operation started.

densamoilov commented 1 year ago

One more thing. When I use cuStreamSynchronize, it might happen that between the cuDNN API call and the cuStreamSynchronize call another thread submits something to the stream, in which case the submitted host task will be completed only when that other work is completed. It might even lead to a deadlock. Do you know if there is a way to avoid it?

For example, in the code below, assuming queue.get_native(...) returns the same stream, it might happen that 2 cudnnAddTensor operations are submitted and only then cuStreamSynchronize(...) is called. In this case the host tasks will take much more time to complete.

    // CUDA backend.
    q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(...);
            cuStreamSynchronize(...);
        });
    });
    q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(...);
            cuStreamSynchronize(...);
        });
    });
    q.wait();
JackAKirk commented 1 year ago

Using cuStreamSynchronize would work for the CUDA and probably the HIP backend. However, it would not seem to work for OpenCL/Level Zero or any other backend that supports out-of-order streams out of the box, because such an approach requires the native stream to always be in order.

However I don't know what you mean by event profiling being broken.

That was based on the assumption that the host task execution time is not equal to the cuDNN operation execution time, which makes the information queried from events useless. Let me try the fix and check whether the profiling information is correct.

UPDATE: The event timings are incorrect even with the fix because the event doesn't know when the operation started.

Have you tried something like this, which I think should work:

    CUevent cudaEvent;
    // CUDA backend.
    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudaEvent=cudnnAddTensor(...);
        });
    });
    e.wait(); // the host task is guaranteed to be completed but `cudnnAddTensor` is not.
    sycl::event interopEvent = make_event<sycl::backend::ext_oneapi_cuda>(cudaEvent);
    // Then do whatever you want with `interopEvent`: timings, synch, etc.

@AerialMantis is going to be able to answer the spec questions much better than me. I may be wrong, but I thought it is not expected that the sycl::event e in the above code would map to timings of the native CUDA kernels called by the interop host task. As you say, once the host task has finished executing the contents of its "kernel", event e has finished. I thought the point was that once you interop to e.g. native CUDA, you live in CUDA land, and the SYCL runtime is not responsible for e.g. kernel timings. Whatever is right, it would be nice if the spec were clearer, as you say, but then it seems likely that all this is subject to change. The above is just me trying to explain what I think you can do as things stand.

I don't know about the nuts and bolts of what happens with Level Zero, and I can understand that there might be things you want/need that aren't yet available in SYCL 2020. Have you looked at whether sycl graphs has solutions to any of the issues you are facing? I know relatively little about sycl graphs, but I was under the impression it would be used in things like oneDNN, and I imagine it will also have to interact with host tasks. Wrt future changes to SYCL, I wonder whether sycl graphs is the correct place to look. But these are just my guesses; I'd recommend speaking to someone from the sycl graphs team.

JackAKirk commented 1 year ago

One more thing. When I use cuStreamSynchronize it might happen that between cuDNN API and cuStreamSynchronize calls another thread can submit something to the stream in which case the submitted host task will be completed only when that something is completed. It might even lead to a deadlock. Do you know if there is a way to avoid it?

For example, in the code below, assuming queue.get_native(...) returns the same stream, it might happen that 2 cudnnAddTensor operations are submitted and only then cuStreamSynchronize(...) is called. In this case the host tasks will take much more time to complete.

    // CUDA backend.
    q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(...);
            cuStreamSynchronize(...);
        });
    });
    q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(...);
            cuStreamSynchronize(...);
        });
    });
    q.wait();

Yeah, good question. I think at the moment host tasks do not have a thread pool (again, my memory may fail me), so I think what you describe can only happen if the user explicitly uses multithreading. In such cases I think you have to use locks like in https://github.com/intel/llvm/pull/6102, and APIs like cuStreamQuery will probably also be useful.

densamoilov commented 1 year ago

I think at the moment host tasks do not have a thread pool (again my memory may fail me), so I think what you describe can only happen if the user explicitly uses multithreading.

My understanding is different. From what I saw in the past, a submitted host task was executed in a separate TBB thread. Within oneDNN, it's probably possible to make sure that the submission is performed in the correct order, but what happens outside oneDNN is out of our control. Nothing seems to prevent users from submitting more work to the queue in separate threads. I don't see how this problem can be solved in general without new CUDA functionality (e.g. exclusive access to a stream, something like cuStreamLock(stream)). Also, the issue is not specific to using cuStreamSynchronize. In CUDA the events are effectively stream markers and are not returned by the cuDNN API. For example:

    cudaEvent_t cudaEvent;
    cudaEventCreate(&cudaEvent);
    auto cuda_stream = sycl::get_native<sycl::backend::ext_oneapi_cuda>(q);
    // The handle is attached to `cuda_stream`.
    cudnnAddTensor(handle, ...);
    cudaEventRecord(cudaEvent, cuda_stream);
    // Now `cudaEvent` can be used to wait on the `cudnnAddTensor` operation.

In the code above we still have the same race condition problem as with cuStreamSynchronize.

I may be wrong, but I thought that it is not expected that the sycl::event e in the above code would map to timings of the native cuda kernels that are called by the interop host task.

You are probably right because obtaining info::event_profiling::command_start and info::event_profiling::command_end information for host task doesn't formally require any events. It's just that for the interop cases the host task timings are not really useful in practice. That's why I think it would make sense to consider introducing a way to tell SYCL RT that we want the timings to be based on the events. Something like:

    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudaEventRecord(cuda_event_start, ...);
            cudnnAddTensor(...);
            cudaEventRecord(cuda_event_stop, ...);
            ih.register_events(cuda_stream, {cuda_event_start, cuda_event_stop});
        });
    });
    e.wait(); // here `e` contains the CUDA events and therefore can be used to query
              // profiling info and for waiting.

You also mentioned that we could use the interop API to create a SYCL event: sycl::event interopEvent = make_event<sycl::backend::ext_oneapi_cuda>(cudaEvent);. Given that the events are markers (so there should be 2 CUDA events) and that the CUDA wait function for events requires passing a CUDA stream, it seems that the interop API you suggested creates a SYCL event that cannot be queried for profiling information (obtained from the CUDA event). I'm also not sure how the sycl::event::wait() function is implemented in that case. Can you please clarify?

I don't know about the nuts and bolts of what happens with level_zero and I can understand that there might be things you want/need that isn't yet available in SYCL 2020

I mentioned Level Zero and OpenCL just as an example. These backends do not provide a cuStreamSynchronize-like API, therefore we would probably have to create a SYCL event with the interop API. This is just something to keep in mind; we don't have host task use cases for these backends.


A short summary of the discussion at this point:

JackAKirk commented 1 year ago

I think at the moment host tasks do not have a thread pool (again my memory may fail me), so I think what you describe can only happen if the user explicitly uses multithreading.

My understanding is different. From what I saw in the past a submitted host task was executed in a separate TBB thread.

You could well be right. I know that thread pools were involved at some point. On an aside for this particular point: I think it is worth pointing out that even if you have thread pools, you can deal with this by using the host task events and waiting on them: then a second host task can only execute once the first has at least executed all the potentially asynchronous interop tasks. I think that even if an impl uses thread pools it would have to be constrained by this to be spec compliant.

    // CUDA backend.
    sycl::event hostEvent = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(...);
            cuStreamSynchronize(...);
        });
    });
    q.submit([&](sycl::handler &cgh) {
        cgh.depends_on(hostEvent);
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(...);
            cuStreamSynchronize(...);
        });
    });
    q.wait();

On your general points about profiling and dealing with native events such that native event info can be passed to the SYCL runtime: I see your points. I have also discovered that essentially the same idea has already been proposed to and discussed by the SYCL working group: in a nutshell, the proposed change was that host_tasks could return an optional vector of native events, similar to what you propose. Apparently there were some problems raised with this approach, however.

I think it is worth going a little further and considering whether such usages of host tasks, particularly for the deep learning library example in question, would also be helped by some sycl graph support. I.e would there be any advantage to "define a oneDNN dependency graph to the SYCL runtime prior to execution". Some potential advantages are listed in this spec document: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

I can see the answer to this question may be in the negative, since, for example, on the question of "Can multiple e.g. cudnn calls be fused?": since they are called from the host directly, I guess this is generally not possible. But my guess here could well be wrong since I know little about it.

But I think whether sycl graphs should extend support to host_task is a valid question to consider before diving into how to better support sycl OOO queues for host_task interop. It is easy to dive into an implementation and then realize that sycl OOO queues are not really fit for this purpose and SYCL Graphs could be better.

For example, one complication I imagine with interacting well between interop host_tasks and SYCL OOO queue task submissions is that in the impls of OOO queues, the mapping of queue task submissions to the underlying native queues (CUstream for cuda) is complex and (as currently implemented) cannot be known a priori by the submitter. I imagine that the mapping of native streams to sycl graph task submissions is better defined, so that assuming knowledge of this mapping could be put to use by an application, it might put sycl OOO queues at a disadvantage. I will ask the SYCL graphs team for their views on whether sycl graphs should have some support for host_tasks, and whether such support would be suitable for oneDNN.

AerialMantis commented 1 year ago

I think I can add some context here for what the SYCL 2020 specification currently supports and how we'd like to improve this going forward.

So as @JackAKirk mentioned the issue lies in what the SYCL event returned from a host task command represents. Currently, that event only represents the completion of the host task function, so in the following example:

auto e = q.submit([&](sycl::handler &cgh) {
  auto ht = [=](const sycl::interop_handle &ih) {
     cudnnAddTensor(...);
  };
  cgh.host_task(ht);
});
e.wait();

the call to wait only waits for the completion of ht and not the completion of any asynchronous commands enqueued to the native stream. This is the reason why there is no synchronization with the cudnnAddTensor operation and also why the event profiling doesn't provide any useful information.

This is a limitation in the current design: any native operations enqueued within a host task function must be synchronized within that same host task function in order to guarantee dependencies are handled correctly, as in the example below:

auto e = q.submit([&](sycl::handler &cgh) {
  auto ht = [=](const sycl::interop_handle &ih) {
    cudnnAddTensor(...);
    cuStreamSynchronize(...);
  };
  cgh.host_task(ht);
});
e.wait();

There are some workarounds: either use an in-order queue (e.g. construct the queue with the sycl::property::queue::in_order property) or synchronize with the queue before enqueueing another host task. Both of these will work in DPC++ when targeting Nvidia; however, they rely on implementation details, so they are not guaranteed to work on all backends or with other SYCL implementations. I'm also not too familiar with the Level Zero adapter implementation, so I can't really comment on that.

We are currently working on an extension along the lines of what you suggest: there would be a new interface on the interop_handle which would allow you to add native events, and these would then be added to the SYCL event returned from submit. This supports both conditionally adding a dependency on native events and adding multiple events. However, there is a further complication: for the returned SYCL event to incorporate the native events, we need to change when the host task is run; it needs to run at submission time rather than at DAG execution time.

I agree with your conclusion that the best option currently is to use cuStreamSynchronize within the host task function. Unfortunately, this will likely have a performance impact relative to how we would prefer to do it, but it will guarantee correctness while we wait for a more ideal solution.

JackAKirk commented 1 year ago

I've brought this up with the sycl-graph team. They say they are open to discussing host_task interop support in their meetings with the oneDNN team.

chrisreedge commented 1 year ago

We have seen run-to-run inconsistency with our oneAPI/oneDNN applications (utilizing the Nvidia backend) due to this issue as well, particularly when we are running parallel OpenCL/oneAPI GPU processes. Especially with the Nvidia backend, I think the behavior should match the native Nvidia in-order queue/single-stream behavior by default.

Is there currently a control when creating an engine/stream/queue to force in-order queue behavior? This would be a very helpful switch to have so we're not stuck with the current out-of-order queue/multi-stream behavior.

With the out-of-order queue functionality, I agree that the synchronization overall needs to be fixed. It would be nice to be able to disable this behavior in the meantime.

densamoilov commented 1 year ago

Hi @chrisreedge, oneDNN provides a way to create different streams using stream flags. Previously, oneDNN ignored the flags for the NVIDIA backend and always created an out-of-order stream. That was recently fixed in https://github.com/oneapi-src/oneDNN/commit/615236ce07261b622c5fe10de91e2832c4c4e985.

Feel free to ask more questions in the oneDNN repository: https://github.com/oneapi-src/oneDNN.