
oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

[nvidia|amd] Add missing synchronization #1732

Open densamoilov opened 1 year ago

densamoilov commented 1 year ago

Currently, if we call an asynchronous API within a host task, the event that is tied to the host task completes when the host task returns, rather than when the operation submitted by the asynchronous API call completes. This is the root cause of issues like #1703.

In order to fix the issue we need to do proper synchronization within the host task. For example:

    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(handle, ...);
            // Only the stream associated with the handle has to be synchronized.
            cudaStream_t currentStreamId;
            cudnnGetStream(handle, &currentStreamId);
            cudaStreamSynchronize(currentStreamId);
        });
    });
    e.wait(); // The event completes when the `cudnnAddTensor` operation completes.

This fix needs to be implemented for both the cuDNN/cuBLAS-based and the MIOpen/rocBLAS-based implementations (the latter using the HIP counterparts, e.g. hipStreamSynchronize), as sketched below.
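For the MIOpen path, the synchronization step inside the host task would look roughly like this (a sketch only; `handle` is assumed to be a MIOpen handle here, with `miopenOpTensor` standing in for any asynchronous MIOpen/rocBLAS call):

    cgh.host_task([=](const sycl::interop_handle &ih) {
        miopenOpTensor(handle, ...);
        // Only the stream associated with the handle has to be synchronized.
        hipStream_t currentStreamId;
        miopenGetStream(handle, &currentStreamId);
        hipStreamSynchronize(currentStreamId);
    });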

We would also need to document a limitation that comes with this synchronization: between the asynchronous API call and the cudaStreamSynchronize/hipStreamSynchronize call, another thread could submit work to the same stream. In that case the submitted host task completes only when that other work completes, which might even lead to a deadlock.
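Schematically (not oneDNN code; `S` stands for the stream associated with the handle, and the `cudaMemcpyAsync` call is just an arbitrary example of concurrent work):

    cudnnAddTensor(handle, ...);   // thread 1: enqueues work W1 on stream S
    // cudaMemcpyAsync(..., S);    // thread 2 interleaves: enqueues unrelated work W2 on S
    cudaStreamSynchronize(S);      // thread 1: now waits for W1 *and* W2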

AerialMantis commented 1 year ago

That's right: under the current SYCL 2020 specification, the approach described here is the only correct way to guarantee that the operations enqueued within the host task are synchronized with. However, it will likely impact performance. We are working on a SYCL extension to resolve this limitation.

In the meantime, if you wish to achieve better performance, there is another option that works for DPC++, but be aware that it only works under certain conditions and is not guaranteed to work in other SYCL implementations. If the queue is in-order and the same queue is used for all host task submissions, then the same underlying CUDA stream will always be used, so the ordering is forced to be in-order and no synchronization is required within the host tasks. However, if the queue may also be out-of-order, there is no more optimal way to handle this, so you may need to check for this case to know whether explicit synchronization is required. A sketch of the pattern follows.
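A minimal sketch of that pattern, assuming DPC++'s CUDA backend (where an in-order queue maps onto a single underlying CUstream, which `interop_handle::get_native_queue` returns) and the same cuDNN `handle` as in the snippet above:

    sycl::queue q{sycl::gpu_selector_v, sycl::property::queue::in_order{}};

    q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            // Bind the handle to the queue's stream; asynchronous, no sync here.
            cudnnSetStream(handle, ih.get_native_queue<sycl::backend::ext_oneapi_cuda>());
            cudnnAddTensor(handle, ...);
        });
    });

    q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            // Same queue, same stream: ordered after the first call with no sync.
            cudnnAddTensor(handle, ...);
        });
    });

    // A single blocking synchronization is still needed before the results
    // are consumed on the host.
    q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudaStream_t currentStreamId;
            cudnnGetStream(handle, &currentStreamId);
            cudaStreamSynchronize(currentStreamId);
        });
    }).wait();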

densamoilov commented 1 year ago

@AerialMantis, thanks for the suggestion. Even with in-order queues there is still a synchronization problem when there are multiple such queues and we submit operations that depend on each other to them. Though I don't know whether that use case is common.

As for the performance implications, do you know how significant the impact would be, and where it would come from? Is it because we would lose the amortization of the cost of the asynchronous API calls, since we would have to wait for the previous call to complete before making the next one?

AerialMantis commented 6 months ago

@densamoilov apologies for the late reply, I hadn't seen your response. That's right, this solution would only work in the case of a single in-order queue, and since it relies on implementation details I would avoid it in general.

The performance impact I described comes from every asynchronous native command enqueued within the host task function being synchronized with before the function returns, which effectively makes those commands blocking.

There is now an extension which addresses this limitation by extending the host task interface: native events can be passed into the host task function, and native events can be propagated out and encapsulated in the SYCL event returned by submit, allowing the host task function to enqueue asynchronous commands without blocking.
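The comment does not name the extension; presumably it is an interface along the lines of DPC++'s `ext_codeplay_enqueue_native_command` (an assumption here). With such an interface the original example could be written roughly as:

    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.ext_codeplay_enqueue_native_command([=](sycl::interop_handle ih) {
            // The native work enqueued here is captured asynchronously in `e`;
            // no blocking cudaStreamSynchronize is required inside the lambda.
            cudnnAddTensor(handle, ...);
        });
    });
    e.wait(); // Completes when the asynchronously enqueued native work completes.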