Open Ruberik opened 10 months ago
API Proposal
CreateEvent
method to Accelerator
class, returning new AcceleratorEvent
class.RecordEvent
and WaitForEvent
methods to AcceleratorStream
class.@m4rs-mt what do you think?
Alternate API Proposal
API Proposal
AddEvent
method to AcceleratorStream
class, returning new AcceleratorEvent
class.WaitForEvent
method to AcceleratorStream
class.This removes the ability to re-record the event.
Alternate API Proposal 2
Instead of exposing a new object, introduce an WaitForStream
method between two streams.
Not sure how this will affect the lifetime of the native handles.
Both your API proposal and your first Alternate API Proposal seem reasonable. The first more closely mimics the underlying API, while the second strikes me as harder to screw up.
The "screw up" I'm thinking of is that when an event is created, I believe the only reasonable action to take with that event is to call RecordEvent() on it (that may, in fact, be the only action that doesn't produce an error). It thus makes some sense to combine creation with the call to RecordEvent(); but I'd be hesitant to make impossible something that's possible in the underlying API, even if most use cases are invalid.
Here's the "screw up" I'm imagining. Imagine a scenario with 8 streams, each of which has three tasks to complete, in order: A, B and C. No stream can start C until every stream's A is done.
It's natural to want to write code like this, but it fails:
var taskAFinishedEvents = Enumerable.Range(0, 8).Select(_ => acc.CreateEvent()).ToArray();
for (int i = 0; i < 8; i++) {
DoTaskA(streams[i]);
streams[i].RecordEvent(taskAFinishedEvents[i]);
DoTaskB(streams[i]);
for (int j = 0; j < 8; j++) {
if (j != i) stream[i].WaitForEvent(taskAFinishedEvents[j]);
}
DoTaskC(streams[i]);
}
The correct code would look like this for your first API proposal:
var taskAFinishedEvents = new AcceleratorEvent[8];
for (int i = 0; i < 8; i++) {
DoTaskA(streams[i]);
taskAFinishedEvents[i] = acc.CreateEvent();
streams[i].RecordEvent(taskAFinishedEvents[i]);
}
for (int i = 0; i < 8; i++) {
DoTaskB(streams[i]);
for (int j = 0; j < 8; j++) {
if (j != i) stream[i].WaitForEvent(taskAFinishedEvents[j]);
}
DoTaskC(streams[i]);
}
The second proposal (Alternate API proposal) would simply replace two lines with taskAFinishedEvents[i] = streams[i].AddEvent();
I don't think I understand the third proposal (Alternate API proposal 2) well enough to know how it would solve this problem.
Thanks for the feedback @Ruberik.
The first alternate proposal was to hide the concept of re-recording an event. Cuda supports it, but I'm not sure about OpenCL.
The second alternate proposal was to hide the concept of an AcceleratorEvent
, since the only reason to have it is to establish a dependency between two streams. This places the burden of managing the underlying native objects on ILGPU.
In your example, it would look more like:
for (int i = 0; i < 8; i++) {
DoTaskA(streams[i]);
for (int j = 0; j < 8; j++) {
if (j != i) stream[i].WaitForStream(stream[j]);
}
}
for (int i = 0; i < 8; i++) {
DoTaskB(streams[i]);
DoTaskC(streams[i]);
}
@MoFtZ, I think that does something different from my code. In my example, a stream finishes A, then works on B, and only then must wait until other streams are done with A. In your code here, nobody can start B until everybody's done A.
BTW, I think there's a problem with that code. Don't you need to enqueue all the As before you can have them wait on each other, like this?
for (int i = 0; i < 8; i++) {
DoTaskA(streams[i]);
}
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
if (j != i) stream[i].WaitForStream(stream[j]);
}
}
for (int i = 0; i < 8; i++) {
DoTaskB(streams[i]);
DoTaskC(streams[i]);
}
You need three loops here, because you have to have enqueued A, but not B or C, when you call WaitForStream.
Yes, you are correct. Perhaps that proposal is too simplified.
It would need to change the meaning from "wait for all Task A to finish before starting C" to "synchronise the start of all Task C".
As my teammates and I move to multiple, big GPUs with many streams doing interdependent work simultaneously, synchronizing at the GPU level is introducing significant amounts of inefficiency. We'd like to be able to use cudaStreamWaitEvent to make a stream wait on an event. There's already RecordEvent, CreateEvent, DestroyEvent and QueryEvent, so I'm hoping this is a straightforward addition.