Feature request: cudaStreamWaitEvent

Ruberik commented 10 months ago

As my teammates and I move to multiple, big GPUs with many streams doing interdependent work simultaneously, synchronizing at the GPU level is introducing significant amounts of inefficiency. We'd like to be able to use cudaStreamWaitEvent to make a stream wait on an event. There's already RecordEvent, CreateEvent, DestroyEvent and QueryEvent, so I'm hoping this is a straightforward addition.

MoFtZ commented 10 months ago

API Proposal

Add CreateEvent method to Accelerator class, returning new AcceleratorEvent class.
Add RecordEvent and WaitForEvent methods to AcceleratorStream class.

@m4rs-mt what do you think?

MoFtZ commented 10 months ago

Alternate API Proposal

API Proposal

Add AddEvent method to AcceleratorStream class, returning new AcceleratorEvent class.
AddWaitForEvent method to AcceleratorStream class.

This removes the ability to re-record the event.

MoFtZ commented 10 months ago

Alternate API Proposal 2

Instead of exposing a new object, introduce an WaitForStream method between two streams.

Not sure how this will affect the lifetime of the native handles.

Ruberik commented 10 months ago

Both your API proposal and your first Alternate API Proposal seem reasonable. The first more closely mimics the underlying API, while the second strikes me as harder to screw up.

The "screw up" I'm thinking of is that when an event is created, I believe the only reasonable action to take with that event is to call RecordEvent() on it (that may, in fact, be the only action that doesn't produce an error). It thus makes some sense to combine creation with the call to RecordEvent(); but I'd be hesitant to make impossible something that's possible in the underlying API, even if most use cases are invalid.

Here's the "screw up" I'm imagining. Imagine a scenario with 8 streams, each of which has three tasks to complete, in order: A, B and C. No stream can start C until every stream's A is done.

It's natural to want to write code like this, but it fails:

var taskAFinishedEvents = Enumerable.Range(0, 8).Select(_ => acc.CreateEvent()).ToArray();
for (int i = 0; i < 8; i++) {
  DoTaskA(streams[i]);
  streams[i].RecordEvent(taskAFinishedEvents[i]);
  DoTaskB(streams[i]);
  for (int j = 0; j < 8; j++) {
    if (j != i) stream[i].WaitForEvent(taskAFinishedEvents[j]);
  }
  DoTaskC(streams[i]);
}

The correct code would look like this for your first API proposal:

var taskAFinishedEvents = new AcceleratorEvent[8];
for (int i = 0; i < 8; i++) {
  DoTaskA(streams[i]);
  taskAFinishedEvents[i] = acc.CreateEvent();
  streams[i].RecordEvent(taskAFinishedEvents[i]);
}
for (int i = 0; i < 8; i++) {
  DoTaskB(streams[i]);
  for (int j = 0; j < 8; j++) {
    if (j != i) stream[i].WaitForEvent(taskAFinishedEvents[j]);
  }
  DoTaskC(streams[i]);
}

The second proposal (Alternate API proposal) would simply replace two lines with taskAFinishedEvents[i] = streams[i].AddEvent();

I don't think I understand the third proposal (Alternate API proposal 2) well enough to know how it would solve this problem.

MoFtZ commented 10 months ago

Thanks for the feedback @Ruberik.

The first alternate proposal was to hide the concept of re-recording an event. Cuda supports it, but I'm not sure about OpenCL.

The second alternate proposal was to hide the concept of an AcceleratorEvent, since the only reason to have it is to establish a dependency between two streams. This places the burden of managing the underlying native objects on ILGPU.

In your example, it would look more like:

for (int i = 0; i < 8; i++) {
  DoTaskA(streams[i]);
  for (int j = 0; j < 8; j++) {
    if (j != i) stream[i].WaitForStream(stream[j]);
  }
}
for (int i = 0; i < 8; i++) {
  DoTaskB(streams[i]);
  DoTaskC(streams[i]);
}

Ruberik commented 10 months ago

@MoFtZ, I think that does something different from my code. In my example, a stream finishes A, then works on B, and only then must wait until other streams are done with A. In your code here, nobody can start B until everybody's done A.

BTW, I think there's a problem with that code. Don't you need to enqueue all the As before you can have them wait on each other, like this?

for (int i = 0; i < 8; i++) {
  DoTaskA(streams[i]);
}
for (int i = 0; i < 8; i++) {
  for (int j = 0; j < 8; j++) {
    if (j != i) stream[i].WaitForStream(stream[j]);
  }
}
for (int i = 0; i < 8; i++) {
  DoTaskB(streams[i]);
  DoTaskC(streams[i]);
}

You need three loops here, because you have to have enqueued A, but not B or C, when you call WaitForStream.

MoFtZ commented 10 months ago

Yes, you are correct. Perhaps that proposal is too simplified.

It would need to change the meaning from "wait for all Task A to finish before starting C" to "synchronise the start of all Task C".

m4rs-mt / ILGPU

Feature request: cudaStreamWaitEvent #1139