We can get a pretty decent perf win by batching tasks that only depend on each other into the same D3D command list. Previously we followed a strict state machine model where a task went through the following stages:
1. Queued (submitted to a queue)
2. Submitted to the device (the queue was flushed)
3. Ready (all dependencies have completed)
4. Running (recorded into a command list and submitted to the device)
5. Complete (the CPU has been notified that the work is done)
This change adjusts the model slightly, modifying stage 3: a task can now move into the ready state as long as all of the tasks it depends on are on the same device and are themselves ready. We then batch larger groups of ready tasks (e.g. an entire flushed command queue) into one command list, instead of doing a full CPU -> GPU -> CPU round trip for each independent work item from an in-order queue.
Technically, this violates the OpenCL API spec for events:
CL_RUNNING: Indicates that the device has started executing this command. In order for the execution status of an enqueued command to change from CL_SUBMITTED to CL_RUNNING, all events that this command is waiting on must have completed successfully i.e. their execution status must be CL_COMPLETE.
We'll end up marking an event as CL_RUNNING even if its dependencies are also only CL_RUNNING and not CL_COMPLETE, because we don't have fine-grained tracking of work within a command list, so we can't be notified as the GPU moves from one task to the next if they were batched together. The other option would be to defer marking any event as running, and only upon completion, iterate through all events and mark them in sequence as running and then complete, but that seems even worse. Regardless, the CL CTS for events didn't seem to mind this behavior.
There are also a couple of bugfixes in here for a use-after-free/leak and a multi-device regression.