Open airlied opened 4 years ago
Just wanted to add: I think the SetKernelArg interface is a holdover from when you had a context, and it should be more of a thing you enqueue on the cmdlist rather than a context-state-holding API.
Good find. There are a couple of directions we could consider:
Any additional feedback is welcome.
I'm trying to think like an explicit API here. One thing I feel is quite implicit is the hidden mapping between input arguments and what ends up in essentially a constant buffer that is built inside the driver. (I'm not aware of any hw which doesn't end up using something like push constants or a UBO for this).
In my mind that makes me lean towards giving the app side enough information to build its own constant buffer and fill it out itself. However, normally when you are recording command buffers you might have an ancillary data buffer alongside it, and you could possibly queue this information in there as an optimisation. In which case I'd add:
All the calls below (zeKernelSetAttribute, zeKernelGetAttribute, zeKernelSetArgumentValue, zeKernelSetGroupSize) act on that object? Then pass that object to the AppendLaunchKernel APIs.
I've also had some worse ideas:
If we think of it from what is the driver doing here pov, it's just filling out a buffer, do we want to associate the buffer with the kernel, or do we want to associate with the subsequent dispatch.
Like
On your list of options:
Seems like a bad idea. Implicit locks are the enemy of the explicit API.
Yes, I'm not sure here either. You could have a CommandListBindKernel and then have the APIs work at the command-list level, but again this sets off my implicit state in explicit API alarm bells.
One thing I feel is quite implicit is the hidden mapping between input arguments and what ends up in essentially a constant buffer that is built inside the driver.
Indeed. We need to consider this.
I'm not aware of any hw which doesn't end up using something like push constants or a UBO for this
Neither am I. Typically it is some combination of the two, as determined by device-specific driver/compiler heuristics.
you might have an ancillary data buffer alongside it
It doesn't need to be in the same buffer that the application provides. If the app provides a host-accessible buffer, the driver can always copy the contents to wherever it wants for HW delivery.
1.
implicit locks are the enemy of the explicit API
Precisely.
2.
this sets off my implicit state in explicit API alarm bells
If we went down this path, then I'm wondering whether we really need to change each argument individually or whether it could be a single call, like zeCommandListAppendKernelArgs( ze_command_list_handle_t, ze_kernel_handle_t, size_t size, void* args ); that also disallows inheritance. Similarly, zeCommandListAppendKernelGroupSize. This feels like the "chicken out" option.
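To make the single-call idea concrete, here is a minimal sketch of the semantics it implies: the argument blob is copied into the command list at append time, so nothing is inherited from, or left behind in, the kernel object. Everything here is a hypothetical stand-in (mock_cmdlist_t, mock_append_kernel_args, the fixed capacity); it is not the Level Zero API, and zeCommandListAppendKernelArgs itself is only a proposal in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CMDLIST_CAPACITY 256 /* arbitrary size chosen for this sketch */

/* Hypothetical stand-in for a command list: a flat byte stream. */
typedef struct {
    uint8_t data[CMDLIST_CAPACITY];
    size_t  used;
} mock_cmdlist_t;

/* Append an argument blob by value.
 * Returns the offset of the copy inside the list, or (size_t)-1 if the
 * blob does not fit. The caller may reuse or modify `args` afterwards
 * without affecting what was already recorded. */
static size_t mock_append_kernel_args(mock_cmdlist_t *cl,
                                      const void *args, size_t size)
{
    assert(cl != NULL && args != NULL);
    if (size > CMDLIST_CAPACITY - cl->used)
        return (size_t)-1;
    size_t at = cl->used;
    memcpy(cl->data + at, args, size); /* snapshot taken here */
    cl->used += size;
    return at;
}
```

Because each append takes its own snapshot, two launches of the same kernel with different arguments never share mutable state, which is the property the "no inheritance" remark is after.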
3.
I suppose it depends on how attached the userbase is to the idea of having kernel entrypoints look and operate like C functions.
Very.
4.
Create another object binding kernels to command buffers
We originally defined a kernel argument object but decided it was too heavyweight for transient usage.
that would involve handing back some mapping between kernel argument and buffer offset for the app side to put things in the right places
I need to do some digging to see what we would be giving up.
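As a rough illustration of the "mapping between kernel argument and buffer offset" idea, the sketch below assumes the driver could hand back one (offset, size) pair per argument and the application fills its own constant buffer from that. The arg_slot_t type and pack_arg helper are hypothetical names invented for this example; nothing like them exists in the spec today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: one entry per kernel argument, as a driver might report it. */
typedef struct {
    size_t offset; /* byte offset of the argument inside the constant buffer */
    size_t size;   /* size of the argument in bytes */
} arg_slot_t;

/* Copy one argument value into an application-owned constant buffer.
 * Returns 0 on success, or -1 if the value size does not match the slot
 * or the slot does not fit inside the buffer. */
static int pack_arg(uint8_t *buf, size_t buf_size, const arg_slot_t *slot,
                    const void *value, size_t value_size)
{
    assert(buf != NULL && slot != NULL && value != NULL);
    if (value_size != slot->size)
        return -1;
    if (slot->offset > buf_size || slot->size > buf_size - slot->offset)
        return -1;
    memcpy(buf + slot->offset, value, value_size);
    return 0;
}
```

This only covers plain-data arguments. Images and other opaque handles would need the driver to expose the value it writes into the buffer, which is the point raised in the next quote.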
unless you can ask for the value the driver is going to put in the buffer for the image
That would definitely be more explicit. Furthermore, we may have to expose the descriptor heaps also.
If we think of it from what is the driver doing here pov, it's just filling out a buffer, do we want to associate the buffer with the kernel, or do we want to associate with the subsequent dispatch.
The kernel just defines the layout. The arguments belong to the command list.
The spec doesn't seem to say this explicitly, but kernel objects as currently defined should be generated from modules per command-recording thread.
I wonder if it would be better to add a way to just record a kernel to a command list, set the group size and inputs for it, and dispatch it. This seems more thread safe and would allow sharing kernel objects across threads if all you wanted to do was record a command stream with different inputs or group sizes.
Otherwise you'd have to create a new kernel object per command stream. I suppose it depends on whether the API envisages kernel objects as being light to construct or as something that will trigger a compiler backend pass.
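A minimal sketch of the "record a kernel with its group size into the command stream" shape, assuming the kernel object is treated as read-only and all per-dispatch state lives in the stream. All names (mock_kernel_t, mock_stream_t, mock_record_dispatch) are hypothetical and only model the data flow. No threads are spawned here; the point is that the kernel is only ever read through a const pointer, so each recording thread could own its stream and share the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DISPATCHES 8 /* arbitrary limit for this sketch */

/* Hypothetical immutable kernel description (never written after creation). */
typedef struct {
    const char *name;
    size_t      arg_bytes;
} mock_kernel_t;

/* Per-dispatch state, owned by the command stream rather than the kernel. */
typedef struct {
    const mock_kernel_t *kernel;
    uint32_t group_size[3];
    uint32_t group_count[3];
} mock_dispatch_t;

/* Hypothetical command stream; one per recording thread. */
typedef struct {
    mock_dispatch_t dispatches[MAX_DISPATCHES];
    size_t          count;
} mock_stream_t;

/* Record one dispatch. Returns 0 on success, -1 if the stream is full. */
static int mock_record_dispatch(mock_stream_t *s, const mock_kernel_t *k,
                                const uint32_t group_size[3],
                                const uint32_t group_count[3])
{
    assert(s != NULL && k != NULL && group_size != NULL && group_count != NULL);
    if (s->count == MAX_DISPATCHES)
        return -1;
    mock_dispatch_t *d = &s->dispatches[s->count++];
    d->kernel = k;
    for (int i = 0; i < 3; ++i) {
        d->group_size[i]  = group_size[i];
        d->group_count[i] = group_count[i];
    }
    return 0;
}
```

With this split, the cost of creating a kernel object (cheap handle versus compiler backend pass) stops mattering for multi-threaded recording, since one kernel can serve every stream.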