oneapi-src / level-zero

oneAPI Level Zero Specification Headers and Loader
https://spec.oneapi.com/versions/latest/elements/l0/source/index.html
MIT License

kernel arguments and group size thread safety recommendations #21

Open airlied opened 4 years ago

airlied commented 4 years ago

The spec doesn't seem to say this explicitly, but kernel objects as currently defined should be generated from modules per command-recording thread.

I wonder if it would be better to add a way to just record a kernel to a command list, set its group size and inputs, and dispatch it. This seems more thread safe and would allow for sharing kernel objects across threads if all you wanted to do was record a command stream with different inputs or group sizes.

Otherwise you'd have to create a new kernel object per command stream. I suppose it depends on whether the API usage envisages kernel objects as being light to construct or as something that will trigger a compiler backend pass.
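To make the hazard concrete, here is a minimal sketch of the problem described above. Every name in it (`mock_kernel`, `mock_cmdlist`, and the `mock_*` functions) is a hypothetical stand-in, not the real Level Zero API, and the two "threads" are replayed as one fixed interleaving in a single thread so the outcome is deterministic. It only assumes what the thread says: arguments and group size are stored on the kernel object and picked up when the launch is appended.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; NOT the real Level Zero types. */
#define MOCK_MAX_ARGS 4
#define MOCK_MAX_LAUNCHES 8

typedef struct {
    uint32_t group_size_x;            /* mutable state stored on the kernel */
    uint64_t args[MOCK_MAX_ARGS];     /* mutable argument slots */
} mock_kernel;

typedef struct {
    uint32_t group_size_x;
    uint64_t args[MOCK_MAX_ARGS];
} mock_launch;                        /* snapshot taken at append time */

typedef struct {
    mock_launch launches[MOCK_MAX_LAUNCHES];
    int count;
} mock_cmdlist;

static void mock_kernel_set_group_size(mock_kernel *k, uint32_t x) {
    k->group_size_x = x;
}

static void mock_kernel_set_arg(mock_kernel *k, int index, uint64_t value) {
    assert(index >= 0 && index < MOCK_MAX_ARGS);
    k->args[index] = value;
}

/* Appending copies whatever state the kernel holds at this moment. */
static void mock_cmdlist_append_launch(mock_cmdlist *cl, const mock_kernel *k) {
    assert(cl->count < MOCK_MAX_LAUNCHES);
    mock_launch *l = &cl->launches[cl->count++];
    l->group_size_x = k->group_size_x;
    memcpy(l->args, k->args, sizeof l->args);
}

/* Replays one unlucky interleaving of two recorders sharing one kernel.
 * Returns the argument value that recorder A actually captured. */
static uint64_t mock_shared_kernel_race(void) {
    mock_kernel shared;
    mock_cmdlist list_a, list_b;
    memset(&shared, 0, sizeof shared);
    memset(&list_a, 0, sizeof list_a);
    memset(&list_b, 0, sizeof list_b);

    mock_kernel_set_group_size(&shared, 64);
    mock_kernel_set_arg(&shared, 0, 111);          /* A sets its input      */
    mock_kernel_set_arg(&shared, 0, 222);          /* B overwrites the slot */
    mock_cmdlist_append_launch(&list_a, &shared);  /* A records B's value   */
    mock_cmdlist_append_launch(&list_b, &shared);
    return list_a.launches[0].args[0];
}
```

In this interleaving recorder A intended to launch with 111 but its command list captures 222, which is why a shared kernel object needs either external locking or one kernel object per recording thread.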

airlied commented 4 years ago

Just wanted to add: I think the SetKernelArg interface is a holdover from when you had a context, and it should be more of a thing you enqueue on the cmdlist rather than a context-state-holding API.

gatorfax commented 4 years ago

Good find. There are a couple of directions we could consider:

  1. SetKernelArgs on the current active context - need implicit locks to be thread-safe
  2. SetKernelArgs on the command list - need to deal with inheritance
  3. Introduce something analogous to descriptor sets, layouts, pools - overkill?

Any additional feedback is welcome.

airlied commented 4 years ago

I'm trying to think like an explicit API here. One thing I feel is quite implicit is the hidden mapping between input arguments and what ends up in essentially a constant buffer that is built inside the driver. (I'm not aware of any hw which doesn't end up using something like push constants or a UBO for this).

In my mind that makes me lean toward giving the app side enough information to build its own constant buffer and fill it out itself. However, normally when you are recording command buffers you might have an ancillary data buffer alongside it, and you could possibly queue this information in there as an optimisation. In which case I'd add:

  1. Create another object binding kernels to command buffers, ze_kernel_exec_handle_t, created from a kernel object.

Would the calls zeKernelSetAttribute, zeKernelGetAttribute, zeKernelSetArgumentValue, and zeKernelSetGroupSize then all act on that object? You would then pass that object to the AppendLaunchKernel APIs.
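A rough sketch of how such an execution object could behave, assuming it simply carries the per-launch state that currently lives on the kernel. Only the name `ze_kernel_exec_handle_t` comes from the comment above; the `sketch_*` types and functions are hypothetical and do not exist in Level Zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ARGS 4
#define SKETCH_MAX_LAUNCHES 8

/* Immutable after creation, so it can be shared across threads. */
typedef struct { const char *name; } sketch_kernel;

/* Hypothetical analogue of ze_kernel_exec_handle_t: per-recorder state. */
typedef struct {
    const sketch_kernel *kernel;
    uint32_t group_size[3];
    uint64_t args[SKETCH_MAX_ARGS];
} sketch_kernel_exec;

typedef struct {
    sketch_kernel_exec launches[SKETCH_MAX_LAUNCHES];
    int count;
} sketch_cmdlist;

static sketch_kernel_exec sketch_kernel_exec_create(const sketch_kernel *k) {
    sketch_kernel_exec e;
    memset(&e, 0, sizeof e);
    e.kernel = k;
    e.group_size[0] = e.group_size[1] = e.group_size[2] = 1;
    return e;
}

static void sketch_exec_set_arg(sketch_kernel_exec *e, int index, uint64_t v) {
    assert(index >= 0 && index < SKETCH_MAX_ARGS);
    e->args[index] = v;
}

static void sketch_exec_set_group_size(sketch_kernel_exec *e,
                                       uint32_t x, uint32_t y, uint32_t z) {
    e->group_size[0] = x;
    e->group_size[1] = y;
    e->group_size[2] = z;
}

/* The launch takes the exec object, not the kernel, and snapshots it. */
static void sketch_cmdlist_append_launch(sketch_cmdlist *cl,
                                         const sketch_kernel_exec *e) {
    assert(cl->count < SKETCH_MAX_LAUNCHES);
    cl->launches[cl->count++] = *e;
}

/* Two recorders share one kernel but own separate exec objects. */
static void sketch_exec_demo(uint64_t *recorded_a, uint64_t *recorded_b) {
    sketch_kernel kernel = { "vector_add" };
    sketch_kernel_exec exec_a = sketch_kernel_exec_create(&kernel);
    sketch_kernel_exec exec_b = sketch_kernel_exec_create(&kernel);
    sketch_cmdlist list_a, list_b;
    memset(&list_a, 0, sizeof list_a);
    memset(&list_b, 0, sizeof list_b);

    /* Same interleaving that corrupts a shared kernel; harmless here. */
    sketch_exec_set_arg(&exec_a, 0, 111);
    sketch_exec_set_arg(&exec_b, 0, 222);
    sketch_exec_set_group_size(&exec_a, 64, 1, 1);
    sketch_exec_set_group_size(&exec_b, 32, 1, 1);
    sketch_cmdlist_append_launch(&list_a, &exec_a);
    sketch_cmdlist_append_launch(&list_b, &exec_b);

    *recorded_a = list_a.launches[0].args[0];
    *recorded_b = list_b.launches[0].args[0];
}
```

Because all mutable state sits on the exec object, the kernel itself never changes after creation, and each command list records the values its own recorder set.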

I've also had some worse ideas:

  1. Be very explicit and let the app side build the kernel argument buffer, though I think that would involve handing back some mapping between kernel argument and buffer offset for the app side to put things in the right places, and maybe for opaque things like images that can't really work unless you can ask for the value the driver is going to put in the buffer for the image, rather than storing the image handle.
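As a sketch of this more explicit variant: assume the driver can report, per argument index, an offset and size within the argument buffer, and the app copies values into place itself. The layout table, the query function `sketch_kernel_get_arg_slot`, and the example kernel signature are all hypothetical; nothing like this query exists in the API discussed here, and opaque arguments such as images are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    size_t offset;   /* byte offset of the argument in the buffer */
    size_t size;     /* byte size the driver expects */
} sketch_arg_slot;

enum { SKETCH_ARG_COUNT = 2, SKETCH_ARG_BUFFER_SIZE = 16 };

/* Hypothetical layout a driver might report for
 * kernel(float *dst, uint32_t n). */
static const sketch_arg_slot sketch_layout[SKETCH_ARG_COUNT] = {
    { 0, sizeof(uint64_t) },   /* arg 0: device pointer */
    { 8, sizeof(uint32_t) },   /* arg 1: element count  */
};

/* Hypothetical query: where does argument `index` live? */
static sketch_arg_slot sketch_kernel_get_arg_slot(int index) {
    assert(index >= 0 && index < SKETCH_ARG_COUNT);
    return sketch_layout[index];
}

/* App-side helper: copy one value into the slot the driver reported. */
static void sketch_write_arg(unsigned char *buffer, size_t buffer_size,
                             int index, const void *value, size_t value_size) {
    sketch_arg_slot slot = sketch_kernel_get_arg_slot(index);
    assert(value_size == slot.size);
    assert(slot.offset + slot.size <= buffer_size);
    memcpy(buffer + slot.offset, value, value_size);
}

/* Fills an app-owned buffer, then reads the count back out of it. */
static uint32_t sketch_fill_and_read_count(void) {
    unsigned char buffer[SKETCH_ARG_BUFFER_SIZE];
    uint64_t dst = 0x1000;   /* placeholder device address */
    uint32_t n = 256;
    uint32_t readback = 0;

    memset(buffer, 0, sizeof buffer);
    sketch_write_arg(buffer, sizeof buffer, 0, &dst, sizeof dst);
    sketch_write_arg(buffer, sizeof buffer, 1, &n, sizeof n);
    memcpy(&readback, buffer + sketch_kernel_get_arg_slot(1).offset,
           sizeof readback);
    return readback;
}
```

The buffer here belongs to the app, so each recording thread can fill its own copy without touching any shared kernel state.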

If we think of it from what is the driver doing here pov, it's just filling out a buffer, do we want to associate the buffer with the kernel, or do we want to associate with the subsequent dispatch.

Like

airlied commented 4 years ago

On your list of options:

  1. SetKernelArgs on the current active context - need implicit locks to be thread-safe

Seems like a bad idea. Implicit locks are the enemy of the explicit API.

  2. SetKernelArgs on the command list - need to deal with inheritance

Yes, not sure here either. You could have a CommandListBindKernel and then have the APIs work at the command-list level, but again this sets off my implicit state in explicit API alarm bells.

  3. Introduce something analogous to descriptor sets, layouts, pools - overkill?

I started thinking this might be useful, but it's a big thought process to work out how to map something like descriptor sets/layouts/pools to the CL/Lvl0 kernel argument passing semantics, since essentially the launch state is a UBO that just has pointers to all the arguments, and we have args like pointers and constant values that don't really map into the descriptor sets world. I suppose it depends on how attached the userbase is to the idea of having kernel entrypoints look and operate like C functions.

gatorfax commented 4 years ago

One thing I feel is quite implicit is the hidden mapping between input arguments and what ends up in essentially a constant buffer that is built inside the driver.

Indeed. We need to consider this.

I'm not aware of any hw which doesn't end up using something like push constants or a UBO for this

Neither am I. Typically it is some combination of the two, as determined by device-specific driver/compiler heuristics.

you might have an ancillary data buffer alongside it

It doesn't need to be in the same buffer that the application provides. And if the app provides a Host accessible buffer, then the driver can always copy the contents to whatever it wants for HW delivery.

1.

implicit locks are the enemy of the explicit API

Precisely.

2.

this sets off my implicit state in explicit API alarm bells

If we went down this path, then I wonder whether we really need to change each argument individually, or whether it could be a single call such as zeCommandListAppendKernelArgs( ze_command_list_handle_t, ze_kernel_handle_t, size_t size, void* args ); that also disallows inheritance. Similarly for zeCommandListAppendKernelGroupSize. This feels like the "chicken out" option.
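One way to read the single-call idea, sketched with hypothetical `sketch_*` names rather than real entry points: the command list copies the whole argument blob into its own storage, and a launch consumes it. The rule that a launch without freshly appended arguments fails is my assumption about what "disallows inheritance" would mean in practice; the comment above does not spell it out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SKETCH_OK = 0, SKETCH_ERR_NO_ARGS = 1, SKETCH_ERR_TOO_LARGE = 2 };

typedef struct { const char *name; } sketch_kernel;

typedef struct {
    unsigned char data[256];              /* ancillary data owned by the list */
    size_t used;
    const sketch_kernel *pending_kernel;  /* kernel whose args await a launch */
    size_t pending_offset;                /* where those args start in data   */
    int launches;
} sketch_cmdlist;

/* Modeled on the proposed zeCommandListAppendKernelArgs signature:
 * one call copies every argument at once into the command list. */
static int sketch_append_kernel_args(sketch_cmdlist *cl, const sketch_kernel *k,
                                     size_t size, const void *args) {
    assert(cl != NULL && k != NULL && args != NULL);
    if (size > sizeof cl->data - cl->used)
        return SKETCH_ERR_TOO_LARGE;
    memcpy(cl->data + cl->used, args, size);
    cl->pending_kernel = k;
    cl->pending_offset = cl->used;
    cl->used += size;
    return SKETCH_OK;
}

/* Assumption: a launch consumes the pending args, so nothing is inherited. */
static int sketch_append_launch(sketch_cmdlist *cl, const sketch_kernel *k) {
    if (cl->pending_kernel != k)
        return SKETCH_ERR_NO_ARGS;
    cl->pending_kernel = NULL;
    cl->launches++;
    return SKETCH_OK;
}

/* Appends args once, launches twice, and returns the second status. */
static int sketch_second_launch_status(void) {
    sketch_cmdlist cl;
    sketch_kernel k = { "saxpy" };
    struct { float a; unsigned n; } args = { 2.0f, 1024u };
    int status;

    memset(&cl, 0, sizeof cl);
    status = sketch_append_kernel_args(&cl, &k, sizeof args, &args);
    assert(status == SKETCH_OK);
    status = sketch_append_launch(&cl, &k);
    assert(status == SKETCH_OK);
    return sketch_append_launch(&cl, &k);  /* no fresh args appended */
}
```

Under this reading the second launch is rejected, so every dispatch has to state its arguments explicitly and no state leaks from one launch to the next.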

3.

I suppose it depends on how attached the userbase is to the idea of having kernel entrypoints look and operate like C functions.

Very.

4.

Create another object binding kernels to command buffers

We originally defined a kernel argument object but decided it was too heavyweight for transient usage.

  1. that would involve handing back some mapping between kernel argument and buffer offset for the app side to put things in the right places

I need to do some digging to see what we would be giving up.

unless you can ask for the value the driver is going to put in the buffer for the image

That would definitely be more explicit. Furthermore, we may have to expose the descriptor heaps also.

If we think of it from what is the driver doing here pov, it's just filling out a buffer, do we want to associate the buffer with the kernel, or do we want to associate with the subsequent dispatch.

The kernel just defines the layout. The arguments belong to the command list.