halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

GPU runtimes should have more buffer information when allocating textures #1059

Open bleibig opened 8 years ago

bleibig commented 8 years ago

Right now, the OpenCL + Textures runtime (in the gpu_texture_api_variants branch) uses potentially inaccurate values for image formats and image descriptors when allocating images on the CL device. For instance, for the channel type it simply picks the unsigned integer type that matches elem_size. This will improve with the better type information in the new halide_buffer_t, but even that won't be enough to specify the full range of OpenCL channel types. Specifically, once we have sampling (#1021), we will want to use the CL_SNORM_x/CL_UNORM_x channel types for images used with linear filtering.

The runtime also uses CL_R as the channel order, which means read_image and write_image calls will only load and store one pixel at a time. If the image is in packed RGBA format with the color dimension vectorized, the runtime should be able to set the channel order to CL_RGBA, with the code generator emitting read_image/write_image calls that read and store 4 values at a time. But the runtime does not have enough information to know whether it's safe to create an image with CL_RGBA channel order instead of CL_R. In principle, other channel orders like CL_RG and CL_RGB could also be supported for channel dimension extents of 2 or 3.

The general issue is that the compiler needs some way to communicate this extra information about how to create images to the runtime, in halide_opencl_textures_device_malloc. Possible solutions are to make this data part of the new halide_buffer_t, or to pass it in as an extra parameter to halide_device_malloc.
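To make the gap concrete, here is a minimal sketch of the decision the runtime would like to make but currently can't. The texture_alloc_info struct and its fields are hypothetical, standing in for whatever richer information the compiler would need to pass down:

```cpp
#include <CL/cl.h>

// Hypothetical bundle of extra allocation information; today the runtime
// effectively only has elem_size (and, with halide_buffer_t, the element
// type) to work from.
struct texture_alloc_info {
    int elem_size;     // bytes per element, as in buffer_t
    int channels;      // extent of the vectorized color dimension, or 1
    bool is_float;     // element type is floating point
    bool normalized;   // caller wants CL_UNORM_*/CL_SNORM_* for filtering
};

static cl_image_format choose_image_format(const texture_alloc_info &info) {
    cl_image_format fmt;
    fmt.image_channel_order = (info.channels == 4) ? CL_RGBA :
                              (info.channels == 2) ? CL_RG : CL_R;
    if (info.is_float) {
        fmt.image_channel_data_type =
            (info.elem_size == 2) ? CL_HALF_FLOAT : CL_FLOAT;
    } else if (info.normalized) {
        fmt.image_channel_data_type =
            (info.elem_size == 1) ? CL_UNORM_INT8 : CL_UNORM_INT16;
    } else {
        // Roughly what the runtime does today: pick the unsigned integer
        // type that matches elem_size, losing the distinctions above.
        fmt.image_channel_data_type =
            (info.elem_size == 1) ? CL_UNSIGNED_INT8 :
            (info.elem_size == 2) ? CL_UNSIGNED_INT16 : CL_UNSIGNED_INT32;
    }
    return fmt;
}
```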

This issue has implications beyond the OpenCL runtime, since solving it would probably help other GPU runtimes such as OpenGL as well (see #118, #371, #801).

zvookin commented 8 years ago

I'd like to avoid extending Halide's buffer_t API for this, as I feel all the extra information is fairly specific to the backend GPU API/language being used. There are some concepts which could be abstracted and shared between the different GPU realms, but it seems more productive for developers to access the features specific to one GPU API using the names and concepts of that API. So for access outside of Halide, I'd prefer people to use the wrap/unwrap mechanism, making allocations and accessing the texture information through the underlying API (e.g. OpenCL) directly.
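For reference, a sketch of what the wrap route looks like from user code, assuming the hooks declared in HalideRuntimeOpenCL.h (halide_opencl_wrap_cl_mem / halide_opencl_detach_cl_mem); exact signatures differ between Halide versions, so treat this as illustrative rather than definitive:

```cpp
#include <CL/cl.h>

// Allocate a 2D RGBA8 image directly through OpenCL, with full control over
// channel order and channel type...
cl_mem make_rgba_image(cl_context ctx, size_t width, size_t height, cl_int *err) {
    cl_image_format fmt = {CL_RGBA, CL_UNORM_INT8};
    cl_image_desc desc = {};
    desc.image_type = CL_MEM_OBJECT_TYPE_IMAGE2D;
    desc.image_width = width;
    desc.image_height = height;
    return clCreateImage(ctx, CL_MEM_READ_WRITE, &fmt, &desc, NULL, err);
}

// ...then hand it to Halide rather than letting halide_device_malloc allocate:
//
//   halide_opencl_wrap_cl_mem(NULL, &buf, (uint64_t)image);
//   // ... run the pipeline ...
//   halide_opencl_detach_cl_mem(NULL, &buf);
```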

For textures managed entirely internally by Halide, the allocation can be done using a direct call into the GPU API specific runtime. This is entirely between the Halide CodeGenDev* backend for that GPU API and its runtime code so we do not need to agree on a specific way to pass the extra information.
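In other words, something along these lines, where halide_opencl_textures_create_image is a hypothetical backend-specific entry point (no such function exists today):

```cpp
#include <CL/cl.h>

struct buffer_t;  // Halide's runtime buffer struct, used only by pointer here

// Hypothetical runtime hook that CodeGen_OpenCL_Dev could emit calls to for
// internally managed allocations, carrying whatever format detail it wants
// without widening the generic halide_device_interface.
extern "C" int halide_opencl_textures_create_image(void *user_context,
                                                   buffer_t *buf,
                                                   cl_channel_order order,
                                                   cl_channel_type channel_type);
```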

The one place Halide will need to surface such things is in storage scheduling directives. We will also, of course, need a way to carry the information internally in the IR. I'd like the scheduling directives for this to be driven by concrete use cases rather than designed proactively to cover many possibilities. There is a fair bit of potential to add a lot of complexity here, and I'd only want to do that if there is a lot of performance to be gained. (Some of the more interesting possibilities are things like computing to and from vertex buffers, rather than just managing texture layouts.)

bleibig commented 8 years ago

I'm not sure what you mean regarding getting this to work with internal allocations. Right now, the allocations are done with halide_device_malloc or halide_copy_to_device calls inserted by InjectHostDevBufferCopies.cpp. How do you see CodeGen_OpenCL_Dev passing the extra information in, given that it's mainly responsible for generating the kernel source code?

bleibig commented 8 years ago

An idea I have to solve this is to implement halide_device_interface in a more object-oriented way: have it be an abstract interface type with pure virtual member functions rather than a bunch of function pointers. Each GPU API will then subclass it, implement the device interface functions, and define extra state (e.g. type info) as needed, which all device interface functions would then have access to. This means calls to halide_*_device_interface() (in e.g. HalideRuntimeOpenCL.h) can take that extra state as arguments and create a new halide_device_interface instance for those parameters, rather than just returning a singleton object.

This would probably be a breaking change, so I would like feedback on whether this is worth doing, or whether we should try something else.
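A rough sketch of the shape being proposed here, purely illustrative and not actual Halide runtime code:

```cpp
#include <CL/cl.h>

struct buffer_t;  // Halide's runtime buffer struct, used only by pointer here

// Sketch: an abstract device interface using virtual dispatch instead of a
// table of function pointers (abbreviated; the real interface has more
// entry points).
struct halide_device_interface {
    virtual int device_malloc(void *user_context, buffer_t *buf) = 0;
    virtual int device_free(void *user_context, buffer_t *buf) = 0;
    virtual int copy_to_device(void *user_context, buffer_t *buf) = 0;
    virtual int copy_to_host(void *user_context, buffer_t *buf) = 0;
    virtual ~halide_device_interface() {}
};

// The OpenCL textures backend could then carry image-format state that every
// member function sees for the buffers it manages.
struct opencl_texture_interface : halide_device_interface {
    cl_channel_order order;
    cl_channel_type type;
    opencl_texture_interface(cl_channel_order o, cl_channel_type t)
        : order(o), type(t) {}
    int device_malloc(void *user_context, buffer_t *buf) override {
        // would call clCreateImage with {order, type} here
        return 0;
    }
    int device_free(void *user_context, buffer_t *buf) override { return 0; }
    int copy_to_device(void *user_context, buffer_t *buf) override { return 0; }
    int copy_to_host(void *user_context, buffer_t *buf) override { return 0; }
};

// halide_opencl_texture_device_interface(order, type) would then construct
// and return such an instance instead of returning a singleton.
```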

zvookin commented 8 years ago

Currently the runtime does not depend on C++ method dispatch. Vtable layout varies from compiler to compiler in a way that would require quite a few more variants of the runtime. While I've gradually been adding more C++ internals knowledge to Halide, I'm not sure taking on a dependency on vtable layout is worth it. Vtables are also notoriously unstable: any method addition introduces the possibility of binary incompatibility.

However, any problem vtables can solve, a table of function pointers can solve as well. If you want to add extra information to the device interface, you are free to do so. The problem is that device interfaces are expected to be effectively static: they are not freed and are not reference counted in any way. Moreover, the pointer is used as an identifier for the interface, so you can't add any information without creating a completely distinct interface from Halide's point of view.
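For comparison, the function-pointer version of the same idea, sketched with an abbreviated interface (the real halide_device_interface in HalideRuntime.h has more entries):

```cpp
#include <CL/cl.h>

struct buffer_t;  // Halide's runtime buffer struct, used only by pointer here

// Abbreviated sketch of the existing style: a static table of function
// pointers.
struct halide_device_interface {
    int (*device_malloc)(void *user_context, buffer_t *buf);
    int (*device_free)(void *user_context, buffer_t *buf);
    int (*copy_to_device)(void *user_context, buffer_t *buf);
    int (*copy_to_host)(void *user_context, buffer_t *buf);
};

// Extra backend state can be carried by embedding the table in a larger
// struct. The catch described above still applies: Halide treats the
// interface pointer as an identity, so every distinct {order, type} pair
// becomes a distinct, never-freed interface.
struct opencl_texture_interface {
    halide_device_interface base;  // must come first so the pointers coincide
    cl_channel_order order;
    cl_channel_type type;
};
```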

Really, I don't see much use in trying to put a generic interface between the Halide compiler and functionality that is extremely specific to a given GPU API. In that case the support is inherently going to be specific to that API, and one may as well design a mechanism that works well for it. For functionality that is sufficiently generic, we can just extend the device interface API to handle it.

bleibig commented 8 years ago

I understand the issues with using vtables in the runtime. I've been trying to prototype a solution, but whenever runtime code was compiled for OS X, I got: LLVM ERROR: MachO doesn't support COMDATs, '...' cannot be lowered. I'm not sure what's causing that, but I think it might be a compiler flag used for the runtime code.

What I was aiming for with that idea is this: every device allocation in OpenCL can have a different image format configuration, the buffer_t does not carry enough information to infer that configuration, and that information is needed by multiple device_interface functions (device_malloc, copy_to_device, and copy_to_host) so they pass the correct values to the OpenCL calls. So every device buffer would have its own device_interface object, able to share its extra state with all of the necessary device_interface functions over the lifetime of that buffer. Other GPU runtimes that have no use for extra state wouldn't need to change much; they could still use a single global object.
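Concretely, a call site under that proposal might look like the following; halide_opencl_texture_device_interface is hypothetical, and the other signatures are only roughly those of HalideRuntime.h from that era:

```cpp
#include <CL/cl.h>
#include "HalideRuntime.h"

// Hypothetical factory: builds (or caches) an interface instance that knows
// the image format for this particular allocation.
extern "C" const halide_device_interface *
halide_opencl_texture_device_interface(cl_channel_order order,
                                       cl_channel_type type);

void allocate_rgba_texture(void *user_context, buffer_t *buf) {
    const halide_device_interface *iface =
        halide_opencl_texture_device_interface(CL_RGBA, CL_UNORM_INT8);
    // The same instance, and hence the same format state, is seen by every
    // device call made for this buffer over its lifetime.
    halide_device_malloc(user_context, buf, iface);
    halide_copy_to_device(user_context, buf, iface);
    // ... later, halide_copy_to_host / halide_device_free find the same
    // interface through the buffer ...
}
```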

Allocating images manually and using halide_opencl_wrap_cl_mem can work, but only for input/output buffers that the user has direct access to when doing AOT compilation. I don't see how that will work for intermediate buffers in a pipeline, or when using the JIT. The compiler can figure out all of the details about how to allocate and describe CL images fairly easily through an IRVisitor pass; the problem is just communicating that extra information to the runtime. That is why I think some refactoring of the runtime might be necessary.