Feature Suggestion: Use RenderingDevice compute shader API to do large FFTs on GPU for performance

tessarakkt commented 1 year ago

Implementing the (I)FFT in a compute shader could significantly help to speed up the process time for larger FFTs.

As part of my ocean system addon, I have a Stockham IFFT implemented in a compute shader that is able to perform IFFT on multiple 2D 256x256 textures each frame at 100+FPS (if the vertex count of the water surface doesn't put too much load on the GPU). I could help in integrating that existing code into your project if you were interested.

This is all done with GDScript and GLSL shaders, so no C++, C#, or custom engine builds are needed.

These are the files which contain relevant code:

addons/tessarakkt.oceanfft/components/Ocean3D.gd
- See section titled by comment "Compile & Initialize FFT Shaders" in _ready() for initialization code
- See section titled by comment "Execute Horizontal FFT Shader Cascades" in _process() for execution code
addons/tessarakkt.oceanfft/shaders/OceanFFTHorizontal.glsl
addons/tessarakkt.oceanfft/shaders/OceanFFTVertical.glsl

tavurth commented 1 year ago

Very cool idea, I would generally like to keep the current version for compatibility as my Macbook is still too low end to really make use of vulcan compute shaders 😢

However I see that you have a MIT license on there, how would you feel about my inclusion of those files with some editing in this library?

I probably won't have time in the next few weeks but it's a cool thing to put on my TODO list 👍

Alternately if you would like to create a PR you are more than welcome!

tessarakkt commented 1 year ago

Yeah feel free to include the code from my project as needed. I may do some more work on this over the weekend as well.

I agree it would be best to leave the existing CPU implementation largely untouched, and implement the compute shader version alongside it. It will still provide a viable option for smaller data sets where GPU accelleration wouldn't yield as much benefit, and would be easier to use than the GPU version.

I drafted the following as my thoughts on the API end of the GPU implementation. I opted against implementing it within a singleton, as the 2D FFTs depend on an internal ping pong buffer to handle the recursive nature of the FFT without causing bottlenecks. If multiple threads were to use the same internal state at the same time, that would not be a fun time. That being said, I think multiple threads trying to use compute shaders simultaneously is not really useful, as especially with larger FFTs a single CPU thread could easily start bottlenecking the GPU.

The GPU implementation is intentionally out-of-place rather than in-place as in the CPU implementation. In-place FFT is optimized for CPU cache alignment, and is good on a CPU, but due to the highly parallel nature of GPUs, in-place modification of textures can cause thread serialization.

The non-RID based functions are intended to be useful for most people. The RID based functions are intended to be used internally by the non-RID ones, as well as by more advanced users who will be passing input from, or output to, other shaders within the same RenderingDevice instance.

Functions

`GPUFFT.new(new_rd: RenderingDevice, new_resolution: int, new_format: RenderingDevice.DataFormat)`

Initializes a new GPUFFT instance using the provided RenderingDevice as a backend. This must be a local device as provided by RenderingServer.create_local_rendering_device(), the engines main device returned by RenderingServer.get_rendering_device() will not allow compute shaders to be executed. Initializes the FFT shaders, shader piplines, and ping pong buffers buffer that are resolution squared in size, which are used internally as during shader execution.

`gpufft_instance.fft(input: Array[float]) -> Array[Complex]`

`gpufft_instance.ifft(input: Array[Complex]) -> Array[Complex]`

Perform a GPU accelerated 1D (I)FFT on the provided data and return a new array containing the result. As this marshalls the input data from CPU to GPU, and then the result back from GPU to CPU, this can have a significant impact on performance if overused.

`gpufft_instance.fft_2d_image(input: Image) -> void`

`gpufft_instance.ifft_2d_image(real_input: Image, imaginary_input: Image) -> void`

Perform a GPU accelerated 2D (I)FFT on the provided data. The result is stored in VRAM and must be retrieved with get_real_image() or get_imaginary_image(). As this marshalls the input data from CPU to GPU this can have a significant impact on performance if overused.

`gpufft_instance.get_real_image() -> Image`

Get an Image that contains the real component of the most recent (I)FFT operation. 2D (I)FFTs will use the entire size of the image. 1D FFTs will use the top row only. As this marshalls the result data back from GPU to CPU this can have a significant impact on performance if overused.

`gpufft_instance.get_imaginary_image() -> Image`

Get an Image that contains the imaginary component of the most recent (I)FFT operation. 2D (I)FFTs will use the entire size of the image. 1D FFTs will use the top row only. As this marshalls the result data back from GPU to CPU this can have a significant impact on performance if overused.

`gpufft_instance.fft_2d_rid(input: RID) -> void`

`gpufft_instance.ifft_2d_rid(input: RID) -> void`

Perform an (I)FFT on a texture, first on each row of pixels, then on each column. The provided RID must point to a texture that is valid within the context of this GPUFFT instances RenderingDevice, the texture must be sized as a square power of 2 (64x64, 128x128, 256x256, etc.), and that size must match the GPUFFT instances resolution property. Does not marshall data between CPU and GPU.

`gpufft_instance.fft_2d_rid_horizontal(input: RID) -> void`

`gpufft_instance.ifft_2d_rid_horizontal(input: RID) -> void`

Perform an (I)FFT on each row of pixels in a texture. The provided RID must point to a texture that is valid within the context of this GPUFFT instances RenderingDevice, the texture must be sized as a square power of 2 (64x64, 128x128, 256x256, etc.), and that size must match the GPUFFT instances resolution property. Does not marshall data between CPU and GPU.

`gpufft_instance.fft_2d_rid_vertical(input: RID) -> void`

`gpufft_instance.ifft_2d_rid_vertical(input: RID) -> void`

Perform an (I)FFT on each column of pixels in a texture. The provided RID must point to a texture that is valid within the context of this GPUFFT instances RenderingDevice, the texture must be sized as a square power of 2 (64x64, 128x128, 256x256, etc.), and that size must match the GPUFFT instances resolution property. Does not marshall data between CPU and GPU.

`gpufft_instance.get_real_rid() -> RID`

Return an RID that points to a texture valid within the context of this GPUFFT instances RenderingDevice which contains the real component of the most recent (I)FFT operation. 2D (I)FFTs will use the entire size of the image. 1D FFTs will use the top row only. Does not marshall data between CPU and GPU.

`gpufft_instance.get_imaginary_rid() -> RID`

Return an RID that points to a texture valid within the context of this GPUFFT instances RenderingDevice which contains the imaginary component of the most recent (I)FFT operation. 2D (I)FFTs will use the entire size of the image. 1D FFTs will use the top row only. Does not marshall data between CPU and GPU.

tavurth / godot-fft