halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org
Other
5.89k stars 1.07k forks source link

Halide should offer first-class Complex number support #7456

Open steven-johnson opened 1 year ago

steven-johnson commented 1 year ago

A lot of math/ML-related workloads need support for complex numbers; there are ways to deal with this in Halide that aren't awful (see apps/fft for a good example), we should really offer something better "out of the box".

Thought 1: take the helpers in apps/fft/complex.h and move them into Halide itself. Fairly easy to do, but not as 100% transparent to use as it could be.

Thought 2: add Complex(32) and Complex(64) as new fundamental Types in Halide. Appealing and arguably cleaner, but has a number of corner cases that would have to be thought out (e.g. the many places that assume that each element of a buffer is a scalar value).

steven-johnson commented 1 year ago

Tagging @abadams and @zvookin for their thoughts.

abadams commented 1 year ago

A native complex type is similar to adding a packed RGB pixel type. You get horrible performance everywhere because it's not really compatible with vectorization. Complex values should probably be represented as Tuples, which is what apps/fft/complex.h does. I'm on the fence about whether the wrapper type in apps/fft/complex.h really pays for itself though. You need FuncT as well, which has to forward a lot of FuncRef operators.

The alternative is complex_mul/add/sub(Tuple, Tuple) free functions, which is ugly for different reasons.

spinicist commented 1 year ago

Hello - commenting here as my question in Gitter prompted @steven-johnson to post this. I know very little about Halide.

I'm looking at using Halide with my existing code which is based on Eigen. Eigen are intermittently working on vectorizing complex functions, and the consensus is that while it isn't easy, it is possible to do.

The other idea which comes up often is the struct-of-array approach - i.e. maintain separate real and imaginary arrays rather than packing complex elements. I understand that this is much more amenable to vectorization, but getting data into/out becomes messy. I don't know whether that approach would make sense with Halide?

abadams commented 1 year ago

Using Tuples is the struct-of-array approach - you get separate allocations for the real and imaginary components. A native complex type would be an array of structs. It's true that it's possible to vectorize, but it's probably never going to be as fast a struct-of-arrays.

spinicist commented 1 year ago

Oh that's cool - thanks for explaining!

steven-johnson commented 1 year ago

The high-order bit here is that this isn't the first time we've gotten questions about Complex math in Halide; at a minimum, we should have some documentation or guidance on it somewhere that is hopefully better than "look at apps/fft".

I haven't used the wrappers in apps/fft.h enough to have a strong opinion on whether they are a worthwhile abstraction vs. needless overhead. My gut says that something with some syntactic sugar would be an improvement over a naked Halide::Tuple. If we need FuncT (or something like it) to make the syntactic sugar sweet enough, well, in my mind that's an argument in favor of bringing it into core Halide IMHO.

antonysigma commented 1 year ago

As a (former) computational imaging researcher, complex types represents >90% of the Halide accelerated algorithms coded in the lab. We perform Zernike aberration retrieval, and digital phase retrieval from intensity images. I have been using the apps/fft/complex.h and ComplexExpr to serve as the syntactic sugar. This header file is sufficient for my use case.

Here, I will share the pain points of the complex types support in Halide.

Boundary condition support One feature I like about first-class support of complex types, is the ability to apply BoundaryCondition to ComplexFunc. Currently, I hand-roll my own function to implement things like constant_exterior and repeat_edge. We take advantage of BoundaryCondition in the Complex Fourier domain to reduce compute cost.

For example, constant_exterior zeropads the Fourier domain on demand, effectively interpolates the image in the space domain. Similarly, repeat_image in the Fourier domain interleaves the image pixels with zeros.

FFT of large sized, multi-layer image The apps/FFT/fft.h implementation is only practical for image size of 64x64 or less; anything larger than that will overwhelm typical PC RAMs, e.g. 16GB. If the FFT needs to be computed on the GPU, I would restructure the ComplexFunc to a Func with a mux(), then invoke cuFFT via define_extern. I am not sure if we can (or should) do the same with FFTW library for CPU-only schedules. And I think FFT support should be excluded from the core support for this reason.

Vectorization with interleave_vector isn't too bad in practice The suboptimal CPU vectorization "array of struct" isn't too bad in practice, considering the alternative (Matlab or Numpy or Armadillo) always compute and store data in the heap memory. With Halide, I can control how many stages can be fused, and whether the complex value should be stored in stack or registers, improving the overall speed. So, for my use cases, the benefits of "array of tuples" outweighs the costs.

spinicist commented 1 year ago

FFT of large sized, multi-layer image The apps/FFT/fft.h implementation is only practical for image size of 64x64 or less; anything larger than that will overwhelm typical PC RAMs, e.g. 16GB.

Am I reading this right that a 64x64 FFT will consume gigabytes of RAM in Halide? Because that ought to take 32 kilobytes.

antonysigma commented 1 year ago

Am I reading this right that a 64x64 FFT will consume gigabytes of RAM in Halide?

To clarify, >=32GB refers to the AOT generator peak memory consumption, compiled with Halide <= v2019.* withgcc compiler optimization -O1. I didn't know whether the Halide IR exhausted all memory, or the -e static_library,h,stmt_html is causing the problem. I recalled whenever I attempt to generate the FFT2 routine with size 256 x 256 x 50, the AOT memory fills up the RAM and swap space, slowing down, till it is automatically killed by the Linux OS.

Perhaps the gcc or libHalide.so has improved since year 2019. I will test the apps/fft/fft.h code again when I have spare time; likely measuring the peak memory consumption of the AOT with tcmalloc.

spinicist commented 1 year ago

Thank you for clarifying. That makes much, much more sense!

abadams commented 1 year ago

apps/fft generates a fully-unrolled 2D fft. It's designed for algorithms that want to take ffts of small tiles.