Question: What is the state of SIMD support in circle?

samuelpmish commented 3 years ago

Hi, I'm interested in using circle's metaprogramming tools to extend an existing codebase that makes use of SIMD intrinsics, but the code fails to compile. A small example (code below)

#include <immintrin.h>

struct vec4 {
    double values[4];
    double & operator()(int i) { return values[i]; }
    const double & operator()(int i) const { return values[i]; }
};

vec4 plus(const vec4 & x, const vec4 & y) {
    return vec4{x(0) + y(0), x(1) + y(1), x(2) + y(2), x(3) + y(3)};
}

vec4 operator+(const vec4 & x, const vec4 & y) {
    __m256d _x = _mm256_loadu_pd(x.values);
    __m256d _y = _mm256_loadu_pd(y.values);
    vec4 sum;
    _mm256_storeu_pd(sum.values, _mm256_add_pd(_x, _y));
    return sum;
}

illustrates the problem: errors to the tune of

vector_size 64 is unsupported
typedef char __v64qi __attribute__ ((__vector_size__ (64)));

Meanwhile, the 128-bit intrinsics seem to compile successfully:

#include <xmmintrin.h>

struct vec2 {
    double values[2];
    double & operator()(int i) { return values[i]; }
    const double & operator()(int i) const { return values[i]; }
};

vec2 plus(const vec2 & x, const vec2 & y) {
    return vec2{x(0) + y(0), x(1) + y(1)};
}

vec2 operator+(const vec2 & x, const vec2 & y) {
    __m128d _x = _mm_loadu_pd(x.values);
    __m128d _y = _mm_loadu_pd(y.values);
    vec2 sum;
    _mm_storeu_pd(sum.values, _mm_add_pd(_x, _y));
    return sum;
}

With gcc/clang, flags (e.g. -mavx) are sometimes needed to enable the larger intrinsics. I was wondering whether something similar needs to be passed to circle, or whether these are currently unsupported.

These examples were tested against build_136 on Ubuntu 20.04, as well as compiler explorer's "Latest" circle option (unsure what version that actually is).
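
For readers following the -mavx question: a minimal sketch of guarding the 256-bit path behind the `__AVX__` macro, which gcc and clang predefine when AVX code generation is enabled (for example via -mavx). Whether circle defines the same macro is an assumption that nothing in this thread confirms, and `add4`, `built_with_avx`, and `check_add4` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>  // only pulled in when the AVX path is compiled
#endif

struct vec4 {
    double values[4];
};

// True when this translation unit was compiled with AVX enabled.
constexpr bool built_with_avx() {
#if defined(__AVX__)
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: 256-bit add when available, scalar loop otherwise.
vec4 add4(const vec4& x, const vec4& y) {
    vec4 sum;
#if defined(__AVX__)
    _mm256_storeu_pd(sum.values,
                     _mm256_add_pd(_mm256_loadu_pd(x.values),
                                   _mm256_loadu_pd(y.values)));
#else
    for (std::size_t i = 0; i < 4; ++i) {
        sum.values[i] = x.values[i] + y.values[i];
    }
#endif
    return sum;
}

// Usage check: whichever path was compiled, the sums are the same.
void check_add4() {
    const vec4 a{{1.0, 2.0, 3.0, 4.0}};
    const vec4 b{{10.0, 20.0, 30.0, 40.0}};
    const vec4 s = add4(a, b);
    assert(s.values[0] == 11.0 && s.values[3] == 44.0);
}
```

Calling `built_with_avx()` is a quick way to see which path a given set of compiler flags selected.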

mrzawmyowin commented 3 years ago

https://github.com/seanbaxter/circle/issues/49#issue-1016641379

seanbaxter commented 3 years ago

I haven't made a systematic effort to code the SSE/AVX/AVX-512 intrinsics. There are thousands of variants and I have other priorities right now. I have coded a lot of intrinsics already, but that's because they're in some other critical path, like in libstdc++ or boost or other common libraries. I don't have a date in mind on when these should be ready. There's nothing difficult about doing this... It's really just an issue of available manpower.

csp256 commented 3 years ago

Does available manpower in this context mean just you, or is there a realistic way another person could help?

samuelpmish commented 3 years ago

> There are thousands of variants and I have other priorities right now.

That's totally understandable, thanks for the quick response!

> There's nothing difficult about doing this... It's really just an issue of available manpower.

What's the best way for people to help improve circle? I'd definitely be willing to help out, if possible.

seanbaxter commented 3 years ago

Thanks for the interest. It's just me. No realistic way of other people helping at this point. I'll get around to SSE/AVX, and it won't take that long, but I'm working on a different feature right now that's taking 100% of my time.

cloudhan commented 3 years ago

Is it possible to embed a SIMT programming model in circle and mimic what ISPC has achieved?

seanbaxter commented 3 years ago

> Is it possible to embed a SIMT programming model in circle and mimic what ISPC has achieved?

I think it is. I looked into this a few months ago. I might return to it. It will require a lot of research; it's not just a coding problem. It will take some imagination to figure out the inter-function bindings that will allow us to treat normal functions as SIMT functions.

cloudhan commented 3 years ago

> It will take some imagination to figure out the inter-function bindings that will allow us to treat normal functions as SIMT functions.

Hmm, what about treating the entry-point function the way CUDA treats __global__, and treating normal functions as __device__, just like CUDA? The difference is that there is no physical device for CPU SIMT code and no separate memory space, so there is no need to copy data back and forth. I'd imagine reusing the CUDA frontend and swapping the backend for ISPC-like codegen is a viable option.

seanbaxter commented 3 years ago

No, ISPC is totally different from CUDA. The compiler has to do its own lane mask for dynamic branching, and issues SIMD instructions with explicit masking instead of using scalar instructions. And when you call another function from inside a dynamic branch? The called function has to be passed the lane mask, and now it has to do masking. There's quite a lot to figure out. There's no ISPC-like LLVM target.
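
The mask bookkeeping described here can be emulated in portable C++, assuming a fixed 8-lane width and plain loops standing in for real masked SIMD instructions. All names (`Lanes`, `Mask`, `masked_sqrt`, `masked_negate`, `kernel`) are hypothetical; this is not how ISPC or circle implement it, only a sketch of why a callee needs to receive the mask.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kWidth = 8;         // assumed SIMD width
using Lanes = std::array<float, kWidth>;  // one value per lane
using Mask = std::array<bool, kWidth>;    // true = lane is active

// Callees take the caller's mask and must leave inactive lanes untouched.
void masked_sqrt(Lanes& v, const Mask& active) {
    for (std::size_t i = 0; i < kWidth; ++i) {
        if (active[i]) v[i] = std::sqrt(v[i]);
    }
}

void masked_negate(Lanes& v, const Mask& active) {
    for (std::size_t i = 0; i < kWidth; ++i) {
        if (active[i]) v[i] = -v[i];
    }
}

// Per-lane meaning: if (x >= 0) x = sqrt(x); else x = -x;
// Both sides of the branch run; the masks decide which lanes each may write.
void kernel(Lanes& v, const Mask& entry) {
    Mask then_mask{};
    Mask else_mask{};
    for (std::size_t i = 0; i < kWidth; ++i) {
        const bool cond = v[i] >= 0.0f;
        then_mask[i] = entry[i] && cond;   // narrow the incoming mask
        else_mask[i] = entry[i] && !cond;
    }
    masked_sqrt(v, then_mask);
    masked_negate(v, else_mask);
}

// Usage check: lane 7 is inactive on entry, so it must keep its value.
void check_kernel() {
    Lanes v{{4.0f, -1.0f, 9.0f, -2.0f, 0.0f, 16.0f, -3.0f, 25.0f}};
    Mask entry;
    entry.fill(true);
    entry[7] = false;
    kernel(v, entry);
    assert(v[0] == 2.0f && v[1] == 1.0f && v[7] == 25.0f);
}
```

Every function reachable from a divergent branch ends up with an extra mask parameter, which is the inter-function binding problem raised earlier in the thread.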

samuelpmish commented 3 years ago

Another option is to leverage circle's metaprogramming capabilities together with existing C++ libraries like enoki, which vectorize through use of container templates. In regular C++, the main downside of libraries like enoki is that they require the user to go into existing code and refactor things to use those special containers. e.g.

struct GPSCoord2f {
    uint64_t time;
    Array<float, 2> pos;
    bool reliable;
};

becomes

template <typename Value> 
struct GPSCoord2 {
    uint64_array_t<Value> time;
    Array<Value, 2> pos;
    mask_t<Value> reliable;
};

and then operating on GPSCoord2< Packet<float, 8> >, for example, will implicitly allow 8-wide vectorization.

circle already has the ability to statically reflect on types and define new ones, meaning that this sort of refactoring could potentially be automated-- eliminating one of the biggest hurdles to adopting utility libraries like enoki.

However, the functions that use these types would also need to be refactored, and I'm not sure if circle provides a way to do metaprogramming on functions[^1] in the same way it does on types.

[^1]: But it would be cool if it did! A lot of challenging problems (like automatic differentiation) are just metaprogramming applied to functions.
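
To make the refactoring concrete, here is a minimal sketch of the pattern in plain C++. `Packet` below is a hypothetical stand-in written for this example, not enoki's actual `Packet` type, and `Point2`, `sum_xy`, and `check_sum_xy` are likewise made-up names. The point is only that one template can serve both the scalar and the N-wide case.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical packet type: N values processed together.
template <typename T, std::size_t N>
struct Packet {
    std::array<T, N> lanes;

    // Lane-wise addition; a real library would map this to SIMD instructions.
    friend Packet operator+(const Packet& a, const Packet& b) {
        Packet r{};
        for (std::size_t i = 0; i < N; ++i) {
            r.lanes[i] = a.lanes[i] + b.lanes[i];
        }
        return r;
    }
};

// The struct is templated on Value instead of hard-coding float.
template <typename Value>
struct Point2 {
    Value x;
    Value y;
};

// The function must be templated too, which is the part that is hard to automate.
template <typename Value>
Value sum_xy(const Point2<Value>& p) {
    return p.x + p.y;
}

// Usage check: the same template works on one float or on four at once.
void check_sum_xy() {
    const Point2<float> scalar{1.0f, 2.0f};
    assert(sum_xy(scalar) == 3.0f);

    using P4 = Packet<float, 4>;
    const Point2<P4> wide{P4{{{1.0f, 2.0f, 3.0f, 4.0f}}},
                          P4{{{10.0f, 20.0f, 30.0f, 40.0f}}}};
    const P4 r = sum_xy(wide);
    assert(r.lanes[0] == 11.0f && r.lanes[3] == 44.0f);
}
```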

seanbaxter commented 3 years ago

The issue with vector types and masks is actually performing control flow. It's always really ugly when you use a library. We should aim for a SIMT-style system that feels more like CUDA or shaders, where the lane mask is implicit. I don't think metaprogramming is The Way. Must make it a first-class language feature.

ghost commented 2 years ago

SIMD support is actually the only thing preventing me from using circle to compile SDL apps right now; I would definitely use it more often.

seanbaxter commented 2 years ago

I'm actively working on it.

samuelpmish commented 2 years ago

Hi, I saw that build 143 introduced support for SIMD instructions, and it successfully resolves the original issue topic 🎉.

I've played with it a little bit more and run into another small snag where circle will successfully instantiate most of the operator+ templates in the example below, but fail on one of them.

error reproducer in circle: https://godbolt.org/z/hEe7T9Kb5
same code compiled w/ gcc: https://godbolt.org/z/P7qWfMczf

seanbaxter commented 2 years ago

> I've played with it a little bit more and run into another small snag where circle will successfully instantiate most of the operator+ templates in the example below, but fail on one of them.

Thanks for the report. If you're curious about the nature of the bug, it has to do with how return types are formed with the SSEUP eightbyte class from the System V ABI: https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-1.0.pdf

I think there is some register coalescing I'm not doing correctly. Looking into it.
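
For context, a small sketch of the kind of type that involves the SSEUP class. It is not the reproducer from the godbolt links above, which is not reproduced in this thread. It assumes the GCC/Clang `vector_size` extension, and `v2d`, `wrapped`, `add_wrapped`, and `check_add_wrapped` are hypothetical names. In the x86-64 psABI, a 128-bit vector is split into two eightbytes, the low one classified SSE and the high one SSEUP, and a 256-bit vector is one SSE eightbyte followed by three SSEUP eightbytes.

```cpp
#include <cassert>

// 16-byte vector of two doubles (GCC/Clang vector extension).
typedef double v2d __attribute__((vector_size(16)));

// Under the x86-64 System V ABI this struct is classified as two eightbytes:
// the low one SSE, the high one SSEUP, so it travels in one vector register.
struct wrapped {
    v2d v;
};

// Passes and returns the struct by value, which exercises that classification.
wrapped add_wrapped(wrapped a, wrapped b) {
    return wrapped{a.v + b.v};  // element-wise add on the vector type
}

// Usage check: both lanes of the returned vector must be intact.
void check_add_wrapped() {
    v2d lo = {1.0, 2.0};
    v2d hi = {10.0, 20.0};
    wrapped s = add_wrapped(wrapped{lo}, wrapped{hi});
    assert(s.v[0] == 11.0 && s.v[1] == 22.0);
}
```

A compiler that mishandles the SSEUP part would typically show up as a wrong upper half in the returned value.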

seanbaxter commented 2 years ago

The SSEUP parameter passing bug is fixed and the fix will be in the next build.
