Closed: samuelpmish closed this issue 1 year ago
I haven't made a systematic effort to code the SSE/AVX/AVX-512 intrinsics. There are thousands of variants and I have other priorities right now. I have coded a lot of intrinsics already, but that's because they're in some other critical path, like in libstdc++ or boost or other common libraries. I don't have a date in mind on when these should be ready. There's nothing difficult about doing this... It's really just an issue of available manpower.
Does available manpower in this context mean just you, or is there a realistic way another person could help?
There are thousands of variants and I have other priorities right now.
That's totally understandable, thanks for the quick response!
There's nothing difficult about doing this... It's really just an issue of available manpower.
What's the best way for people to help improve circle? I'd definitely be willing to help out, if possible.
Thanks for the interest. It's just me. No realistic way of other people helping at this point. I'll get around to SSE/AVX, and it won't take that long, but I'm working on a different feature right now that's taking 100% of my time.
Is it possible to embed a SIMT programming model in circle and mimic what ISPC has achieved?
I think it is. I looked into this a few months ago. I might return to it. It will require a lot of research; it's not just a coding problem. It will take some imagination to figure out the inter-function bindings that will allow us to treat normal functions as SIMT functions.
It will take some imagination to figure out the inter-function bindings that will allow us to treat normal functions as SIMT functions.
Hmm, what about treating the entry point function as what __global__ is in CUDA and treating normal functions as __device__? Just like CUDA. The difference is that there is no physical device for CPU SIMT code and no separate memory space, thus no need to copy the data back and forth. I'd imagine reusing the CUDA frontend and swapping the backend to ISPC-like codegen is a viable option.
No, ISPC is totally different from CUDA. The compiler has to do its own lane mask for dynamic branching, and issues SIMD instructions with explicit masking instead of using scalar instructions. And when you call another function from inside a dynamic branch? The called function has to be passed the lane mask, and now it has to do masking. There's quite a lot to figure out. There's no ISPC-like LLVM target.
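To make that concrete, here is a hand-written sketch (my own illustration, not actual ISPC output) of what the generated code for a single per-lane branch has to look like: both sides execute, the lane mask selects per lane, and any callee reached from inside the branch has to be handed the mask too.

#include <immintrin.h>

// Scalar source being modeled, per lane:
//   float f(float x) { if (x > 0) x = g(x); else x = -x; return x; }

// A function called under divergent control flow must receive the lane mask,
// because its own stores and nested branches need masking as well.
__m256 g(__m256 x, __m256 mask) {
    (void)mask;                        // a real callee would apply it
    return _mm256_mul_ps(x, x);
}

__m256 f(__m256 x) {
    __m256 mask   = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ); // x > 0
    __m256 then_v = g(x, mask);                                        // "if" side
    __m256 else_v = _mm256_sub_ps(_mm256_setzero_ps(), x);             // "else" side
    return _mm256_blendv_ps(else_v, then_v, mask);                     // per-lane select
}

None of that is visible in the scalar source, which is what makes the lowering nontrivial.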
Another option is to leverage circle's metaprogramming capabilities together with existing C++ libraries like enoki, which vectorize through the use of container templates. In regular C++, the main downside of libraries like enoki is that they require the user to go into existing code and refactor things to use those special containers, e.g.
struct GPSCoord2f {
    uint64_t time;
    Array<float, 2> pos;
    bool reliable;
};
becomes
template <typename Value>
struct GPSCoord2 {
    uint64_array_t<Value> time;
    Array<Value, 2> pos;
    mask_t<Value> reliable;
};
and then operating on GPSCoord2<Packet<float, 8>>, for example, will implicitly allow 8-wide vectorization.
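As an example of the payoff, a function written against the templated struct could look like this (my sketch, going from memory on enoki's API, so treat select() and the .x() accessor as assumptions):

// Sketch only: assumes enoki's select(mask, a, b) per-lane blend plus the
// Value / mask_t machinery from the snippet above. The same code runs with
// Value = float (scalar) or Value = Packet<float, 8> (8-wide).
template <typename Value>
Value clamped_x(const GPSCoord2<Value>& c, Value limit) {
    Value x = c.pos.x();                      // first component of the position
    x = select(x > limit, limit, x);          // per-lane min against the limit
    return select(c.reliable, x, Value(0));   // unreliable lanes report 0
}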
circle already has the ability to statically reflect on types and define new ones, meaning that this sort of refactoring could potentially be automated, eliminating one of the biggest hurdles to adopting utility libraries like enoki. However, the functions that use these types would also need to be refactored, and I'm not sure if circle provides a way to do metaprogramming on functions[^1] in the same way it does on types.
[^1]: But it would be cool if it did! A lot of challenging problems (like automatic differentiation) are just metaprogramming applied to functions.
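As a very rough sketch of what that automation might look like for types (Circle's reflection operators quoted from memory here, so treat the spelling as pseudocode rather than guaranteed-working code):

#include <cstdint>

// Map each member type to its widened equivalent, mirroring the manual
// GPSCoord2f -> GPSCoord2<Value> rewrite above (enoki names assumed as before).
template <typename Member, typename Value>
struct widen { using type = Member; };                                   // default: unchanged
template <typename Value> struct widen<float, Value>           { using type = Value; };
template <typename Value> struct widen<uint64_t, Value>        { using type = uint64_array_t<Value>; };
template <typename Value> struct widen<bool, Value>            { using type = mask_t<Value>; };
template <typename Value> struct widen<Array<float, 2>, Value> { using type = Array<Value, 2>; };

template <typename T, typename Value>
struct vectorize_t {
    // Walk T's data members at compile time and redeclare each one with the
    // widened type under the same name.
    @meta for(int i = 0; i < @member_count(T); ++i)
        typename widen<@member_type(T, i), Value>::type @(@member_name(T, i));
};

// vectorize_t<GPSCoord2f, Packet<float, 8>> would then match the hand-written
// GPSCoord2<Packet<float, 8>> member for member. The open question above is
// whether functions can be transformed the same way.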
The issue with vector types and masks is actually performing control flow. It's always really ugly when you use a library. We should aim for a SIMT-style system that feels more like CUDA or shaders, where the lane mask is implicit. I don't think metaprogramming is The Way. Must make it a first-class language feature.
SIMD support is actually the only thing preventing me from using circle to compile SDL apps right now; I would definitely use it more often.
I'm actively working on it.
Hi, I saw that build 143 introduced support for SIMD instructions, and it successfully resolves the original issue topic 🎉.
I've played with it a little bit more and run into another small snag where circle will successfully instantiate most of the operator+ templates in the example below, but fail on one of them.
Error reproducer in circle: https://godbolt.org/z/hEe7T9Kb5
Same code compiled with gcc: https://godbolt.org/z/P7qWfMczf
I've played with it a little bit more and run into another small snag where circle will successfully instantiate most of the operator+ templates in the example below, but fail on one of them.
Thanks for the report. If you're curious about the nature of the bug, it has to do with how return types are formed with the SSEUP eightbyte class from the System V ABI: https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-1.0.pdf
I think there is some register coalescing I'm not doing correctly. Looking into it.
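For anyone following along, the simplest place SSEUP shows up is a 32-byte vector return value; my own minimal example below, not the reproducer above:

#include <immintrin.h>

// __m256 is classified by the x86-64 psABI as four eightbytes with classes
// SSE, SSEUP, SSEUP, SSEUP. SSEUP means "lives in the same vector register as
// the preceding eightbyte", so the return value travels whole in %ymm0 rather
// than being split across registers; that register coalescing is the part
// being fixed.
__m256 add8(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b);
}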
The SSEUP parameter passing bug is fixed and the fix will be in the next build.
Hi, I'm interested in using circle's metaprogramming tools to extend an existing codebase that makes use of SIMD intrinsics, but it seems to be failing to compile. A small example here (code below) illustrates the problem, with errors to the tune of:
Meanwhile, the 128-bit intrinsics seem to compile successfully:
With gcc/clang, flags (e.g. -mavx) are sometimes needed to enable the larger intrinsics; I was wondering if there is something similar that needs to be passed to circle, or if these are currently unsupported. These examples were tested against build_136 on Ubuntu 20.04, as well as Compiler Explorer's "Latest" circle option (unsure what version that actually is).