vorner / slipstream

Nudging the compiler to auto-vectorize things
Apache License 2.0

Comment: check out the safe_arch crate #2

Open Lokathor opened 4 years ago

Lokathor commented 4 years ago

Hey, so a person directed me to your post, which is related to what I've been working on recently. I thought I'd leave a comment about some additional SIMD stuff you might care to look at and use.

My recent work has been a crate called safe_arch, which has the goal of exposing as many parts of core::arch as possible as safe functions, in as direct a way as possible. It doesn't do variations based on runtime feature detection, just `cfg`-based availability.
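The general shape of that pattern is something like the following ‒ a rough sketch with a made-up function name, not safe_arch's actual API ‒ just to show why the `cfg` gate makes the wrapper sound:

```rust
// Sketch of a cfg-gated safe wrapper (illustrative name, not safe_arch's API).
// The function only exists in builds where AVX is enabled (e.g. via
// `-C target-feature=+avx`), so the intrinsic can never fault at runtime.
#[cfg(all(target_arch = "x86_64", target_feature = "avx"))]
pub fn add_m256(
    a: core::arch::x86_64::__m256,
    b: core::arch::x86_64::__m256,
) -> core::arch::x86_64::__m256 {
    // Sound: the `cfg` gate above proves AVX availability at compile time.
    unsafe { core::arch::x86_64::_mm256_add_ps(a, b) }
}
```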

I was previously developing a crate called wide, but then decided to pull the inner "safe intrinsics" part out into its own crate. Once safe_arch is complete enough, I'll transition the next version of wide over to being safe_arch based. The wide crate is closer to what you seem to have been working on in the blog post: it has types like f32x4 and i32x4, and each operation is cross-platform, aiming to be "the best version of this method you can get on this platform with these compile settings" ‒ some of which involves trying to trick the LLVM auto-vectorizer into doing the right thing.

Anyway, just thought you'd want the heads up about more crates to look at.

vorner commented 4 years ago

Hello

I've taken the liberty of transferring the issue from the blog repo to the library repo, which seems a more appropriate place for it.

I think I've seen safe_arch somewhere already. It seems like a nice thing if one wants to go the explicit-intrinsics way. But I think it might be incompatible with runtime detection?

When I looked, wide seemed somewhat incomplete and abandoned, so I assumed it was mostly dead. Now I see that was just a wrong impression. If you see a way to reuse some effort, I'm not against it ‒ though I expect slipstream to keep using just the arrays and nothing explicit. I don't know what happens if you use e.g. two SSE intrinsics and then slap #[target_feature(enable = "avx2")] on the function ‒ I'd expect that to still emit two SSE instructions, whereas the array approach might generate one AVX instruction. A toy sketch of what I mean by the array approach follows.
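This is only an illustration of the idea, not slipstream's actual code:

```rust
// Toy sketch of the array-only approach (not slipstream's actual code).
// With AVX2 enabled on this one function, the auto-vectorizer is free to
// compile the loop into a single 256-bit vaddps.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_arrays(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}
```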

More generally, the whole SIMD area seems to be beginning to stir, which is good, but it's unclear what will eventually come out of it.

Lokathor commented 4 years ago

Good to have a home for this! I was linked to your blog without being told of the reddit post so there wasn't any obvious place to put it, but the slipstream repo is fine.

You might have heard of safe_arch before because I posted it on r/rust about a month ago. As written, yes, safe_arch is incompatible with runtime detection; it only uses compile-time cfg analysis. However, in terms of the safety analysis, docs, and unified naming, things aren't really different between runtime detection and compile-time detection. So feel free to dig through safe_arch for ideas, or even convert it all over to using a runtime feature token system (roughly the shape sketched below) via some script.
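By a feature token system I mean roughly this ‒ a minimal sketch with made-up names, not an existing API:

```rust
// Minimal sketch of a runtime feature token (made-up names). Constructing
// the token performs the runtime check once; holding a copy of it afterwards
// is proof the feature exists, so safe wrappers can take it by value.
#[cfg(target_arch = "x86_64")]
#[derive(Clone, Copy)]
pub struct Avx2Token(());

#[cfg(target_arch = "x86_64")]
impl Avx2Token {
    pub fn detect() -> Option<Self> {
        if is_x86_feature_detected!("avx2") {
            Some(Avx2Token(()))
        } else {
            None
        }
    }
}
```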

For wide, good point that it seemed abandoned ‒ that's partly my fault. I updated the readme just now so that it says what's going on there. Usually I only end up promoting my libs directly to others on Discord, so I don't always remember to write down things in the repo that maybe need to be written down.

I think the biggest opportunity for sharing work between crates such as wide and slipstream (and any others?) is not in the basic ops part of things. It's in the advanced formulas part of things. Once you get up past individual add and mul operations and start trying to do methods for sine and cosine and stuff, it's probably best to make a reference implementation for how to do an operation in a lane count agnostic way. Perhaps just with pseudo-code rather than specific Rust code since math is pretty pseudo-code friendly in general and this would be targeted at cross-crate usage. Then people can convert the pseudo-code to their particular crate's conventions for the SIMD types they support (4-lane, 8-lane, etc, even 2-lane versions if you want to support that on ARM).
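As a taste of what such a reference implementation might look like, here's a deliberately tiny example written as generic Rust rather than real pseudo-code. It's only a truncated Taylor series for sine, valid near zero ‒ a real shared formula would do range reduction first ‒ but it shows the lane-count-agnostic style:

```rust
use core::ops::{Add, Mul, Sub};

// Toy lane-count-agnostic sine: sin(x) ≈ x - x^3/6 + x^5/120 for small |x|.
// `V` can be any vector (or scalar) type with element-wise ops; `splat`
// broadcasts a scalar constant into it.
fn sin_small<V>(x: V, splat: impl Fn(f32) -> V) -> V
where
    V: Copy + Add<Output = V> + Sub<Output = V> + Mul<Output = V>,
{
    let x2 = x * x;
    let x3 = x2 * x;
    let x5 = x3 * x2;
    x - x3 * splat(1.0 / 6.0) + x5 * splat(1.0 / 120.0)
}
```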

In terms of long term SIMD progress, I think that the biggest blocker is the lack of ARM/Neon support. However, that's a huge project and so I totally understand that no one has just magically dumped all 4000 Neon intrinsics out of the blue.

vorner commented 4 years ago

Maybe I'm a bit confused. Originally I did want to use the intrinsics, but the current code ‒ and my current plan ‒ is to not touch any intrinsics at all, in any way (except in benchmarks, that is).

That way I don't really have to worry about safety (well, not the intrinsic-related kind; I still have some pointer casting there and such).

It also means I don't really care about the number of lanes ‒ the auto-vectorizer should take care of that, and of the Neon stuff too.

I'll probably try pushing to have the runtime detection and #[target_feature] Neon support stabilized, but that doesn't need all 4k intrinsics.

Lokathor commented 4 years ago

Ah.

Well, even if you're not touching any intrinsics, you'd still want to provide the more complex math functions like sine and cosine and so on.

vorner commented 4 years ago

Yes, of course (eventually). But I expect to do it the same way I do everything else ‒ call sin on each lane in a for cycle and construct a new vector from the results. I hope this will get auto-vectorized like the add and such.

Do you think there's a reason why this wouldn't work?

Or do you mean something else than lane-wise sin?

Lokathor commented 4 years ago

Oh, I see what you wanted. Yeah that won't work.

Basically, any function that won't be inlined won't even be eligible for the auto-vectorizer to look at.

There would be some mild amount of instruction-level parallelism, but it wouldn't be done in SIMD.

vorner commented 4 years ago

> Basically, any function that won't be inlined won't even be eligible for the auto-vectorizer to look at.

That shouldn't be a problem. The primitive types' sin and similar functions are marked #[inline], so rustc should try to inline them. Similarly, I usually mark these „compound“ functions as inline, because I hope they'll turn into a single vector instruction and get inlined. That has worked quite well for all the bits I've tried so far. It's also somewhat important for propagation of the #[target_feature] attributes. A made-up example of what I mean follows.
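This isn't code from slipstream, just an illustration of the kind of „compound“ function I have in mind:

```rust
// Made-up example of a small "compound" function. It's tiny, so the #[inline]
// hint reliably sticks, and a caller compiled with wider target features can
// then auto-vectorize the loop.
#[inline]
pub fn mul_add_4(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = a[i] * b[i] + c[i];
    }
    out
}
```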

I'll just have to come up with a nice example or benchmark (or, do you have one?) and try it out.

Lokathor commented 4 years ago

Well, inline is not "always inline this"; it's actually "allow cross-crate inlining, and also reduce the threshold to inline". Anything big enough won't end up getting inlined anyway, and most of libm drifts towards the bigger end of things.

There's also inline(always), which really aggressively says to inline the thing, but even then it's technically a hint, and libm doesn't use inline(always) anyway.

Secondly, anything with `if` branching in it will not get turned into vectorized instructions by the compiler. You have to rearrange the code manually so that you're always executing both paths and then blending the lanes at the end.
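Something in this shape stays vectorizable ‒ a made-up illustration, with both paths computed unconditionally and a per-lane select at the end:

```rust
// Made-up illustration of the "compute both paths, then blend" rewrite.
// Both arms are evaluated for every lane; the per-element select typically
// lowers to a compare + blend instruction rather than a branch, so the
// auto-vectorizer can still handle the loop.
pub fn blend_example(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        let if_true = a[i] * 2.0; // "then" path, always computed
        let if_false = b[i] + 1.0; // "else" path, always computed
        out[i] = if a[i] >= 0.0 { if_true } else { if_false };
    }
    out
}
```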

If you throw it into the godbolt compiler explorer, it'll be fairly plain that cos isn't being converted to vector form; you'll just get four calls to cos in a row as you iterate the [f32; 4] array.
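A minimal thing to paste in (my own quick illustration, not code from either crate) would be something like:

```rust
// With optimizations on, this compiles to four scalar cosf calls in a row,
// not a vector cosine: there's no hardware vcos for the auto-vectorizer to
// target, and the libm routine is too big to inline and vectorize.
pub fn cos4(v: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = v[i].cos();
    }
    out
}
```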

vorner commented 4 years ago

Seems I've learned something again. I had assumed that x86_64 assembly has a special instruction for sin, and a vectorized one too ‒ and in that case I'd indeed expect the compiler to turn four consecutive sin instructions into some kind of vsin.

But digging through some sources, it really seems not to be the case ‒ all of these are built from more primitive operations in software. Your comments about sharing the code to implement them make much more sense to me now. I guess you're right that sharing such code would be useful, because a glimpse at some random implementation doesn't exactly look like a canonical example of „easy“ 😇. It also makes sense that this doesn't get auto-vectorized ‒ but doing those primitive operations on the vector types should still allow the compiler to auto-vectorize all the additions and multiplications and such.

So thanks for the enlightenment. I guess implementing trigonometric functions will wait a bit longer.