vorner / slipstream

Nudging the compiler to auto-vectorize things

comment on SIMD and scalable vectors #1

Open programmerjake opened 4 years ago

programmerjake commented 4 years ago

One of the other SIMD instruction sets you may want to consider is RISC-V's vector extension, because it takes a totally different approach from most other common ISAs: it uses scalable vectors. The instructions are designed to operate on vectors whose element count can vary at runtime (limited to a processor-specific maximum length), so you could use an f32x47, an f32x2, or an f32x10, all adjustable at runtime. The idea is that your code automatically slices really long vectors into chunks small enough to fit in the CPU's registers, so the program doesn't have to be recompiled to take advantage of a new CPU with bigger registers.
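To make that concrete, here is a minimal scalar sketch (in Rust, since that's what this crate targets) of the strip-mining pattern those instructions implement in hardware. `axpy_strip_mined` and `MAX_VL` are illustrative stand-ins of mine, not anything from a real API:

```rust
/// Scalar sketch of the strip-mining loop that RISC-V's vector
/// extension performs in hardware: each round, the code asks how many
/// elements the machine will take (the role of `vsetvli`) and processes
/// exactly that many, so the same binary adapts to any register width.
fn axpy_strip_mined(a: f32, xs: &[f32], ys: &mut [f32]) {
    assert_eq!(xs.len(), ys.len());

    // Stand-in for the hardware's maximum vector length; on a real
    // RVV core this value is discovered at runtime, not hard-coded.
    const MAX_VL: usize = 8;

    let mut i = 0;
    while i < xs.len() {
        // `vl` is the chunk size the "hardware" grants this iteration;
        // a wider future CPU simply grants a bigger `vl` and the loop
        // takes fewer trips, with no recompilation.
        let vl = MAX_VL.min(xs.len() - i);
        for j in i..i + vl {
            ys[j] += a * xs[j];
        }
        i += vl;
    }
}
```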

ARM has a similar scheme called SVE, except that the vector length is fixed for each CPU rather than dynamically adjustable.

I'm helping build an OpenPower (PowerPC) processor with a similar dynamically adjustable vector scheme as part of the open-source Libre-SOC project, where we are building a whole chip whose processor is a hybrid CPU and GPU (along with lots of other neat stuff). We have funding from NLNet and are creating some initial 180nm test chips in October 2020, with 28nm chips to be sold later.

vorner commented 4 years ago

Hello

I've taken the liberty of moving this issue to the library repo instead of the blog repo, as it seems a more appropriate place.

It's an interesting issue, because what I aim for is writing code once, in a portable way, and that is somewhat harder if some CPUs do things completely differently. However, I wonder what would happen on these CPUs if I simply used some large vector type, e.g. f64x16. Would the auto-vectorizer be able to simply emit an instruction with length 16? That would not be perfect (one could probably get better performance with larger chunks, though one wouldn't want them on CPUs with fixed-size vectors), but it would still be a significant speedup over no chunking at all.
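For what it's worth, this is roughly the kind of code I have in mind, written here with plain slices rather than the crate's vector types so it compiles anywhere; whether LLVM actually splits the 16-wide chunk sensibly on SVE/RVV targets is exactly the open question:

```rust
/// Hypothetical "just use a wide chunk" version: iterate in blocks of
/// 16 f64s and let the auto-vectorizer split each block across whatever
/// registers the target has (two AVX-512 ops, four AVX2 ops, ...).
fn scale(data: &mut [f64], factor: f64) {
    let mut chunks = data.chunks_exact_mut(16);
    for chunk in &mut chunks {
        // Constant trip count: the optimizer can fully unroll and
        // vectorize this inner loop to the native register width.
        for x in chunk.iter_mut() {
            *x *= factor;
        }
    }
    // Scalar tail for the elements that don't fill a whole chunk.
    for x in chunks.into_remainder() {
        *x *= factor;
    }
}
```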

In other words, I wonder if there's anything to be done here at all, or if the current approach would yield reasonable results as it is.

Do you have some way to check what happens? Testing on other architectures is something that would really be helpful.
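One way I can think of without owning the hardware is to compile a small kernel for the target and read the emitted assembly, e.g. on the Compiler Explorer (godbolt.org). The flags below are my best guess at the current spellings, not something I've verified against this crate:

```rust
// Something like (flags are a best guess; cross targets need their
// std installed via rustup, or just use godbolt):
//
//   rustc -O --emit=asm --target=riscv64gc-unknown-linux-gnu \
//       -C target-feature=+v kernel.rs
//   rustc -O --emit=asm --target=aarch64-unknown-linux-gnu \
//       -C target-feature=+sve kernel.rs
//
// Then look for `vsetvli` (RVV) or `z` registers / `whilelo` (SVE) to
// see whether the auto-vectorizer produced scalable-vector code.
pub fn sum(xs: &[f64]) -> f64 {
    // A deliberately simple kernel, easy for the vectorizer to handle.
    xs.iter().sum()
}
```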