These two functions are equivalent (on little-endian): (Godbolt link)
However, they compile differently:
This especially becomes a problem if we increase VEC_SIZE to 16:

These can have different performance characteristics depending on the machine. On Zen 2, vpermq has a latency of 6, whereas the rest of these instructions all have a latency of 1. Except, of course, there's also the vpaddw which has a memory operand, which I presume will be slower than not going to memory, assuming the compiler is right to prefer an identity vpcmpeqd to load all 1's in a vector, rather than using a memory operand for the same purpose.