Here are some shuffles that I regularly find myself using. Please consider them as possible additions:
Lane shift (e.g. a left-shift of 2 lanes turns |x1 x2 ... xN| into |x3 x4 ... xN P P| where P is a user-chosen padding scalar).
This can be implemented as a Swizzle2 with a broadcast scalar, which is probably optimal on x86 before AVX-512. AVX-512 can do better with masking when P = 0 and the shift amount is known at compile time, which is a common special case. I don't know what the situation is on other SIMD instruction sets.
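To pin down the semantics I have in mind, here is a minimal scalar sketch on plain arrays. The name `shift_lanes_left` and its signature are hypothetical, not an existing API of the crate, and the loop is only a reference for the expected result, not a suggestion for how the SIMD version should be implemented.

```rust
/// Scalar reference for a left lane shift (hypothetical name and signature).
/// Lanes move toward index 0 by `shift` positions; vacated lanes get `pad`.
fn shift_lanes_left<T: Copy, const N: usize>(v: [T; N], shift: usize, pad: T) -> [T; N] {
    // Start from an all-padding vector, then copy over the surviving lanes.
    let mut out = [pad; N];
    // If shift >= N the range is empty and the result is all padding.
    for i in shift..N {
        out[i - shift] = v[i];
    }
    out
}
```

With this definition, shifting `[1, 2, 3, 4, 5, 6, 7, 8]` left by 2 with `pad = 0` gives `[3, 4, 5, 6, 7, 8, 0, 0]`, matching the |x3 x4 ... xN P P| picture above.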
Transpose (turn a set of N vectors |m11 m12 ... m1N|, |m21 m22 ... m2N|, ..., |mN1 mN2 ... mNN| into another set of N vectors |m11 m21 ... mN1|, |m12 m22 ... mN2|, ..., |m1N m2N ... mNN|).
On x86, doing this efficiently requires non-obvious, instruction-set-dependent code. Since this is a relatively common need, I think a dedicated operation at the root of the crate could make sense.
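For reference, this is the result I would expect from such an operation, written as a scalar sketch over an array of N arrays standing in for N vectors of N lanes. The name `transpose` and the signature are assumptions for illustration only; an efficient version would use target-specific shuffles rather than this element-by-element copy.

```rust
/// Scalar reference for an N x N transpose (hypothetical name and signature).
/// Output vector `i` collects lane `i` of every input vector.
fn transpose<T: Copy, const N: usize>(m: [[T; N]; N]) -> [[T; N]; N] {
    std::array::from_fn(|i| std::array::from_fn(|j| m[j][i]))
}
```

For example, `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` becomes `[[1, 4, 7], [2, 5, 8], [3, 6, 9]]`.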