Open mapleFU opened 1 month ago
Yeah, we often see this in Rust too: short but variable-length memcpys don't optimize well.
https://rust.godbolt.org/z/4P9xxeYsc
With AVX it should be possible to do masked moves for multiple-of-word-size types. On AVX512 it would even work for byte-sized ones.
This frequently comes up when attempting to vectorize code by chunking slices into arrays. The variable-length tail then ends up being a small but variable-length copy.
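For illustration, here is a minimal C++ sketch of that chunk-plus-tail pattern. The function name `SumChunked` and the chunk width `kLanes` are hypothetical and not taken from Arrow or arrow-rs. The main loop copies a fixed number of bytes per chunk, which compilers usually handle well, while the tail needs a short copy whose length is only known at run time.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical chunk width, chosen only for illustration.
constexpr std::size_t kLanes = 8;

// Sums `len` values by processing fixed-size chunks, then a zero-padded tail.
// Assumes `data` is non-null.
inline std::uint64_t SumChunked(const std::uint32_t* data, std::size_t len) {
  std::uint64_t total = 0;
  std::size_t i = 0;

  // Full chunks: the copy size is a compile-time constant.
  for (; i + kLanes <= len; i += kLanes) {
    std::array<std::uint32_t, kLanes> chunk;
    std::memcpy(chunk.data(), data + i, sizeof(chunk));
    for (std::uint32_t v : chunk) total += v;
  }

  // Tail: a short copy of 0..kLanes-1 elements with a run-time length.
  // This is the kind of memcpy the comment above is about.
  std::array<std::uint32_t, kLanes> tail{};  // zero-initialized padding
  std::memcpy(tail.data(), data + i, (len - i) * sizeof(std::uint32_t));
  for (std::uint32_t v : tail) total += v;

  return total;
}
```

Whether the tail copy is lowered to inline moves or to a library call depends on the compiler and target, so checking the generated assembly is the only reliable way to tell.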
The code in question is from Apache Arrow C++ [1]. arrow-rs shows a similar phenomenon [2].
In short, when the size is guaranteed to be less than or equal to `12`, gcc inlines the `memcpy` and `memset`, but clang does not. See the godbolt link [3]. The problem still exists when `-ffreestanding` is enabled.

Is this a problem? If it can be fixed with compiler flags, which flags should I use?
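As a rough illustration of the pattern being described, here is a minimal sketch. It is not the actual Arrow code from [1]. The names `FillInline` and `kInlineSize` are hypothetical, and using `__builtin_unreachable()` to state the size bound is an assumption about how such a guarantee might be expressed to the compiler.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical name; 12 matches the bound mentioned in the issue.
constexpr std::size_t kInlineSize = 12;

// Copies `size` bytes (size <= 12) into a 12-byte buffer and zero-fills the rest.
inline void FillInline(std::uint8_t* out, const std::uint8_t* data,
                       std::size_t size) {
#if defined(__GNUC__) || defined(__clang__)
  // Assumption: tell the optimizer that the bound always holds.
  if (size > kInlineSize) __builtin_unreachable();
#endif
  std::memcpy(out, data, size);                     // short, variable-length copy
  std::memset(out + size, 0, kInlineSize - size);   // pad the remainder with zeros
}
```

Comparing the assembly that gcc and clang emit for a function of this shape shows whether the two calls are inlined or left as calls into libc.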
[1] https://github.com/apache/arrow/blob/63b34c97c5d3ca6d20dacb9e92b404986f1d7d62/cpp/src/arrow/util/binary_view_util.h#L28
[2] https://github.com/apache/arrow-rs/issues/6034
[3] https://godbolt.org/z/47T8s69xK