Open vfdff opened 1 year ago
vec mla1(vec v0, vec v1, int v2) { return v0 - v1 * v2; }
GCC generates fmov + mls:
    fmov s31, w0
    mls v0.4s, v1.4s, v31.s[0]
    ret
LLVM generates dup + mls:
    dup v2.4s, w0
    mls v0.4s, v2.4s, v1.4s
    ret
On some targets (I'm not sure whether this holds for all of them), the latency of a dup from a scalar register to a vector register is much higher than 1 cycle, while the latency of an fmov from a scalar register to a float register is usually 1 cycle, so GCC's assembly is better.
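For anyone reproducing the comparison, here is a minimal self-contained sketch of the test case. The typedef is an assumption: the report does not show how vec is declared, and four 32-bit lanes are inferred from the .4s arrangement in the assembly. mla1_ref and mla1_matches_ref are hypothetical helpers added only to pin down what both instruction sequences must compute.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the report omits the typedef; 4 x int32 matches the .4s lanes. */
typedef int32_t vec __attribute__((vector_size(16)));

/* Test case from the report: the scalar v2 is broadcast to every lane. */
vec mla1(vec v0, vec v1, int v2) { return v0 - v1 * v2; }

/* Hypothetical lane-by-lane reference: out[i] = v0[i] - v1[i] * v2. */
void mla1_ref(const int32_t v0[4], const int32_t v1[4], int v2, int32_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = v0[i] - v1[i] * v2;
}

/* Hypothetical helper: returns 1 when the vector form matches the reference. */
int mla1_matches_ref(void) {
    const int32_t a[4] = {10, 20, 30, 40};
    const int32_t b[4] = {1, 2, 3, 4};
    int32_t expected[4];
    vec va = {10, 20, 30, 40};
    vec vb = {1, 2, 3, 4};

    mla1_ref(a, b, 3, expected);
    assert(expected[0] == 7); /* 10 - 1 * 3 */

    vec r = mla1(va, vb, 3);
    for (int i = 0; i < 4; ++i)
        if (r[i] != expected[i])
            return 0;
    return 1;
}
```

Compiling mla1 alone at -O2 for an AArch64 target with each compiler should be enough to compare the fmov and dup sequences; the exact output may differ between compiler versions.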
@llvm/issue-subscribers-backend-aarch64
This may be similar to https://reviews.llvm.org/D126632
A similar case for mull: https://gcc.godbolt.org/z/Pzvon7Yqv