Open FiloSottile opened 3 years ago
Yeah, this is the general problem where currently the compiler doesn't reorder loads and stores. For example, in this case, one load of v.l1 is before the store of v.l0, and another load of v.l1 is after. The compiler doesn't have alias analysis and doesn't know the store of v.l0 won't change v.l1, so it loads again. This also makes it hard to use LDP and STP.
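To make the load/store/load pattern concrete, here is a minimal sketch in Go. The `elem` type, the `mask51` constant, and both functions are hypothetical stand-ins, not the code from the linked pull request. `carry` reads `v.l1`, stores to `v.l0`, and then reads `v.l1` again; `carryLocal` copies the fields into locals first, which makes the independence explicit in the source. Whether this changes the generated code depends on the compiler version and is not verified here.

```go
package main

import "fmt"

// elem is a hypothetical struct with two adjacent 64-bit fields.
type elem struct {
	l0, l1 uint64
}

// mask51 keeps the low 51 bits of a limb (an assumption for this sketch).
const mask51 = (1 << 51) - 1

// carry reads v.l1 both before and after the store to v.l0.
// Without alias information, a compiler must assume the store
// could have changed v.l1 and load it a second time.
func carry(v *elem) {
	c0 := v.l0 >> 51
	c1 := v.l1 >> 51           // first load of v.l1
	v.l0 = v.l0&mask51 + c1*19 // store to v.l0
	v.l1 = v.l1&mask51 + c0    // v.l1 is read again after the store
}

// carryLocal computes the same result, but reads each field
// exactly once into a local before any store happens.
func carryLocal(v *elem) {
	l0, l1 := v.l0, v.l1
	c0 := l0 >> 51
	c1 := l1 >> 51
	v.l0 = l0&mask51 + c1*19
	v.l1 = l1&mask51 + c0
}

func main() {
	a := elem{l0: 1<<51 | 5, l1: 2<<51 | 7}
	b := a
	carry(&a)
	carryLocal(&b)
	fmt.Println(a == b, a.l0, a.l1)
}
```

Both functions are semantically identical; the difference is only in how many memory reads the source spells out.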
Do we have plans to enable alias analysis? Thank you.
No current plans. Alias analysis tends to be expensive in compile time. We'd want something that is quick but accurate enough to be useful. I don't think anyone has an idea how to do that yet.
Perhaps a simple implementation with narrower coverage would not be very expensive (I haven't done any experiments; this is just a feeling). In the case above, for example, we can easily determine that there is no dependency between the write of v.l0 and the second load of v.l1. I wonder if it makes sense to do so?
> Alias analysis tends to be expensive in compile time.
There are the obvious/trivial ones: SP offsets with non-overlapping extents do not alias, SP does not alias SB. These matter less in a reg ABI but are very cheap and potentially offer some wins.
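As a rough illustration of how cheap these trivial rules are, the sketch below models a memory access as a base, an offset, and a size. All of the names (`base`, `access`, `mayAlias`) are hypothetical and are not the Go compiler's internal API. It answers "may alias" conservatively unless one of the two rules applies: different known bases (SP versus SB), or the same base with non-overlapping extents.

```go
package main

import "fmt"

// base identifies what an address is computed from (hypothetical model).
type base int

const (
	baseSP    base = iota // stack pointer
	baseSB                // static base (globals)
	baseOther             // anything else: unknown pointer
)

// access describes size bytes at offset off from a base.
type access struct {
	base base
	off  int64
	size int64
}

// mayAlias reports whether two accesses might touch the same memory.
// It returns true (the conservative answer) unless a trivial rule
// proves the accesses are disjoint.
func mayAlias(a, b access) bool {
	if a.base == baseOther || b.base == baseOther {
		return true // unknown pointer: assume the worst
	}
	if a.base != b.base {
		return false // SP does not alias SB
	}
	// Same base: disjoint if one extent ends before the other begins.
	if a.off+a.size <= b.off || b.off+b.size <= a.off {
		return false
	}
	return true
}

func main() {
	fmt.Println(mayAlias(access{baseSP, 0, 8}, access{baseSP, 8, 8}))
	fmt.Println(mayAlias(access{baseSP, 0, 8}, access{baseSP, 4, 8}))
	fmt.Println(mayAlias(access{baseSP, 0, 8}, access{baseSB, 0, 8}))
}
```

The check is a handful of comparisons per pair of accesses, which is why rules of this kind are cheap; the open question in the thread is how much they would win in practice.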
Maybe also the GCC SRA pass could be used as inspiration for work in this area: https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=jambor.pdf
Even in trivial cases ldp and stp aren't emitted: https://godbolt.org/z/fdsEYr1h4
What version of Go are you using (go version)?
What did you do?
Compiled this function.
Full codebase at https://github.com/FiloSottile/edwards25519/pull/8
What did you expect to see?
What did you see instead?
The compiler figures out the same AND, ADD, and LSR+MADD that my hand-written assembly uses, but note how it loads the inputs twice from memory and looks like it doesn't know about STP and LDP.
I'm not sure which part has the most effect, but I got a 10% speedup on some high-level functions (although not on microbenchmarks of thinner functions) between my assembly and the compiler output with go:noinline. (Interestingly, if I let the compiler inline, the high-level functions get even slower, while the thin ones get faster.)