Open fkuehnel opened 3 years ago
I dug into the details a bit further. The generated code seems optimal when the element type in arr []uint8 is changed back to uint32 or uint64.
The culprits are the rewrite rules in ARM64.rules, which state:
(Less8U x y) => (LessThanU (CMPW (ZeroExt8to32 x) (ZeroExt8to32 y)))
(Less16U x y) => (LessThanU (CMPW (ZeroExt16to32 x) (ZeroExt16to32 y)))
(Less32U x y) => (LessThanU (CMPW x y))
(Less64U x y) => (LessThanU (CMP x y))
Since x and y are each already loaded from memory as a single byte, and such loads zero-extend by default, those zero-extensions introduce superfluous register moves.
Any ideas how to fix this?
Maybe we can add rules of the form (Less8U x:(MOVBUload _ _) y:(MOVBUload _ _)) => (LessThanU (CMPW x y)) to absorb the zero-extensions and fix this case.
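For reference, here is a minimal sketch of source that should hit this pattern. It is not taken from lomuto.go; `lessAt` is a hypothetical helper, and the assumption is that both operands reach the comparison straight from byte loads of a `[]uint8`, so the unsigned 8-bit compare goes through the Less8U rule quoted above.

```go
package main

import "fmt"

// lessAt is a hypothetical helper, not code from the issue's lomuto.go.
// Both operands are read from a []uint8, so the comparison is an
// unsigned 8-bit "less than" on values that come from byte loads.
//
//go:noinline
func lessAt(arr []uint8, i, j int) bool {
	return arr[i] < arr[j]
}

func main() {
	arr := []uint8{200, 17, 255}
	fmt.Println(lessAt(arr, 1, 0)) // 17 < 200
	fmt.Println(lessAt(arr, 2, 0)) // 255 < 200
}
```

Building this with GOARCH=arm64 and -gcflags -S should make it possible to check whether the extra moves show up around the CMPW for this function as well.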
There are also many other rules that generate zero-extensions, for example (Div16u x y) => (UDIVW (ZeroExt16to32 x) (ZeroExt16to32 y)). For this rule, x or y may likewise already be loaded from memory as a half-word, in which case they are already zero-extended.
If we add rewrite rules like the one above to fix these problems, we would need too many of them. Any other good ideas for fixing them? Thank you. @randall77 @cherrymui
BTW, if @fkuehnel needs efficient assembly code for this case soon, I can submit a patch that adds the above rewrite rules to fix it.
I had hoped that tight-loop code inefficiencies would have been addressed by now. With Go 1.19.2 we still get 3 extra instructions injected into the ARM64 code, because the code generator has no way of knowing that loading an 8/16-bit byte/half-word zeroes the rest of the Rx register...
good compile with 64/32 bit word, double word:
0x0030 00048 (lomuto.go:39) MOVD (R0)(R4<<3), R2
...
0x004c 00076 (lomuto.go:43) MOVD (R0)(R3<<3), R7
0x0050 00080 (lomuto.go:45) CMP R2, R7
bad compile with 8/16 bit byte/half-word (3 unnecessary operations):
0x0030 00048 (lomuto.go:39) MOVBU (R0)(R4), R2
...
0x004c 00076 (lomuto.go:43) MOVBU (R0)(R3), R7
0x0050 00080 (lomuto.go:45) MOVD R7, R8
0x0054 00084 (lomuto.go:45) MOVD R2, R9
0x0058 00088 (lomuto.go:45) CMPW R8, R2
...
0x0078 00120 (lomuto.go:45) MOVD R9, R2
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
What did you do?
Compile this code with: go build -gcflags -S lomuto.go
https://play.golang.org/p/F8fPWbzvDRO
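The playground source is not reproduced in this thread. As a rough stand-in, a Lomuto partition over a narrow unsigned element type might look like the sketch below. The names `elem` and `lomutoPartition` are assumptions made for illustration, not the code behind the link. Changing `elem` between uint8, uint16, uint32 and uint64 is one way to compare the inner loops the compiler generates.

```go
package main

import "fmt"

// elem is the element type under test (an assumption for this sketch).
// Switch it to uint16, uint32 or uint64 to compare the generated loops.
type elem = uint8

// lomutoPartition partitions arr around its last element using the
// Lomuto scheme and returns the pivot's final index, or -1 if arr is empty.
func lomutoPartition(arr []elem) int {
	if len(arr) == 0 {
		return -1
	}
	hi := len(arr) - 1
	pivot := arr[hi]
	i := 0
	for j := 0; j < hi; j++ {
		// This unsigned comparison is the hot operation in the loop.
		if arr[j] < pivot {
			arr[i], arr[j] = arr[j], arr[i]
			i++
		}
	}
	arr[i], arr[hi] = arr[hi], arr[i]
	return i
}

func main() {
	arr := []elem{9, 3, 7, 1, 8, 5}
	p := lomutoPartition(arr)
	fmt.Println(p, arr)
}
```

With a file like this saved as lomuto.go, the command above prints the assembly for the loop, so the byte and word variants can be compared side by side.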
What did you expect to see?
With clang -O3 -S I see a tight inner loop between LBB0_2 and LBB0_5 with very few instructions.
What did you see instead?
I see excessive register usage and many more instructions between address 64 and 124