Problem description
The compiler spots that the wzerorange loop is equivalent to memset and substitutes a call to the latter, leading to code bloat. The original patch caused a performance regression, presumably because memset is faster than wzerorange.
What's changed
1) Use the same asm trick to hide wzerorange's equivalence to memset from the compiler
2) Unroll the loop four-fold, going from 1 write per 3 instructions to 1 write per 1.5 instructions
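For reviewers unfamiliar with the technique, the two changes above can be sketched roughly as follows. This is an illustrative host-side sketch, not the code in this PR: the name `zero_words_sketch` is hypothetical, and the empty `asm` statement with a `"+r"` constraint (GCC/Clang extended asm) is just one common way to make the pointer update opaque so that the optimizer's memset idiom recognition should no longer match the loop. The instruction counts presumably come from one store, one pointer increment and one branch per word before (3 instructions per write), versus four stores sharing one increment and one branch after (6 instructions per 4 writes).

```cpp
#include <cstdint>

// Hypothetical stand-in for wzerorange; assumes start <= end and that both
// pointers are word-aligned and delimit the same array.
inline void zero_words_sketch(std::uint32_t* start, std::uint32_t* end) {
    // Main loop, unrolled four-fold: four word stores per iteration.
    while (end - start >= 4) {
        start[0] = 0;
        start[1] = 0;
        start[2] = 0;
        start[3] = 0;
        start += 4;
        // Empty asm that claims to read and modify `start`. It emits no
        // instructions, but the compiler can no longer prove the pointer
        // advances by a fixed stride, which loop-to-memset conversion needs.
        asm volatile("" : "+r"(start));
    }
    // Residue: 0 to 3 remaining words, zeroed one at a time.
    while (start != end) {
        *start++ = 0;
        asm volatile("" : "+r"(start));
    }
}
```

Whether a given compiler version still emits a memset call for a loop like this is best confirmed by inspecting the generated assembly.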
Checklist
[YES] Post commit CI passes
[ ] Blackhole Post commit (if applicable)
[ ] Model regression CI testing passes (if applicable)
[ ] Device performance regression CI testing passes (if applicable)
[ ] New/Existing tests provide coverage for changes
Ticket
https://github.com/tenstorrent/tt-metal/issues/14826