#14826: Reimplement wzerorange

Ticket

https://github.com/tenstorrent/tt-metal/issues/14826

Problem description

The compiler recognizes that wzerorange is equivalent to memset and substitutes a memset call, leading to code bloat. The original patch caused a performance regression, presumably because memset is faster than wzerorange.

What's changed

1) Use the same ASM trick to hide wzerorange's equivalence to memset from the compiler.
2) Unroll the loop 4-fold, going from 1 write per 3 instructions to 1 write per 1.5 instructions.
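
For readers unfamiliar with the technique, here is a minimal sketch of the two changes, not the code from this PR. It assumes the "ASM trick" is an empty GCC/Clang inline-asm statement that makes the pointer opaque to the optimizer, and the name `wzerorange_sketch` is hypothetical. The instruction counts are presumably one store, one pointer increment and one branch per word in the simple loop (3 per write), versus four stores plus one increment and one branch in the unrolled loop (6 per 4 writes, or 1.5 per write).

```cpp
#include <cstdint>

// Hypothetical stand-in for wzerorange; zeroes the 32-bit words in [start, end).
// Assumes GCC/Clang extended inline asm. Not the tt-metal implementation.
inline void wzerorange_sketch(std::uint32_t* start, std::uint32_t* end) {
    // Main loop, unrolled 4-fold: four stores per increment-and-branch.
    while (end - start >= 4) {
        start[0] = 0;
        start[1] = 0;
        start[2] = 0;
        start[3] = 0;
        start += 4;
        // Empty asm that claims to read and modify `start`. The compiler can no
        // longer prove the loop is a contiguous fill, so it should not replace
        // the loop with a memset call.
        asm volatile("" : "+r"(start) : : "memory");
    }
    // Tail loop for the remaining 0 to 3 words, with the same barrier.
    while (start != end) {
        *start++ = 0;
        asm volatile("" : "+r"(start) : : "memory");
    }
}
```

Calling `wzerorange_sketch(buf, buf + n)` zeroes `n` words starting at `buf`. Whether a particular compiler version really keeps the loop should be confirmed by inspecting the generated assembly.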

Checklist

[x] Post commit CI passes
[ ] Blackhole Post commit (if applicable)
[ ] Model regression CI testing passes (if applicable)
[ ] Device performance regression CI testing passes (if applicable)
[ ] New/Existing tests provide coverage for changes

#14826: Reimplement wzerorange #15340