segmentio / asm

Go library providing algorithms optimized to leverage the characteristics of modern CPUs
MIT No Attribution
869 stars 36 forks source link

Try manually assigning registers #68

Open pelletier opened 2 years ago

pelletier commented 2 years ago

https://github.com/segmentio/asm/pull/58#discussion_r778501347

mmcloughlin commented 2 years ago

Could you elaborate on the problem you've seen with the avo allocator, and consider filing an issue on the avo side?

pelletier commented 2 years ago

This issue comes from @chriso saying there may be dependencies being created between YMM registers that could be avoiding by manually picking registers: https://github.com/segmentio/asm/pull/58#discussion_r778501347.

I haven't looked closely at what dependencies exist today though, so I don't know if there is an actual problem. I'd be happy to open an issue on avo if I have something concrete! Did a quick scan of the register allocation, which looks like:

Y0:  incompleteMaskY
Y1:  continuation4BytesY
Y2:  continuation3BytesY
Y3:  nibble1Y
Y4:  nibble2Y
Y5:  nibble3Y
Y6:  nibbleMaskY
Y7:  msbMaskY
Y8:  previousBlockY     vp      off3    continuationBitsY
Y9:  errorY
Y10: incompletePreviousBlockY   previousY   off2    lowPrev         highCurr    tmpY
Y11: currentBlockY      tmp2
Y12: zeroes         blend       highPrev
Y13: shuffle
Y14: tmp1
// Y15?

I'm curious why Y15 wasn't used. At first glance I'm a little surprised to have registers like Y10 that are used for many things, but Y13 and Y14 that are only used on a short section of the loop.

mmcloughlin commented 2 years ago

Ah, okay, so it's not failing to find an allocation, but you think a different allocation would improve performance.

Regarding Y15, I think the algorithm as written will use the minimum number of registers it can, so that should be equal to the max number of live variables at any point. In your case that must be 15, so the highest used is Y14.

I can't remember the precise implementation, but it's picking an allocation from a pool of available registers. Perhaps changing this logic to LRU would approximate what you're looking for, and minimize dependencies. There's likely a ton of research on this that I should read up on. The original allocator was intended to be simple.