llvm-mos / llvm-mos

Port of LLVM to the MOS 6502 and related processors
Other
428 stars 46 forks source link

Loop Index Optimization to avoid cmp #343

Open jroweboy opened 1 year ago

jroweboy commented 1 year ago

A common 6502 loop optimization is to avoid the compare on the loop iterator when doing a loop that doesn't use the iterator for indirect addressing. Two ways that this optimization can be implemented is to either change the iterator to count down, or offset all of the absolute address accesses by the final loop counter.

Here's the standard copy loop as generated by llvm-mos today:

constexpr int kNumEntities = 20;
char screen_x;
extern char x_pos[kNumEntities];
void loop() {
    for (int i = 0; i < kNumEntities; ++i) {
        x_pos[i] += screen_x;
    }
}
loop():                               ; @loop()
        ldx     #0
        ldy     screen_x
.LBB0_1:                                ; =>This Inner Loop Header: Depth=1
        tya
        clc
        adc     x_pos,x
        sta     x_pos,x
        inx
        cpx     #20
        bne     .LBB0_1
        rts

But with this optimization, the loop could become the following. The accesses to the addresses end up being in the same order, which makes it a pretty safe optimization in general.

loop():                               ; @loop()
        ldx     #256 - 20
        ldy     screen_x
.LBB0_1:                                ; =>This Inner Loop Header: Depth=1
        tya
        clc
        adc     x_pos - (256 - 20),x
        sta     x_pos - (256 - 20),x
        inx
        bne     .LBB0_1
        rts
asiekierka commented 1 year ago

It seems LLVM already has some machinery for making this possible, which is visible on zero-page pointers (where the base and index bitwidth is the same). With the latest batch of __zeropage patches, a slightly modified version of the above code:

constexpr int kNumEntities = 20;
char screen_x;
char x_pos[kNumEntities] __zeropage;
int main() {
    for (int i = 0; i < kNumEntities; ++i) {
        x_pos[i] += screen_x;
    }
}

compiles down to:

0000a01f <main>:
    a01f: a2 ec         ldx #$ec
    a021: ac 00 02      ldy $200                    ; 0x200 <screen_x>
    a024: 98            tya
    a025: 18            clc
    a026: 75 34         adc $34,x                   ; 0x34 <x_pos+0x14>
    a028: 95 34         sta $34,x                   ; 0x34 <x_pos+0x14>
    a02a: e8            inx
    a02b: d0 f7         bne $a024 <main+0x5>
    a02d: a2 00         ldx #$0
    a02f: 8a            txa
    a030: 60            rts

which is a perfect match for the suggested optimization.