Closed 0ldSkull closed 2 years ago
Thanks for the PR. From what we understand, this saves ~1/10 s when scrolling the screen a full 25 lines. This may not be worth the cost of 100 bytes to us.
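For reference, a figure of roughly a tenth of a second can be reproduced from the numbers in the PR description. This is only a rough sketch, not necessarily how the estimate was made: it assumes the measured saving of 3868 cycles applies to every one-line scroll, and it uses the nominal C64 CPU clock rates (about 0.985 MHz for PAL, about 1.023 MHz for NTSC).

```python
# Rough estimate of wall-clock time saved when scrolling a full screen.
# Assumption: the measured saving of 3868 cycles (from the PR description)
# applies to each one-line scroll.
CYCLES_SAVED_PER_SCROLL = 3868
LINES_SCROLLED = 25

# Nominal C64 CPU clock rates in Hz.
CLOCK_HZ = {"PAL": 985_248, "NTSC": 1_022_727}


def seconds_saved(clock_hz, lines=LINES_SCROLLED):
    """Seconds saved over `lines` consecutive one-line scrolls."""
    return CYCLES_SAVED_PER_SCROLL * lines / clock_hz


for system, hz in CLOCK_HZ.items():
    print(f"{system}: {seconds_saved(hz) * 1000:.1f} ms saved")
```

Both clock rates give a value a little under 100 ms.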
With every change or addition we make, we weigh the benefit against the cost, and to a large extent the cost is measured in bytes. We’re using about 62.5 KB of the RAM in the C64 so that the interpreter needs to load game data from disk as seldom as possible. We also have a limitation that only RAM up to $cfff can be used for dynamic memory, the portion of a game file that works as RAM rather than ROM. As the interpreter grows, it sometimes crosses another 512-byte boundary, giving us 512 bytes less for game data in total, as well as 512 bytes less for dynamic memory. This limits which games are playable on the C64.
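The 512-byte granularity means a small size increase costs either nothing or a whole block. A minimal sketch of that effect follows, assuming for illustration that the interpreter starts on a block boundary. The interpreter sizes used here are made up and are not Ozmoo's real figures.

```python
# Hypothetical illustration of the 512-byte granularity described above.
# Assumption: the interpreter starts on a 512-byte boundary, so the space
# it occupies is its size rounded up to whole blocks.
BLOCK_SIZE = 512


def blocks_needed(interpreter_bytes):
    """Number of whole 512-byte blocks the interpreter occupies."""
    return -(-interpreter_bytes // BLOCK_SIZE)  # ceiling division


def game_data_bytes_lost(old_size, new_size):
    """Bytes no longer available for game data after the interpreter grows."""
    return (blocks_needed(new_size) - blocks_needed(old_size)) * BLOCK_SIZE


# Made-up interpreter sizes, each growing by 98 bytes.
print(game_data_bytes_lost(10_000, 10_098))  # stays within the same block
print(game_data_bytes_lost(10_200, 10_298))  # crosses a block boundary
```

With these made-up sizes, the same 98-byte growth costs nothing in the first case and a full 512 bytes in the second.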
That is a fair consideration, and I appreciate the additional information. In reviewing my timing measurements again, I think these optimizations are probably not strictly required for smooth scrolling either. Outside of Ozmoo, it was desirable to have this part of the code be as fast as possible, but it doesn't look as critical with the way I'm currently scheduling the movement and scrolling activities.
Here is some detail for how each change affects the cycles and bytes saved/gained:
opt cycles bytes
1 -904 +35
2 -503 -19
3 -1500 +48
4 -448 +34
Optimization 2 could be worthwhile in terms of cost, since it saves both cycles and memory, though the gain is fairly small. Applied independently, 3 would probably show about half of the incremental effects listed. (4 is applicable only after 1.)
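To compare the trade-offs at a glance, the table can be reduced to totals and a cycles-per-byte ratio. This is just a sketch over the incremental figures listed above, which apply when the changes are made in commit order. The names in it are illustrative.

```python
# (cycles saved, bytes added) per optimization, taken from the table above.
# A negative byte count means the change makes the interpreter smaller.
OPTS = {1: (904, 35), 2: (503, -19), 3: (1500, 48), 4: (448, 34)}

total_cycles = sum(cycles for cycles, _ in OPTS.values())
total_bytes = sum(size for _, size in OPTS.values())
print(f"total: {total_cycles} cycles saved, {total_bytes} bytes added")

for number, (cycles, size) in OPTS.items():
    if size > 0:
        print(f"opt {number}: {cycles / size:.1f} cycles saved per byte added")
    else:
        print(f"opt {number}: saves {cycles} cycles and {-size} bytes")
```

The totals agree with the cumulative estimate of 3355 cycles and the 98-byte size increase reported in the PR description.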
If you want to consider e.g. just the second optimization independently then I can do some further testing and evaluation with that. Or we can drop it if that is best, and I can still work on smooth scrolling without the additional optimized data movement.
Closing based on discussion.
I've taken techniques from the data movement part of my smooth-scrolling code and applied them to optimize the scrolling in screenkernal.asm. (This does not add smooth scrolling, but speeds up the regular jump-scrolling.)
Four phases of optimization are applied (described below). Each phase of optimization is represented in a separate commit in this PR. If preferred I could submit each as a separate PR instead (though not all at once since they affect the same code). The phases are not strictly dependent on one another.
Calculations and measurements assume a 40x25 screen with a one-line status line for reference, but the code is not limited to this configuration. These estimates use the minimum cycles for each instruction, except for the loop branch which is always taken.
1. Move two characters per inner (column) loop iteration. The loop branch (bpl) costs 3 cycles. Using half the iterations saves 60 cycles per row, at the cost of updating the additional addresses once per row.
   save: 1380 cycles
   cost: 460 cycles (2 more sta+inc per row)
   cost: 16 cycles (additional initialization)
   net: -904 cycles
2. Assume that screen (SCREEN_ADDRESS) and color (COLOUR_ADDRESS) memory both start on page boundaries. This means that the low byte value is always the same for corresponding screen and color addresses, allowing for some consolidation of updating color addresses.
   save: 506 cycles (22 per row)
   save: 3 cycles (initialization)
   cost: 0 cycles
   net: -503 cycles
3. Use STA $nnnn,Y instead of STA ($nn),Y. The zero page indirect indexed instruction costs an additional clock cycle. The change saves 80 cycles per row, at the cost of having to update the sta addresses in the outer loop.
   save: 1840 cycles (1*4*20 per row)
   cost: 299 cycles (40 per row for additional lda/sta, less 27/row from zp updates)
   cost: 41 cycles (additional initialization)
   net: -1500 cycles
4. Eliminate the additional dey (2 cycles) in the inner loop. The first optimization moved two consecutive characters during each iteration, doubling the loop work to cut the iterations in half. By dividing the line in half and handling the addresses separately, the extra decrement can be removed at the cost of additional address maintenance in the outer loop.
   save: 920 cycles (2*20 per row)
   cost: 460 cycles (20 per row)
   cost: 12 cycles (initialization)
   net: -448 cycles

The cumulative estimate using the above numbers is 3355 cycles saved. The net estimate of analyzing the final code state is 3407 cycles. In actual measurements using VICE (disabling interrupts and waiting for a specific raster line to get consistency) the difference was 3868 cycles, including VIC bad lines. Accounting for bad lines it should be about 3567 execution cycles. The interpreter size increased 98 bytes.
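The per-phase nets can be re-derived from the per-row figures. The sketch below assumes 23 row copies per scroll, an inference from the stated 40x25 screen with a one-line status line; the totals above are consistent with it (for example 1380 / 60 = 23). For phase 2, the 3-cycle initialization term is entered as a cost here, because that is what reproduces the stated net of 503; the breakdown itself labels it a saving.

```python
# Re-derive each phase's net saving from its per-row figures.
# Assumption: 23 row copies per scroll (25 rows, one status line, so 24
# scrolling rows and 23 copies). This is inferred, not stated in the PR.
ROWS_MOVED = 23

# phase: (saving per row, cost per row, initialization cost, stated net)
PHASES = {
    1: (60, 20, 16, 904),
    2: (22, 0, 3, 503),         # init term treated as a cost (see lead-in)
    3: (80, 40 - 27, 41, 1500),
    4: (2 * 20, 20, 12, 448),
}


def net_saving(save_per_row, cost_per_row, init_cost, rows=ROWS_MOVED):
    """Net cycles saved per scroll for one phase."""
    return (save_per_row - cost_per_row) * rows - init_cost


for phase, (save, cost, init, stated) in PHASES.items():
    print(f"phase {phase}: derived {net_saving(save, cost, init)}, stated {stated}")
```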