johanberntsson / ozmoo

A Z-machine interpreter for the Commodore 64 and similar computers
GNU General Public License v2.0

Screenkernal scrolling code optimizations #51

Closed: 0ldSkull closed this pull request 2 years ago

0ldSkull commented 2 years ago

I've taken techniques from the data movement part of my smooth-scrolling code and applied them to optimize the scrolling in screenkernal.asm. (This does not add smooth scrolling, but speeds up the regular jump-scrolling.)

Four phases of optimization are applied (described below), each in a separate commit in this PR. If preferred, I could submit each as a separate PR instead (though not all at once, since they affect the same code). The phases are not strictly dependent on one another.

Calculations and measurements assume a 40x25 screen with a one-line status line for reference, but the code is not limited to this configuration. These estimates use the minimum cycles for each instruction, except for the loop branch which is always taken.
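As a rough illustration of that estimating convention, here is a minimal Python sketch. The copy loop in it is hypothetical and is not taken from screenkernal.asm; only the per-instruction cycle counts are the standard 6502 minimums, with the loop branch counted as taken (3 cycles) on every iteration.

```python
# Hypothetical 6502 copy loop (NOT the actual screenkernal.asm code).
# Each entry is (instruction, minimum cycles); the branch is counted as
# always taken, matching the convention described above.
LOOP_BODY = [
    ("LDA src,X", 4),  # absolute,X load, no page crossing
    ("STA dst,X", 5),  # absolute,X store
    ("DEX", 2),
    ("BNE loop", 3),   # branch taken, same page
]

CYCLES_PER_ITERATION = sum(cycles for _, cycles in LOOP_BODY)


def loop_cycles(iterations):
    """Estimated cycles for the loop, counting every branch as taken."""
    return iterations * CYCLES_PER_ITERATION


# Copying one 40-column row with this hypothetical loop:
print(CYCLES_PER_ITERATION, loop_cycles(40))
```

Counting the final, not-taken branch as 3 cycles instead of 2 overestimates each loop by one cycle, which is negligible at this scale.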

The cumulative estimate using the above numbers is 3355 cycles saved. The net estimate from analyzing the final code state is 3407 cycles. In actual measurements using VICE (with interrupts disabled and waiting for a specific raster line for consistency), the difference was 3868 cycles, including cycles lost to VIC bad lines. Excluding bad lines, it should be about 3567 execution cycles. The interpreter size increased by 98 bytes.

johanberntsson commented 2 years ago

Thanks for the PR. From what we understand, this saves ~1/10 s when scrolling the screen a full 25 lines. This may not be worth the cost of 100 bytes to us.
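The ~1/10 s figure can be reproduced with a quick calculation. This is only a sketch: it assumes the measured 3868-cycle saving applies to each one-line scroll, that a full screen takes 25 such scrolls, and it uses the nominal C64 CPU clock rates (about 985,248 Hz for PAL and 1,022,727 Hz for NTSC).

```python
# Nominal C64 CPU clock rates in Hz.
PAL_CLOCK_HZ = 985_248
NTSC_CLOCK_HZ = 1_022_727

CYCLES_SAVED_PER_SCROLL = 3868  # measured saving quoted in this thread
LINES_SCROLLED = 25             # assumption: one scroll per screen line


def seconds_saved(cycles_per_scroll, scrolls, clock_hz):
    """Wall-clock time saved over a number of one-line scrolls."""
    return cycles_per_scroll * scrolls / clock_hz


for name, hz in (("PAL", PAL_CLOCK_HZ), ("NTSC", NTSC_CLOCK_HZ)):
    saved = seconds_saved(CYCLES_SAVED_PER_SCROLL, LINES_SCROLLED, hz)
    print(f"{name}: {saved:.3f} s")
```

Both results come out a little under a tenth of a second.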

With every change or addition we make, we weigh the benefit against the cost, and to a large extent, the cost is measured in bytes. We’re using about 62.5 KB of the RAM in the C64, to make sure the interpreter needs to load game data from disk as seldom as possible. We also have a limitation that only RAM up to $cfff can be used for dynamic memory - the portion of a game file that works as RAM rather than ROM. As the interpreter becomes bigger, it sometimes passes another 512-byte boundary, giving us 512 bytes less for game data in total, as well as 512 bytes less for dynamic memory. This limits which games are playable on the C64.
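To make the boundary effect concrete, here is a small Python sketch. The interpreter sizes are hypothetical and chosen only to show both outcomes, and the model (space consumed in whole 512-byte blocks) is a simplification of the behavior described above, not the actual build logic.

```python
BLOCK_SIZE = 512  # granularity mentioned above


def blocks_occupied(size_bytes):
    """Number of 512-byte blocks needed to hold size_bytes (rounded up)."""
    return (size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE


def game_data_lost(old_size, growth):
    """Bytes of game-data space lost when the interpreter grows by `growth`."""
    extra_blocks = blocks_occupied(old_size + growth) - blocks_occupied(old_size)
    return extra_blocks * BLOCK_SIZE


# Hypothetical interpreter sizes; 98 is the growth reported for this PR.
print(game_data_lost(9000, 98))  # stays within the same block
print(game_data_lost(9200, 98))  # crosses a 512-byte boundary
```

Under this model, the same 98-byte growth costs either nothing or a full 512 bytes, depending on how close the interpreter already is to the next boundary.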

0ldSkull commented 2 years ago

That is a fair consideration, and I appreciate the additional information. Reviewing my timing measurements again, I think these optimizations are probably not strictly required for smooth scrolling either. Outside of Ozmoo, it was desirable to have this part of the code be as fast as possible, but it doesn't look as critical with the way I'm currently scheduling the movement and scrolling activities.

Here is some detail for how each change affects the cycles and bytes saved/gained:

opt cycles  bytes
 1    -904   +35
 2    -503   -19
 3   -1500   +48
 4    -448   +34

Optimization 2 may be worthwhile in terms of cost, since it saves both cycles and memory, though the gain is fairly small. For 3 applied independently, the values would probably be about half of the incremental effects shown. (4 is applicable only after 1.)
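As a sanity check, the per-optimization figures in the table can be summed and compared with the totals quoted earlier in the thread (3355 cycles saved, 98 bytes added). A short Python sketch, using only the numbers from the table:

```python
# (cycles, bytes) deltas per optimization, copied from the table above.
PHASES = {
    1: (-904, +35),
    2: (-503, -19),
    3: (-1500, +48),
    4: (-448, +34),
}

total_cycles = sum(cycles for cycles, _ in PHASES.values())
total_bytes = sum(size for _, size in PHASES.values())

print(total_cycles, total_bytes)
```

The sums agree with the cumulative estimate and with the reported size increase, and optimization 2 is the only row where both deltas are negative.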

If you want to consider e.g. just the second optimization independently then I can do some further testing and evaluation with that. Or we can drop it if that is best, and I can still work on smooth scrolling without the additional optimized data movement.

0ldSkull commented 2 years ago

Closing based on discussion.