I've been doing a bit of profiling and I note that the check for z_exe_mode at .main_loop takes 5 cycles for every Z-machine instruction executed. Since z_exe_mode is nearly always 0, we can instead use self-modifying code to make the "jmp .main_loop" instruction skip that check most of the time. Writes to z_exe_mode are handled by a subroutine which patches that jmp instruction accordingly.
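To make the idea concrete, here is a rough sketch of the patching scheme in ACME syntax. It is illustrative only and is not the code in this pull request: the labels `.main_loop_normal_exe_mode`, `.not_normal_exe_mode`, `.jmp_main_loop` and `set_z_exe_mode` are hypothetical names, and it assumes z_exe_mode lives in zero page and that the check is an lda followed by a branch that is not taken in the common case.

```
; Sketch only: hypothetical labels, not the actual patch.
.main_loop
    lda z_exe_mode                  ; 3 cycles (assuming zero page)
    bne .not_normal_exe_mode        ; 2 cycles when not taken (existing handler, not shown)
.main_loop_normal_exe_mode
    ; ... fetch and execute one Z-machine instruction ...
.jmp_main_loop
    jmp .main_loop_normal_exe_mode  ; operand is rewritten by set_z_exe_mode

; Every write to z_exe_mode goes through this routine.
; Entry: A = new z_exe_mode value. A is corrupted on exit.
set_z_exe_mode
    sta z_exe_mode
    cmp #0                          ; sta does not set flags, so test A explicitly
    beq +
    ; Non-zero mode: loop back via the full check.
    lda #<.main_loop
    sta .jmp_main_loop + 1
    lda #>.main_loop
    sta .jmp_main_loop + 2
    rts
+   ; Mode 0 (the common case): loop back past the check.
    lda #<.main_loop_normal_exe_mode
    sta .jmp_main_loop + 1
    lda #>.main_loop_normal_exe_mode
    sta .jmp_main_loop + 2
    rts
```

The cost moves from every instruction to the rare writes of z_exe_mode, which is why all such writes need to go through the subroutine rather than storing to z_exe_mode directly.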
This isn't a huge saving - using https://virtualconsoles.com/online-emulators/c64/ I am seeing a saving of about 0.5% on the benchmark execution time. On the other hand, the change is relatively unintrusive and doesn't complicate the code too much. I also suspect the benchmark on that emulator is dominated by emulated disk access, and that running with an REU (or on more CPU-bound games) would show a larger percentage reduction.
I wouldn't blame you if you didn't want to take this change, but I thought I'd at least offer it. It's only been lightly tested, so if you like the idea but it seems buggy, please let me know.
(I am seeing 1.5% worse performance on the benchmark on my Acorn port with Ozmoo 5.3 compared to Ozmoo 4.4. I understand why it's slower and it's really a very modest slowdown which no one is likely to notice in practice - you've done a great job adding flexibility without a huge performance hit - but I would like to find a way to claw back that performance if I can. :-) )
PS If the TIMING and PRINT_SPEED checks at .main_loop could be moved to after the z_exe_mode check - which might well be acceptable, but I didn't feel too comfortable changing this - this pull request could be simplified as we'd no longer need the COMPLEX_MAIN_LOOP case.