Romain-Piquois closed this issue 5 years ago.
@Romain-Piquois: It's definitely relevant on ARM. The most prevalent ARM GNU/Linux platform is by far the Raspberry Pi, where the KMSDRM-compatible VC4 driver gives the lowest possible latency on any ARM platform. People all around the world use the Raspberry Pi's ARM CPU to emulate the SNES, and Snes9x is by far the most popular SNES emulation core on ARM. So I would say it is as relevant as it can get :)
In a low-latency RetroArch configuration (max_swapchain=2), Snes9x 1.55 slows down in several places in games that use Mode 7. So this optimization, together with the SPC700 threading, would help a lot.
Romain, any optimization is good, even one that makes the game only 0.2% faster. Good work, by the way.
I'd say that's fine as long as readability and maintainability aren't sacrificed too much.
If it's not broken, don't fix it...
I looked over some of the optimizations, but the logic in most of them had changed: the bit operations didn't yield the same results.
Hi! Sorry to ask here, but I wanted to make sure I could reach somebody about this.
More than 10 years ago, I optimized Mode 7 for the PSP version. (My friend and I asked the Snes9x author at the time which version we should use for our target CPU, and I recall we used something like version 1.35.) Our code has always been open, but it seems those optimizations never made it back into the Snes9x code base.
Since the code has diverged a bit, I am currently backporting this part myself to the latest Snes9x code on my local machine.
First optimization (all cases): saves two shifts and two ANDs per pixel while giving bit-exact results. Second optimization (special case): when the texture-space step vector is [aa, 0], the loop is specialized for horizontal stretching only (again bit-exact; precomputed constant values are simply moved outside the pixel loop).
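The second optimization is essentially loop-invariant hoisting. Here is a minimal C sketch of the idea, not the PSP or Snes9x code. It assumes a hypothetical, flattened memory model (`Mode7Mem`: one byte per tile index and per pixel, a 1024x1024 plane that always wraps) instead of the real interleaved SNES VRAM layout, and it ignores the out-of-bounds and flip settings. `render_generic` steps through texture space by (a, c) per pixel in 8.8 fixed point. `render_hstretch` handles c == 0, where the texture row is constant across the scanline, so every row-dependent term is computed once before the loop. `check_bit_exact` (also hypothetical) compares both on pseudo-random data, which is the kind of check that catches bit operations that no longer give the same results.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PLANE_MASK 0x3ff /* 1024-pixel plane, wrap-around only (assumption) */

/* Hypothetical simplified Mode 7 memory, NOT the real SNES VRAM interleave. */
typedef struct {
    uint8_t tilemap[128 * 128]; /* one tile index per 8x8 cell */
    uint8_t tiles[256 * 64];    /* 64 one-byte pixels per tile */
} Mode7Mem;

/* Reference loop: generic affine step (a, c) per pixel, 8.8 fixed point. */
static void render_generic(const Mode7Mem *m, int32_t xx, int32_t yy,
                           int32_t a, int32_t c, uint8_t *out, int width)
{
    for (int i = 0; i < width; i++) {
        int x = (xx >> 8) & PLANE_MASK;
        int y = (yy >> 8) & PLANE_MASK;
        int tile = m->tilemap[((y >> 3) << 7) + (x >> 3)];
        out[i] = m->tiles[(tile << 6) + ((y & 7) << 3) + (x & 7)];
        xx += a;
        yy += c;
    }
}

/* Special case c == 0 (horizontal stretching only): y never changes, so the
   tilemap row and the row-inside-tile offset are hoisted out of the loop. */
static void render_hstretch(const Mode7Mem *m, int32_t xx, int32_t yy,
                            int32_t a, uint8_t *out, int width)
{
    int y = (yy >> 8) & PLANE_MASK;
    const uint8_t *map_row = &m->tilemap[(y >> 3) << 7];
    int row_in_tile = (y & 7) << 3;
    for (int i = 0; i < width; i++) {
        int x = (xx >> 8) & PLANE_MASK;
        int tile = map_row[x >> 3];
        out[i] = m->tiles[(tile << 6) + row_in_tile + (x & 7)];
        xx += a;
    }
}

/* Fills memory with LCG noise, then renders several horizontal-only
   scanlines with both loops. Returns 1 if every output is identical. */
static int check_bit_exact(uint32_t seed)
{
    static Mode7Mem m;
    uint32_t s = seed;
    for (size_t i = 0; i < sizeof m.tilemap; i++) {
        s = s * 1664525u + 1013904223u;
        m.tilemap[i] = (uint8_t)(s >> 24);
    }
    for (size_t i = 0; i < sizeof m.tiles; i++) {
        s = s * 1664525u + 1013904223u;
        m.tiles[i] = (uint8_t)(s >> 24);
    }
    const int32_t steps[] = { 256, 128, 384, -200, 1000 };
    uint8_t ref[256], opt[256];
    for (int k = 0; k < 5; k++) {
        for (int32_t yy = -3000; yy < 3000; yy += 777) {
            int32_t xx = yy * 3 - 5000;
            render_generic(&m, xx, yy, steps[k], 0, ref, 256);
            render_hstretch(&m, xx, yy, steps[k], opt, 256);
            if (memcmp(ref, opt, sizeof ref) != 0)
                return 0;
        }
    }
    return 1;
}
```

Usage, inside a test `main`: `assert(check_bit_exact(1u));`. Both loops evaluate the same expressions for x and y, so the specialized loop stays bit-exact by construction; it only does less work per pixel.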
FYI, this speedup improved performance on the PSP by about 30 to 35% in the best case (walking on the map with no rotation in FF6, for example) and by 2 to 3% on average with rotation (frame rate on a 300 MHz CPU).
Of course, more complex blending modes and pixel operations make the savings smaller and smaller...
To be clear, the code does not lose much maintainability, but it grows by roughly 30 to 40 lines, I think. (Edit: it was +65 lines.)
While it may not matter on x86, I was wondering about ARM-based cores, and I personally believe it is still relevant today to have code that is faster and produces bit-identical results.
But I'd prefer to ask first whether I could commit a branch to this project.
Best regards.