hrydgard / ppsspp

A PSP emulator for Android, Windows, Mac and Linux, written in C++. Want to contribute? Join us on Discord at https://discord.gg/5NJB6dD or just send pull requests / issues. For discussion use the forums at forums.ppsspp.org.
https://www.ppsspp.org

Question about Block Transfer and CPU Cores #11687

Closed mrfixit2001 closed 5 years ago

mrfixit2001 commented 5 years ago

So I realize that block transfer effects are still rather slow, especially noticeable in Burnout and Tekken 6. During some troubleshooting on our RockPro64 (big.LITTLE architecture - 6 cores, 4 little and 2 big) we discovered that if we disabled the 4 little cores (A53) and forced the board to only utilize the 2 big cores (A72), it dramatically increased performance, and both games run at around 100% fps. So the question I have is: can you think of a way for me to programmatically force PPSSPP to only use the 2 big cores? Or just allow the block transfer effects to prioritize these cores? Open to ANY ideas or suggestions, willing to test!

unknownbrackets commented 5 years ago

This operation is mostly bound by the GPU driver. It does involve some CPU side conversion sometimes, but this is usually not the expensive part of the operation.

In some cases, I've suspected that governors may only look at application (read: not GPU driver) CPU usage to determine whether to use big or little cores. So if the GPU driver is super slammed, but PPSSPP is just sitting waiting for the GPU driver to finish, the governor may think the best option is to downclock.

As far as I know, some popular Android game apps include bitcoin miners. Ironically, this probably improves game performance (at a battery cost), because it keeps the CPU evenly clocked up.

Also, repeating something I've said before here:

This is one of those strange, paradoxical things about mobile devices. Basically, device makers had these options:

  1. Run apps like PPSSPP at MAX SPEED ALL THE TIME, and have a battery life of 30 minutes.
  2. Run apps only as fast as they absolutely NEED to be, and have hours or days of battery life.

Some device makers picked option 1, but no one bought their phones because crappy battery life sucks, so those companies or divisions received Darwin Awards and everyone left was choosing option 2. Now they had a new choice:

  1. Ask apps how fast they want to run.
  2. Decide on the operating system level how fast the apps ought to run.

Some device makers again picked option 1. Unfortunately, app makers all wanted their apps (and only theirs, not other loser apps made by other people) to be FAST, so it became the same as option 1 the other time around (since every app said it wanted MAX SPEED), and no one bought those phones. A new round of Darwin Awards, and all the phone makers left had chosen option 2.

And that's where we are now. Phone makers call the shots and "figure out" what speed your phone should be when running PPSSPP.

When you run an app for a while, it decides it needs to maximize battery life and gives that app the slow mode dunce cap. When a new app pops up and wants its 5 seconds of glory, that app gets MAX SPEED. Those are just the rules on Android - PPSSPP has no say in the matter.

-[Unknown]

mrfixit2001 commented 5 years ago

Thank you greatly for the detailed response! A few more pieces of information: I'm not using a mobile device. It's an SBC similar to an RPi, with an RK3399 chipset, running Linux, not Android.

So you're saying you think perhaps adjusting the CPU governors in the kernel may have a positive effect? No suggestions for compile flags or code tweaks to force the big cores in ppsspp?

unknownbrackets commented 5 years ago

Ah, not Android. Well, I think CPU governors are similar but it may in that case be possible to set cpu affinity.

It's still a minefield when it comes to the CPU because sometimes the kernel will not present the cores as "different" cores. Read this:

https://www.sisoftware.co.uk/2015/06/22/arm-big-little-the-trouble-with-heterogeneous-multi-processing-when-4-are-better-than-8-or-when-8-is-not-always-the-lucky-number/

https://medium.com/@jadr2ddude/a-big-little-problem-a-tale-of-big-little-gone-wrong-e7778ce744bb

It's possible there may be no API that exposes the "big" cores and therefore no way to set affinity. But it depends on the kernel and the chip.

Check /sys/devices/system/cpu/possible, /sys/devices/system/cpu/active, /sys/devices/system/cpu/inactive, etc. If you see 6 total cores (0-5) then it's probably possible for you to choose. Maybe pthread_setaffinity_np might work in that case.
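As a rough sketch of that check: those files hold CPU lists in a form like "0-5" or "0-3,5", so they can be expanded into individual core numbers before building an affinity mask. The names `ParseCpuList` and `ReadCpuList` are made up for illustration and are not part of PPSSPP; malformed input is not handled.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Expands a kernel CPU list such as "0-5" or "0-3,5" into individual indices.
// Hypothetical helper; assumes well-formed input.
std::vector<int> ParseCpuList(const std::string &list) {
    std::vector<int> cpus;
    std::stringstream ss(list);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (item.empty())
            continue;
        size_t dash = item.find('-');
        int first = std::stoi(item.substr(0, dash));
        int last = dash == std::string::npos ? first : std::stoi(item.substr(dash + 1));
        for (int cpu = first; cpu <= last; ++cpu)
            cpus.push_back(cpu);
    }
    return cpus;
}

// Reads one of the list files under /sys/devices/system/cpu (e.g. "possible").
// Returns an empty vector if the file is missing or empty.
std::vector<int> ReadCpuList(const std::string &name) {
    std::ifstream in("/sys/devices/system/cpu/" + name);
    std::string line;
    if (!std::getline(in, line))
        return {};
    return ParseCpuList(line);
}
```

If `ReadCpuList("possible")` comes back with six entries on a board like this one, the kernel is exposing all cores individually and setting affinity is at least worth trying.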

-[Unknown]

mrfixit2001 commented 5 years ago

So the relevant entries under /sys/devices/system/cpu that I have are isolated, offline, online, possible, and present. Along with folders for cpu0-5. And "0-5" is the value for possible, online, and present. I can confirm that CPUs 0-3 are little, while 4 and 5 are the big ones. So you have me hopeful that pthread_setaffinity_np is possible. From reading some of your commits, would the best place for this be inside GLRenderManager.cpp? How would you suggest testing this implementation for GLES / GBM graphics?

mrfixit2001 commented 5 years ago

I went ahead and tested a few things... and your comment was able to lead me to a potential solution :) This works:

--- a/ext/native/thread/threadutil.cpp
+++ b/ext/native/thread/threadutil.cpp
@@ -98,6 +98,15 @@

 #if defined(__ANDROID__) || (defined(__GLIBC__) && defined(_GNU_SOURCE))
    pthread_setname_np(pthread_self(), threadName);
+
+   const pthread_t pid = pthread_self();
+   int temp;
+   cpu_set_t cpu_set;
+   CPU_ZERO(&cpu_set);
+   CPU_SET(4, &cpu_set);
+   CPU_SET(5, &cpu_set);
+   temp = pthread_setaffinity_np(pid, sizeof(cpu_set_t), &cpu_set);
+   printf("setaffinity=%d\n", temp);
 #elif defined(__APPLE__)
    pthread_setname_np(threadName);
 // #else

I have no doubt that you will have a better way of implementing this, but at least I can confirm it does seem to do what I'm after and force everything to run only on cores 4 and 5. Preferably I would only do this for the GPU threads and not all of them, if you are able to guide me to accomplish that.
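For anyone wanting to try the same thing on individual threads, the affinity call from the patch could be wrapped in a small helper and called from whichever thread should be pinned. This is only a sketch for Linux with glibc; `PinCurrentThreadToCpus` is a hypothetical name, not an existing PPSSPP function. Note that `pthread_setaffinity_np` returns an error number directly instead of setting `errno`.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstring>
#include <vector>

// Restricts the calling thread to the given CPU indices (Linux/glibc only).
// Returns true on success. Hypothetical helper for illustration.
bool PinCurrentThreadToCpus(const std::vector<int> &cpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus) {
        if (cpu >= 0 && cpu < CPU_SETSIZE)
            CPU_SET(cpu, &set);
    }
    // The return value is the error number itself (0 means success).
    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %s\n", strerror(err));
        return false;
    }
    return true;
}
```

On the board discussed here, a GPU thread would call `PinCurrentThreadToCpus({4, 5})` once at startup. The core numbers are specific to this RK3399 setup and should not be hard-coded in general.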

unknownbrackets commented 5 years ago

I wonder if it's better to run everything on the big cores if running anything there. Could be some performance testing needed to determine that...

Does anything under /sys/devices/system/cpu indicate that these cores are big cores? For example /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq vs /sys/devices/system/cpu/cpu5/cpufreq/cpuinfo_max_freq?

-[Unknown]

mrfixit2001 commented 5 years ago

I will check that value later if you still need, but I was thinking this is a good way to check which cores are big:

cat /proc/cpuinfo

processor : 0
model name : ARMv8 Processor rev 4 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt lpae evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

processor : 1
model name : ARMv8 Processor rev 4 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt lpae evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

processor : 2
model name : ARMv8 Processor rev 4 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt lpae evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

processor : 3
model name : ARMv8 Processor rev 4 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt lpae evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

processor : 4
model name : ARMv8 Processor rev 2 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt lpae evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 2

processor : 5
model name : ARMv8 Processor rev 2 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt lpae evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 2

Serial : 0000000000000000

unknownbrackets commented 5 years ago

Well, without keeping a database of processors there's no way to tell there which are big and little generically. I mean, 4/5 are older revision and slightly different part #, but that's it. Same bogomips even. And using a database seems silly, especially as new devices come out.

-[Unknown]

mrfixit2001 commented 5 years ago

I agree. Something like this would work, but would need to be constantly updated.

    switch (cpuid.implementer) {
    case 0x41: // ARM
        switch (cpuid.part) {
        case 0xb02: return CPU::arm_mpcore;
        case 0xb36: return CPU::arm_1136jf_s;
        case 0xb56: return CPU::arm_1156t2f_s;
        case 0xb76: return CPU::arm_1176jzf_s;
        case 0xc20: return CPU::arm_cortex_m0;
        case 0xc21: return CPU::arm_cortex_m1;
        case 0xc23: return CPU::arm_cortex_m3;
        case 0xc24: return CPU::arm_cortex_m4;
        case 0xc27: return CPU::arm_cortex_m7;
        case 0xd20: return CPU::arm_cortex_m23;
        case 0xd21: return CPU::arm_cortex_m33;
        case 0xc05: return CPU::arm_cortex_a5;
        case 0xc07: return CPU::arm_cortex_a7;
        case 0xc08: return CPU::arm_cortex_a8;
        case 0xc09: return CPU::arm_cortex_a9;
        case 0xc0d: return CPU::arm_cortex_a12;
        case 0xc0f: return CPU::arm_cortex_a15;
        case 0xc0e: return CPU::arm_cortex_a17;
        case 0xc14: return CPU::arm_cortex_r4;
        case 0xc15: return CPU::arm_cortex_r5;
        case 0xc17: return CPU::arm_cortex_r7;
        case 0xc18: return CPU::arm_cortex_r8;
        case 0xd13: return CPU::arm_cortex_r52;
        case 0xd01: return CPU::arm_cortex_a32;
        case 0xd04: return CPU::arm_cortex_a35;
        case 0xd03: return CPU::arm_cortex_a53;
        case 0xd05: return CPU::arm_cortex_a55;
        case 0xd07: return CPU::arm_cortex_a57;
        case 0xd08: return CPU::arm_cortex_a72;
        case 0xd09: return CPU::arm_cortex_a73;
        case 0xd0a: return CPU::arm_cortex_a75;
        default: return CPU::generic;
        }
    case 0x42: // Broadcom (Cavium)
        switch (cpuid.part) {
        case 0x516: return CPU::cavium_thunderx2t99p1;
        default: return CPU::generic;
        }
    case 0x43: // Cavium
        switch (cpuid.part) {
        case 0xa0: return CPU::cavium_thunderx;
        case 0xa1:
            if (cpuid.variant == 0)
                return CPU::cavium_thunderx88p1;
            return CPU::cavium_thunderx88;
        case 0xa2: return CPU::cavium_thunderx81;
        case 0xa3: return CPU::cavium_thunderx83;
        case 0xaf: return CPU::cavium_thunderx2t99;
        default: return CPU::generic;
        }
    case 0x4e: // NVIDIA
        switch (cpuid.part) {
        case 0x000: return CPU::nvidia_denver1;
        case 0x003: return CPU::nvidia_denver2;
        default: return CPU::generic;
        }
    case 0x50: // AppliedMicro
        // x-gene 2
        // x-gene 3
        switch (cpuid.part) {
        case 0x000: return CPU::apm_xgene1;
        default: return CPU::generic;
        }
    case 0x51: // Qualcomm
        switch (cpuid.part) {
        case 0x00f:
        case 0x02d:
            return CPU::qualcomm_scorpion;
        case 0x04d:
        case 0x06f:
            return CPU::qualcomm_krait;
        case 0x201:
        case 0x205:
        case 0x211:
            return CPU::qualcomm_kyro;
        case 0x800:
        case 0x801:
            return CPU::arm_cortex_a73; // second-generation Kryo
        case 0xc00:
            return CPU::qualcomm_falkor;
        case 0xc01:
            return CPU::qualcomm_saphira;
        default: return CPU::generic;
        }
    case 0x53: // Samsung
        // exynos-m2
        // exynos-m3
        switch (cpuid.part) {
        case 0x001: return CPU::samsung_exynos_m1;
        default: return CPU::generic;
        }
    case 0x56: // Marvell
        switch (cpuid.part) {
        case 0x581:
        case 0x584:
            return CPU::marvell_pj4;
        default: return CPU::generic;
        }
    case 0x67: // Apple
        // swift
        // cyclone
        // twister
        // hurricane
        switch (cpuid.part) {
        case 0x072: return CPU::apple_typhoon;
        default: return CPU::generic;
        }
    case 0x69: // Intel
        switch (cpuid.part) {
        case 0x001: return CPU::intel_3735d;
        default: return CPU::generic;
        }
    default:
        return CPU::generic;
    }

mrfixit2001 commented 5 years ago

So are you maybe thinking of simply looping thru all the CPUs in the device tree and determining which have the higher operating freq to identify big vs little?

And I'm still puzzled why we wouldn't want to let non GPU threads use the small cores?
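The frequency-based idea could look roughly like the sketch below: read `cpuinfo_max_freq` (reported in kHz) for each core and keep the cores that share the highest value. `FindFastestCores` and `ReadMaxFreqs` are hypothetical names, and treating "highest max frequency" as "big core" is an assumption that may not hold on every chip.

```cpp
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

// Given each core's maximum frequency in kHz (index = CPU number), returns the
// indices of the cores sharing the highest value. Returns an empty vector when
// all cores report the same frequency, since there is then nothing to prefer.
std::vector<int> FindFastestCores(const std::vector<long> &maxFreqKHz) {
    std::vector<int> fastest;
    if (maxFreqKHz.empty())
        return fastest;
    auto range = std::minmax_element(maxFreqKHz.begin(), maxFreqKHz.end());
    if (*range.first == *range.second)
        return fastest;
    for (size_t i = 0; i < maxFreqKHz.size(); ++i) {
        if (maxFreqKHz[i] == *range.second)
            fastest.push_back(static_cast<int>(i));
    }
    return fastest;
}

// Reads cpuinfo_max_freq for CPUs 0..count-1. A core whose file cannot be
// read is recorded as 0, so it is never treated as a fast core.
std::vector<long> ReadMaxFreqs(int count) {
    std::vector<long> freqs;
    for (int i = 0; i < count; ++i) {
        std::ifstream in("/sys/devices/system/cpu/cpu" + std::to_string(i) +
                         "/cpufreq/cpuinfo_max_freq");
        long khz = 0;
        in >> khz;
        freqs.push_back(khz);
    }
    return freqs;
}
```

Unlike a part-number table, this needs no per-chip database, but it only works when the kernel exposes all cores at once along with their cpufreq entries.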

mrfixit2001 commented 5 years ago

For example, this seems to also work, and I think it only applies to GL threads, but I may be wrong. Let me know?

--- a/ext/native/thin3d/GLRenderManager.cpp
+++ b/ext/native/thin3d/GLRenderManager.cpp
@@ -92,6 +92,13 @@
    threadFrame_ = threadInitFrame_;
    renderThreadId = std::this_thread::get_id();

+   cpu_set_t cpu_set;
+   CPU_ZERO(&cpu_set);
+   CPU_SET(4, &cpu_set);
+   CPU_SET(5, &cpu_set);
+   int temp = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpu_set);
+   printf("setaffinity=%d\n", temp);
+
    bool mapBuffers = (gl_extensions.bugs & BUG_ANY_MAP_BUFFER_RANGE_SLOW) == 0;
    bool hasBufferStorage = gl_extensions.ARB_buffer_storage || gl_extensions.EXT_buffer_storage;
    if (!gl_extensions.VersionGEThan(3, 0, 0) && gl_extensions.IsGLES && !hasBufferStorage) {

mrfixit2001 commented 5 years ago

So after some testing the patch applying to all threads performs better for sure. But still not as well as just disabling the 4 little cores and using only the two big ones. Any thoughts as to why?

Also discovered v1.5.4 (pre multithreading) performs perfectly with no changes at all. Just for testing, can you think of a way to disable multithreading?

unknownbrackets commented 5 years ago

Not really. A lot of things had to be rewritten to enable multithreading. It could be something we are simply doing differently now after the rewrite and not necessarily related to the multithreading.

Did you get any frequency information out of /sys? Or was it the same?

If the big cores are going to be powered up anyway, putting everything on them is probably the best bet. PSP games were all built to run on a single CPU core, so their code cannot benefit from 4, 6, 32, whatever number of cores - just a single one. And your GPU driver is most likely single threaded too. Those are the most expensive things PPSSPP does, so you're probably just not "using up" the two big cores, and therefore putting anything on the little cores is just making that thing slower.

If your big cores were totally maxed out, maybe offloading something less critical path to the little cores would help. That's why I said it'd require benchmarking to be sure.

If you compare using htop or something, is 1.5.4 using different cores than 1.7.5 (with and without the affinity)? It might be that 1.5.4 just triggered whatever mystery heuristic to run on the big core. Or maybe some driver person thought keeping databases was a good idea and put PPSSPP 1.5.4 in a db, and 1.7.5 is just slower because the database has unsurprisingly not been updated.

-[Unknown]

mrfixit2001 commented 5 years ago

The two cores do return different max freq:

cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1416000
cat /sys/devices/system/cpu/cpu5/cpufreq/cpuinfo_max_freq
1992000

I've done a LOT of testing of adding affinity to different locations in the code. I tested adding it to every spot where std::thread is being used. But I ultimately discovered that the only location where this code is actually required is in SDLMain.cpp, right at the beginning of int main(). That seems to pretty much force everything onto the cores I've set affinity to, and it means that on my rk3399 I can now run Burnout and Tekken with zero frameskip at PSP x2 resolution at 100% fps :)

mrfixit2001 commented 5 years ago

So assuming you can confirm the patches I've listed above are correctly setting affinity to the big cores, then it's safe for you to close this issue out. Perhaps in the future you may consider adding the option in the settings menu for users to control affinity themselves and select which cores to assign the main thread to?

unknownbrackets commented 5 years ago

Well, I assume putting it at the top forces all threads created by the main thread to use the same affinity.
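That assumption matches documented Linux behavior: a thread created with pthread_create starts with a copy of its creator's affinity mask, while threads that already exist are not affected. A minimal sketch to check this on a given system follows; `ChildInheritsAffinity` is a made-up name for illustration.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pins the calling thread to a single CPU, then checks whether a thread it
// spawns starts with the same affinity mask. Returns true if the masks match.
// Hypothetical helper; Linux/glibc only.
bool ChildInheritsAffinity(int cpu) {
    cpu_set_t parent;
    CPU_ZERO(&parent);
    CPU_SET(cpu, &parent);
    if (pthread_setaffinity_np(pthread_self(), sizeof(parent), &parent) != 0)
        return false;

    cpu_set_t child;
    CPU_ZERO(&child);
    std::thread worker([&child] {
        // Runs on the new thread and records the mask it started with.
        pthread_getaffinity_np(pthread_self(), sizeof(child), &child);
    });
    worker.join();
    return CPU_EQUAL(&parent, &child) != 0;
}
```

This is why the call has to come before any other thread is started: setting affinity later in the main thread would leave the already-running threads on whatever cores the scheduler picked.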

If you run this:

cat /sys/class/power_supply/*/type

Do any of the entries contain Battery?

What I'm thinking is we could just do this automatically, if:

  1. The OS is Linux.
  2. No power supply is a battery (not laptop/tablet/phone.)
  3. There are cores exposed with a higher max freq.
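The three conditions above can be sketched as follows. This is only an illustration under stated assumptions, not how PPSSPP implements anything: `HasBattery`, `ReadPowerSupplyTypes` and `ShouldPreferFastCores` are hypothetical names, and the per-core max frequencies (in kHz) are expected to be supplied by the caller.

```cpp
#include <dirent.h>
#include <fstream>
#include <string>
#include <vector>

// True if any power supply type string is "Battery".
bool HasBattery(const std::vector<std::string> &types) {
    for (const std::string &type : types) {
        if (type == "Battery")
            return true;
    }
    return false;
}

// Collects the contents of /sys/class/power_supply/*/type. A missing or empty
// directory yields an empty vector.
std::vector<std::string> ReadPowerSupplyTypes() {
    std::vector<std::string> types;
    const std::string base = "/sys/class/power_supply/";
    DIR *dir = opendir(base.c_str());
    if (!dir)
        return types;
    while (struct dirent *entry = readdir(dir)) {
        if (entry->d_name[0] == '.')
            continue;  // skips "." and ".."
        std::ifstream in(base + entry->d_name + "/type");
        std::string type;
        if (std::getline(in, type))
            types.push_back(type);
    }
    closedir(dir);
    return types;
}

// Combines the conditions: Linux, no battery, and at least one core with a
// higher max frequency than another.
bool ShouldPreferFastCores(const std::vector<std::string> &supplyTypes,
                           const std::vector<long> &maxFreqKHz) {
#if defined(__linux__)
    if (HasBattery(supplyTypes) || maxFreqKHz.empty())
        return false;
    long lowest = maxFreqKHz[0];
    long highest = maxFreqKHz[0];
    for (long freq : maxFreqKHz) {
        if (freq < lowest)
            lowest = freq;
        if (freq > highest)
            highest = freq;
    }
    return highest > lowest;
#else
    return false;
#endif
}
```

A caller would pass `ReadPowerSupplyTypes()` and the values read from each core's `cpuinfo_max_freq`, and only set affinity when the function returns true.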

That might also benefit e.g. Pi 3 or ODROID, etc. But probably they are not simultaneously exposed: https://forum.odroid.com/viewtopic.php?t=2580

-[Unknown]

mrfixit2001 commented 5 years ago

Yes, you're correct, it seems to push all threads that it spawns into the same affinity. Prior to setting it at the main thread, I applied it in a number of different locations that spawn threads, and while that did yield positive results, it was only 80% improved when compared to pushing ALL threads to the big cores.

For my implementation of Linux, the path /sys/class/power_supply has nothing in it. The device is a SBC without a battery. And yes, I would expect this to benefit any and all big.LITTLE architecture boards.

GuilhermeGS2 commented 5 years ago

Usually the big cores are the last ones, so is there no way to force PPSSPP to use only the last core(s)?

mrfixit2001 commented 5 years ago

Setting affinity does exactly that. I'm all set, thanks for the help!

unknownbrackets commented 5 years ago

It'd still be great to get a generic implementation in core for everyone.

-[Unknown]

hrydgard commented 5 years ago

Yeah, should we open a new issue for that or reuse this one? Leaning towards the former.

luizthiagor commented 5 years ago

I believe this change could be great, even for the RPi 3. As I reported in an issue here in October 2018, the last version we can use on the RPi with good performance is 1.5.4, so to this day we can't update and get fixes for issues in the old version.

Maybe this could be the key to solving the slow performance after 1.6.3 on this board. If possible, a "multithread ON/OFF" or "Affinity" option could solve the issue and let us run new versions. Hoping for this one day and, again, thanks to all the devs and the team of expert users of PPSSPP!

proganime1200 commented 4 years ago

Can we reopen this for lower-end devices? Tekken 6 has performance issues even on an octa-core 2 GHz Redmi 7A.

unknownbrackets commented 4 years ago

We generally have very little influence over scheduling on Android. Most likely, only 4 of your cores (the 1.95 GHz ones) are active at a time, and thermal throttling may prevent this from lasting more than a few minutes.

On Android, possibly the best option we have is "sustained performance."

This issue was about devices that run Linux, do less thermal throttling, and give more control.

-[Unknown]