official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0

Incorrect concurrency on multi-CPU systems #5307

Closed. jdp1024 closed this issue 4 months ago.

jdp1024 commented 4 months ago

Describe the issue

I have a dual EPYC 7763 system running Windows 11 23H2 with 128GB RAM. With the NUMA commit a169c78b6d3b082068deb49a39aaa1fd75464c7f, a 256-thread Stockfish only uses about 25% of the CPU time.

Expected behavior

Use 100% of the CPU.

Steps to reproduce

Run stockfish bench 512 256 24, then watch Task Manager.

Anything else?

Stockfish uses std::thread::hardware_concurrency to get the number of CPU threads. The function doesn't take the number of CPUs (processor groups on Windows) into account.
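
For illustration (not part of the report), a minimal sketch that compares the value std::thread::hardware_concurrency() returns with the all-groups count from the Windows API; on a machine with more than 64 logical CPUs, the former may report only the current processor group:

#define _WIN32_WINNT 0x0601  // Windows 7+: processor-group APIs
#include <iostream>
#include <thread>
#include <windows.h>

int main() {
    // May report only the current processor group (e.g. 64 of 256).
    std::cout << "hardware_concurrency: "
              << std::thread::hardware_concurrency() << '\n';
    // Counts logical processors across all processor groups.
    std::cout << "all processor groups: "
              << GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) << '\n';
    return 0;
}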

I made a rough patch; after applying it, Stockfish can use all of the CPU time on my system.

diff --git a/src/numa.h b/src/numa.h
index 03ee1fdf..5df40c5a 100644
--- a/src/numa.h
+++ b/src/numa.h
@@ -61,6 +61,9 @@ using SetThreadSelectedCpuSetMasks_t = BOOL (*)(HANDLE, PGROUP_AFFINITY, USHORT)
 // https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity
 using SetThreadGroupAffinity_t = BOOL (*)(HANDLE, const GROUP_AFFINITY*, PGROUP_AFFINITY);

+// https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformationex
+using GetLogicalProcessorInformationEx_t = BOOL (*)(LOGICAL_PROCESSOR_RELATIONSHIP, PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX, PDWORD);
+
 #endif

 #include "misc.h"
@@ -70,8 +73,51 @@ namespace Stockfish {
 using CpuIndex  = size_t;
 using NumaIndex = size_t;

+inline CpuIndex get_hardware_concurrency()
+{
+    CpuIndex r = std::max<CpuIndex>(1, std::thread::hardware_concurrency());
+#ifdef _WIN32
+    HMODULE k32 = GetModuleHandle(TEXT("Kernel32.dll"));
+    auto GetLogicalProcessorInformationEx_f = GetLogicalProcessorInformationEx_t(
+        (void (*)()) GetProcAddress(k32, "GetLogicalProcessorInformationEx"));
+    if (GetLogicalProcessorInformationEx_f == nullptr)
+        return r;
+
+    DWORD length = 0;
+    if (GetLogicalProcessorInformationEx_f(RelationAll, nullptr, &length))
+        return r;
+    std::unique_ptr<void, void (*)(void*)> buffer(std::malloc(length), std::free);
+    if (!buffer)
+        return r;
+
+    unsigned char* mem = reinterpret_cast<unsigned char*>(buffer.get());
+    if (GetLogicalProcessorInformationEx_f(RelationAll, reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(mem), &length) == false)
+        return r;
+
+    size_t concurrency = 0;
+    for (DWORD i = 0; i < length;)
+    {
+        auto *proc = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(mem + i);
+        if (proc->Relationship == RelationProcessorCore)
+        {
+            for (WORD group = 0; group < proc->Processor.GroupCount; ++group)
+            {
+                for (KAFFINITY mask = proc->Processor.GroupMask[group].Mask; mask != 0; mask >>= 1)
+                {
+                    concurrency += mask & 1;
+                }
+            }
+        }
+        i += proc->Size;
+    }
+    return concurrency;
+#else
+    return r;
+#endif
+}
+
 inline const CpuIndex SYSTEM_THREADS_NB =
-  std::max<CpuIndex>(1, std::thread::hardware_concurrency());
+  get_hardware_concurrency();

 // We want to abstract the purpose of storing the numa node index somewhat.
 // Whoever is using this does not need to know the specifics of the replication

Operating system

Windows

Stockfish version

a169c78b6d3b082068deb49a39aaa1fd75464c7f

jdp1024 commented 4 months ago

The NUMA-aware version of Stockfish doesn't work well with ChessBase 17 even after my patch: only 25% CPU can be used. Previous commits don't have this problem.

CB17 sets its process affinity in its internal SVData constructor. I have to nop all calls to SetProcessAffinity in a debugger to make the NUMA-aware version of Stockfish use 100% CPU power.

vondele commented 4 months ago

@jdp1024 thanks for reporting and testing. I believe we will need to separate the two issues, i.e. the behavior under CB and without it.

As a quick workaround, one should be able to set the option NumaPolicy to none. Can you test whether this gives 100% CPU usage under CB?

Interesting that std::thread::hardware_concurrency doesn't report the full number of available threads (it does under Linux). This can probably be fixed easily along the lines of your patch.

I would be interested to know whether we see, with your patch, a performance benefit on your hardware. Could you test the difference in nodes/second from the start position between having the numa policy on auto and on none? On the command line, something like

setoption name NumaPolicy value system
setoption name Threads value 256
setoption name Hash value 40000
go movetime 100000

vs

setoption name NumaPolicy value none
setoption name Threads value 256
setoption name Hash value 40000
go movetime 100000

Disservin commented 4 months ago

https://stackoverflow.com/questions/70704458/stdthreadhardware-concurrency-does-not-return-the-correct-number-of-logica

Somewhat makes sense.

jdp1024 commented 4 months ago

I would be interested to know whether we see, with your patch, a performance benefit on your hardware. Could you test the difference in nodes/second from the start position between having the numa policy on auto and on none?

I did a quick test.

(1) setoption name NumaPolicy value none

info depth 44 seldepth 62 multipv 1 score cp 21 nodes 5963915041 nps 59637361 hashfull 618 tbhits 0 time 100003 pv e2e4 e7e5 g1f3 b8c6 f1c4 f8c5 e1g1 g8f6 d2d3 d7d6 c2c3 a7a5 f1e1 e8g8 h2h3 c5a7 b1a3 c6e7 d3d4 e7g6 c1e3 c7c6 c4f1 b7b5 a3b1 e5d4 e3d4 a5a4 b1d2 h7h6 d1c2 f6d7 d4a7 a8a7 a1d1 f8e8
info depth 46 seldepth 67 multipv 1 score cp 23 nodes 5963915155 nps 59636766 hashfull 618 tbhits 0 time 100004 pv e2e4 e7e5 g1f3 b8c6 f1c4 f8c5 d2d3 g8f6 e1g1 e8g8 c2c3 d7d6 f1e1 a7a5 h2h3 c5a7 b1a3 h7h6 a3b5 a7b6 c1e3 b6e3 e1e3 f8e8 d1b3 c8e6 c4e6 e8e6 c3c4 a5a4 b3d1 e6e8 a1c1 f6d7 d3d4 c6d4 f3d4 e5d4 d1d4 d7c5 e3g3 g7g6 b5c3 c7c6 f2f4 g8h7 g3f3 d8e7 c1e1 a8d8
bestmove e2e4 ponder e7e5

(2) setoption name NumaPolicy value system

info depth 46 seldepth 66 multipv 1 score cp 23 lowerbound nodes 8711322910 nps 87112357 hashfull 786 tbhits 0 time 100001 pv e2e4
info depth 46 seldepth 71 multipv 1 score cp 22 nodes 8711322944 nps 87111487 hashfull 786 tbhits 0 time 100002 pv e2e4 e7e5 g1f3 b8c6 f1b5 g8f6 e1g1 f6e4 d2d4 e4d6 b5c6 d7c6 d4e5 d6f5 d1d8 e8d8 b1c3 f8e7 f1d1 d8e8 g2g4 f5h4 f3h4 e7h4 h2h3 h7h5 f2f3 c8e6 c3e2 e6d5 g1g2 f7f6 e2f4 a8d8 c1e3 e8f7 f4h5 f6e5 d1d2 h8f8 h5g3 f7e6 a1f1 d5c4 f1d1 d8d2 d1d2 c4d5 g3e4 a7a5 e3f2 h4e7 a2a4 g7g6 d2e2 b7b5 b2b3 b5a4 b3a4
bestmove e2e4 ponder e7e5

vondele commented 4 months ago

nice, so the benefit is significant: about a 46% speedup.

Ipmanchess commented 4 months ago

Hi, are you also interested in running some chess benches for this list: https://ipmanchess.yolasite.com/amd--intel-chess-bench-stockfish.php

Sopel97 commented 4 months ago

The NUMA-aware version of Stockfish doesn't work well with ChessBase 17 even after my patch: only 25% CPU can be used. Previous commits don't have this problem.

CB17 sets its process affinity in its internal SVData constructor. I have to nop all calls to SetProcessAffinity in a debugger to make the NUMA-aware version of Stockfish use 100% CPU power.

let's make this clear: are you running stockfish through chessbase or from the CLI for these tests?

from what you describe it appears that stockfish is working correctly, and it's an issue with the chessbase software setting affinity to a single processor group (any call to SetProcessAffinity will always do that), which stockfish inherits and now respects.
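
To illustrate the inheritance (a hypothetical sketch, not code from this thread): a parent process that restricts its own affinity, as the GUI does, passes that restriction on to any engine it launches.

#include <windows.h>

int main() {
    // Play the role of the GUI: restrict this process to two logical CPUs
    // (which also pins it to a single processor group).
    SetProcessAffinityMask(GetCurrentProcess(), 0x3);

    // Launch the engine; the child inherits the restricted affinity.
    STARTUPINFOW si = {};
    si.cb = sizeof(si);
    PROCESS_INFORMATION pi = {};
    wchar_t cmd[] = L"stockfish.exe";  // hypothetical path
    if (CreateProcessW(nullptr, cmd, nullptr, nullptr, FALSE, 0,
                       nullptr, nullptr, &si, &pi)) {
        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }
    return 0;
}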

jdp1024 commented 4 months ago

The NUMA-aware version of Stockfish doesn't work well with ChessBase 17 even after my patch: only 25% CPU can be used. Previous commits don't have this problem.

CB17 sets its process affinity in its internal SVData constructor. I have to nop all calls to SetProcessAffinity in a debugger to make the NUMA-aware version of Stockfish use 100% CPU power.

let's make this clear: are you running stockfish through chessbase or from the CLI for these tests?

from what you describe it appears that stockfish is working correctly, and it's an issue with the chessbase software setting affinity to a single processor group (any call to SetProcessAffinity will always do that), which stockfish inherits and now respects.

I found the problem when I used CB17 with the NUMA-aware version of Stockfish; CB17 worked fine with all previous versions of Stockfish. To isolate the problem, I ran stockfish bench from a fresh console, and the performance was limited there as well. std::thread::hardware_concurrency not accounting for multiple CPUs was the reason. After my patch, stockfish ran at full power (the test in this post was with a standalone stockfish, not with CB17).

I ran it with CB17 again, but CPU usage was still limited: stockfish inherited CB17's processor affinity. I had to run it under a debugger in which I NOP'd all calls to SetProcessAffinity; after that, CB17 worked fine with the NUMA-aware stockfish at full power.

Sopel97 commented 4 months ago

Alright. Two issues were uncovered.

  1. std::thread::hardware_concurrency was not returning the expected value; this will be fixed by #5311
  2. Our logic for determining the current affinity on Windows was incorrect due to several invalid assumptions; this will be mostly fixed (as much as possible given the Windows API) by #5312

The problem with Chessbase will persist, however, as its SetProcessAffinity call is inherited by Stockfish and is now respected. This is considered correct behaviour; the old behaviour was not correct (though I still don't see how it would change anything, since SetProcessAffinity should have prevented Stockfish from running on all processors anyway; Stockfish may have been overriding this with the old logic). This is a bug in Chessbase, and a very serious one, because it actually limits execution to a single processor group.
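
For context, a minimal sketch (an illustration added here, not code from #5311 or #5312) that prints the processor-group layout Windows exposes; groups hold at most 64 logical processors, so a 256-thread machine has at least four:

#define _WIN32_WINNT 0x0601  // Windows 7+: processor-group APIs
#include <iostream>
#include <windows.h>

int main() {
    // Windows splits >64 logical processors into groups of at most 64.
    WORD groups = GetActiveProcessorGroupCount();
    std::cout << "processor groups: " << groups << '\n';
    for (WORD g = 0; g < groups; ++g)
        std::cout << "  group " << g << ": "
                  << GetActiveProcessorCount(g) << " logical CPUs\n";
    std::cout << "total: "
              << GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) << '\n';
    return 0;
}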

vondele commented 4 months ago

@jdp1024 please verify that, after the merge, the binary works as expected for you.

jdp1024 commented 4 months ago

I see 100% CPU usage when I run the same test as in the previous post.

For setoption name NumaPolicy value none:

info depth 48 seldepth 58 multipv 1 score cp 13 upperbound nodes 6079598829 nps 60791732 hashfull 652 tbhits 0 time 100007 pv e2e4 e7e5
info depth 47 seldepth 72 multipv 1 score cp 17 nodes 6079598945 nps 60791126 hashfull 652 tbhits 0 time 100008 pv e2e4 e7e5 g1f3 b8c6 f1c4 g8f6 d2d3 f8c5 e1g1 d7d6 c2c3 e8g8 f1e1 a7a6 h2h3 h7h6 a2a4 a6a5 c4b5 c5a7 c1e3 a7e3 e1e3 c6e7 d3d4 c7c6 b5f1 d8c7 e3e1 e7g6 c3c4 f6h7 a1a3 g6f4 d1d2 f7f6 d4d5 h7g5 e1e3 c8d7 b1c3 c6c5
bestmove e2e4 ponder e7e5

For setoption name NumaPolicy value system:

info depth 47 seldepth 60 multipv 1 score cp 17 lowerbound nodes 7884406998 nps 78839339 hashfull 797 tbhits 0 time 100006 pv e2e4
info depth 46 seldepth 70 multipv 1 score cp 18 nodes 7884406998 nps 78838551 hashfull 797 tbhits 0 time 100007 pv e2e4 e7e5 g1f3 b8c6 f1b5 g8f6 e1g1 f6e4 f1e1 e4d6 f3e5 c6e5 e1e5 f8e7 b5f1 e8g8 d2d4 e7f6 e5e2 d6f5 c2c3 d7d5 b1d2 a7a5 a2a4 c7c6 e2e1 f5d6 h2h3 c8f5 d2f3 d8b6 c1f4 d6e4 d1c2 f5g6 f3e5 f6e5 f4e5 f8e8 e5f4 e4g3 f1d3 g3h5 d3g6 h7g6
bestmove e2e4 ponder e7e5

The NPS drops from 87111487 to 78838551. I ran the test several times; the difference is the same.

jdp1024 commented 4 months ago

The NPS drop is caused by the new main net 35aff79843658aef55426d5d88be412f54d936b8.

Sopel97 commented 4 months ago

I guess it's possible if different lines are explored.

Sopel97 commented 4 months ago

@jdp1024 Hey, you're the only person I (hopefully) have contact with who has a Windows machine with >64 logical CPUs. I'm trying to make a wrapper for Stockfish as a workaround for these GUIs, but for it to work I need to verify whether SetProcessAffinityMask calls can actually be reversed at all. Would you be able to compile the following code (g++ a.cpp -static), run it, and give me the output? It tests whether there's a difference in affinities after being set and "reset". Thanks.

#include <iostream>
#include <windows.h>

void print_affinities()
{
    // Process-wide affinity mask and the system mask it is a subset of.
    DWORD_PTR proc, sys;
    BOOL status = GetProcessAffinityMask(GetCurrentProcess(), &proc, &sys);
    auto a = GetLastError();
    std::cout << a << ' ' << status << ' ' << proc << ' ' << sys << '\n';

    // Which processor groups the process is currently assigned to.
    USHORT GroupCount = 16;
    USHORT GroupArray[16];
    status = GetProcessGroupAffinity(GetCurrentProcess(), &GroupCount, GroupArray);
    a = GetLastError();
    std::cout << a << ' ' << status << ' ' << GroupCount << '\n';
    std::cout << '\t';
    for (int i = 0; i < GroupCount; ++i)
        std::cout << GroupArray[i] << ' ';
    std::cout << '\n';
}

int main()
{
    std::cout << "Default:\n";

    print_affinities();

    std::cout << "\nSet max:\n";

    DWORD_PTR proc, sys;
    BOOL status = GetProcessAffinityMask(GetCurrentProcess(), &proc, &sys);
    SetProcessAffinityMask(GetCurrentProcess(), sys);

    print_affinities();

    std::cout << "\nSet smaller:\n";

    SetProcessAffinityMask(GetCurrentProcess(), 0xa);

    print_affinities();

    std::cout << "\nSet max:\n";

    SetProcessAffinityMask(GetCurrentProcess(), sys);

    print_affinities();

    return 0;
}

jdp1024 commented 4 months ago

Would you be able to compile the following code (g++ a.cpp -static), run it, and give me the output? It tests whether there's a difference in affinities after being set and "reset". Thanks.

Default:
0 1 18446744073709551615 18446744073709551615
0 1 4
    0 1 2 3 

Set max:
0 1 18446744073709551615 18446744073709551615
0 1 1
    3 

Set smaller:
0 1 10 18446744073709551615
0 1 1
    3 

Set max:
0 1 18446744073709551615 18446744073709551615
0 1 1
    3 

SetProcessAffinityMask's side effect is irreversible: after the first call, the process stays confined to a single group (group 3 above), even when the mask is reset to the full system mask.

Sopel97 commented 4 months ago

thanks, so no hope

edit: I see a way: if we created a wrapper to run the GUI through, it could DLL-inject a SetProcessAffinityMask that does nothing. But at this point I don't want to bother. Hopefully this will force some changes for the better.

jdp1024 commented 4 months ago

thanks, so no hope

edit: I see a way: if we created a wrapper to run the GUI through, it could DLL-inject a SetProcessAffinityMask that does nothing. But at this point I don't want to bother. Hopefully this will force some changes for the better.

That's exactly what I've done for CB: DLL hijacking to turn SetProcessAffinityMask into a ret.

Before the NUMA patch, stockfish was unleashed and didn't need the trick.
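
For illustration, a minimal sketch of such a no-op stub (hypothetical; how the DLL gets loaded in place of the real import is out of scope here). The Windows types are declared by hand to avoid clashing with the dllimport declaration in <windows.h>:

// Build as a DLL, e.g.: g++ -shared -o stub.dll stub.cpp
using BOOL = int;
using HANDLE = void*;
using DWORD_PTR = unsigned long long;  // 64-bit Windows

// Exported no-op replacement: ignore the requested mask, report success.
extern "C" __declspec(dllexport)
BOOL __stdcall SetProcessAffinityMask(HANDLE /*process*/, DWORD_PTR /*mask*/) {
    return 1;  // TRUE
}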

Disservin commented 4 months ago

@Sopel97 i haven't followed this in depth, do you mind explaining why the new numa logic fails for the cb gui while the old logic worked? Can't we fall back to the old logic in case the numa policy is set to none?

Sopel97 commented 4 months ago

@Sopel97 i haven't followed this in depth, do you mind explaining why the new numa logic fails for the cb gui while the old logic worked? Can't we fall back to the old logic in case the numa policy is set to none?

The old logic was to force-bind all threads, spread out over all NUMA nodes, when there were more than 8 threads. And since it happened to use the same mechanism (SetThreadAffinityMask instead of the new Win11 API), it was overriding previous SetProcessAffinityMask calls from the GUIs. It was accidental behaviour: it was overriding affinity, effectively going against the principle of child processes inheriting affinity, which we have now fixed.

jdp1024 commented 4 months ago

@Sopel97 i haven't followed this in depth, do you mind explaining why the new numa logic fails for the cb gui while the old logic worked? Can't we fall back to the old logic in case the numa policy is set to none?

The old logic was to force-bind all threads, spread out over all NUMA nodes, when there were more than 8 threads. And since it happened to use the same mechanism (SetThreadAffinityMask instead of the new Win11 API), it was overriding previous SetProcessAffinityMask calls from the GUIs. It was accidental behaviour: it was overriding affinity, effectively going against the principle of child processes inheriting affinity, which we have now fixed.

CB uses SetProcessAffinityMask for a "higher-resolution" timer, but I don't think it really needs one.