debrouxl commented 2 years ago

On an old quad-socket Opteron 6272 server (4x16 cores but every pair of cores shares a FPU), which produced several corrected ECC errors overnight, I tried to use PCMemTest 1.5, under the form of the recently added Debian experimental package, in legacy BIOS mode, and forcing SMP enable at start. Well, pcmemtest had worked (UP and SMP modes) on several 32-bit-only computers recently, but not on this computer :)

first of all, PCMemTest tried to display IDs and state for all 64 cores, thereby writing over the information on the right side of the same lines:
then, PCMemTest didn't make any visible progress for several minutes (enough for me to wait a bit, then go pick up a crappy external USB camera, put it on top of a pile of stuff in the front of the screen, and take the above picture), before I rebooted the server by pressing Esc.
after disabling half of the cores in the BIOS, the lack of visible progress for a while still happens: Before taking this second picture, I used the menu to successfully switch to sequential mode, it didn't help. After taking a picture, I rebooted the server through Esc again: it looks like pcmemtest is not completely hung, it just doesn't start up in a reasonable amount of time.

I can test the latest Git version of pcmemtest later, but for now, memtest86+ 5.31b (again under the form of a Debian experimental package, which contains a bugfix over upstream memtest86+ 5.31b to avoid memory corruption and often a crash on most computers equipped with > 16 RAM slots...) has neither bug: after forcing SMP mode as well, the memory test started properly in 32-core mode: memtest86+_5 31b_ok_smp32

debrouxl commented 2 years ago

In 32-core BIOS setting, a build of current pcmemtest Git HEAD ( 8509e3320aa5f4b9770edae88691c20b4957bc50 ) eventually starts testing in SMP mode, after maybe 5 minutes of waiting in the state shown by the second picture. Both the temperature of the hot air at the back of the server and the fan speed and therefore noise rising even higher indicate that the cores are working / busy waiting :) So maybe you fixed something between pcmemtest 1.5 and current Git HEAD, or maybe I didn't wait for long enough before taking the second picture above. Still - memtest86+ waits for a much shorter period before starting the test.

BTW: current pcmemtest successfully detects and displays multiple OHCI and EHCI USB controllers, as well as the USB keyboard (Logitech K120). However, said keyboard doesn't seem to be powered up: keypresses are not recognized, and the Num Lock LED doesn't light up when pressing the Num Lock button. Fortunately, this server model still has a PS/2 keyboard port, so I could get past the USB enumeration screen and start the test, whereas on another server model, namely Dell PowerEdge R815, which only has USB ports, I can't get past the USB enumeration screen.

martinwhitaker commented 2 years ago

The display corruption. Yes, that's something I need to sort out. Maybe I should set the maximum number of cores used back to 32 for now.
Just to confirm - the long delay on startup only happens when you enable SMP?

If you enable SMP then use the menu to enable only two CPU cores, does that reduce the delay? I expect there to be a delay proportional to the number of CPU cores enabled, but on my machines it is a few seconds, not minutes.

If you don't press a key at the initial startup prompt (not the USB keyboard check) but wait for it to time out, how long does it take to start the tests? (it should be about 5 seconds)

Keyboard LEDs are turned on and off by the host machine software, not the keyboard itself, so aren't a good indication of whether the keyboard is powered. PCMemTest only checks for key presses, so I don't expect the LEDs to work.

USB keyboard support is still work in progress, although having said that, it works on all 7 machines (of varying vintages) I have here to test on. I only have two models of USB keyboard though, both Logitech wireless (one K260, two K270).

Were you using the same K120 keyboard with your Dell server or a different one?

Note that if you are using a bootloader (e.g. GRUB) to start PCMemTest, passing "keyboard=legacy" on the boot command line should skip the check for USB keyboards, which will allow you to use your USB keyboard if your BIOS supports USB legacy keyboard emulation.

debrouxl commented 2 years ago

looking forward, setting the maximum number of used cores to 32 doesn't seem to be a good idea for any memory testing software, considering that as we both know, some recent processor models from both Intel and AMD have > 32 cores, and 96-core EPYC processors are supposedly coming out this year. But in the meantime, the screen corruption is ugly, so I thought I'd report it alongside the excessive startup delay :)
yes, the long startup delay only happens when enabling SMP; the test starts pretty much immediately after the initial F1 / F2 SMP prompt times out. If I enable SMP first, selecting fewer cores later through the menu doesn't fix the lengthy startup delay, whether I select SEQ or RR mode.
I'm using the same K120 with both the Asus RS904A-E6/PS4 and the Dell PowerEdge R815 servers, and an unremarkable PS/2 keyboard for the former. The former supports USB legacy keyboard emulation, and passing keyboard=legacy to pcmemtest enables using the USB keyboard, as should happen.

martinwhitaker commented 2 years ago

What I was intending to do was to limit the number of enabled/active CPUs to the number that could be be displayed (32). By default the first 32 cores would be enabled, but you could select a different subset in the config menu. But I haven't got round to implementing that, mostly because I don't have any machines with that number of cores.
Could you try setting TRACE_BARRIERS to 1 (near the top of app/main.c), then in the config menu enabling Boot trace and seeing what trace messages are output.
If you can lay your hands on another USB keyboard, it would be good to know whether the problem is specific to the K120.

debrouxl commented 2 years ago

even for displaying purposes, I think that all memtest86+-derived programs will eventually have to find a way to display status for > 32 cores, considering the huge core counts that recent 2S/4S/8S computers have :)
& 3. alright, I'll do that another day. I can borrow two other USB keyboards of different models easily.

BTW, the USB controllers are as follows:

ASUS RS904A-E6/PS4:

root@rs904a-1:~# lspci | grep USB
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:12.1 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller
00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:13.1 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller
00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
root@rs904a-1:~# lspci -n | grep 00:1[234]
00:12.0 0c03: 1002:4397
00:12.1 0c03: 1002:4398
00:12.2 0c03: 1002:4396
00:13.0 0c03: 1002:4397
00:13.1 0c03: 1002:4398
00:13.2 0c03: 1002:4396
00:14.5 0c03: 1002:4399

Dell PowerEdge R815: same, minus the OHCI2 controller (1002:4399).

debrouxl commented 2 years ago

On the ASUS RS904A-E6/PS4:

I gathered the barrier trace you requested, after pressing F2 F1 F7 F10 during the initial timeout to enable SMP and boot trace. It reads:

0  pm 00000000 - 00000092
0  pm 00000100 - 000d3d20
0  pm 000d7d90 - 000d7db4
0  pm 00100000 - 04027000
0  ACPI RDSP found in BIOS reserved data at 000fb690
1  Start barrier wait at ../app/main.c line 390
2  Start barrier wait at ../app/main.c line 390
3  Start barrier wait at ../app/main.c line 390
4  Start barrier wait at ../app/main.c line 390
5  Start barrier wait at ../app/main.c line 390
6  Start barrier wait at ../app/main.c line 390
...
(these barrier traces appear at a rate of several per second, not always in numerical or same order)
31  Start barrier wait at ../app/main.c line 390
(then, the computer waits here for several minutes, as usual, before displaying the "<Enter> Single step" mention)
20  Start barrier wait at ../app/main.c line 407
5  Start barrier wait at ../app/main.c line 407
...
20  Start barrier wait at ../app/main.c line 407
(at this stage, after a dozen traces or so, "Test #0   [Address test, walking ones, no cache]" appears on the top right pane)
23  Start barrier wait at ../app/main.c line 448
(more of the same)
1  Start barrier wait at ../app/main.c line 448
(at this stage, the CPU state appears all W, or sometimes all W but a blank at some ID; another press on Enter and...)
14  Start barrier wait at ../app/main.c line 281
...
12  Start barrier wait at ../app/main.c line 281
12  Start barrier wait at ../app/main.c line 297
...
14  Start barrier wait at ../app/main.c line 297
14  Start barrier wait at ../app/main.c line 326
...
12  Start barrier wait at ../app/main.c line 326
(at this stage, "Testing: 0KB - 584 KB    584 KB of 255 GB" + "Pattern: 0xffffffff" + "Time: 0:0x:xx" appear)
12  Start barrier wait at ../app/main.c line 281
...
0  Start barrier wait at ../app/main.c line 297
...
0  Start barrier wait at ../app/main.c line 326
...
<repeat>

The line numbers ought to match the Git HEAD version: I built pcmemtest by using the Debian experimental pcmemtest 1.5 source package as a starting point, and replacing the source code by the Git HEAD version.

while the USB controllers are detected and enumerated properly as previously mentioned, when not relying on USB legacy emulation, no USB keyboard is detected at all, be it the previously mentioned Logitech K120 QWERTY keyboard or a HP KU-0316 AZERTY keyboard. I think that the end of my second comment in this issue described the Dell PowerEdge R815's behaviour (USB keyboard detected but not working), not the ASUS RS904A-E6/PS4's behaviour.

martinwhitaker commented 2 years ago

Please try the latest code from git. It should now handle up to 256 cores, although I have no way of testing more than 16. I've also made a change which I hope will cure the long start-up time on your ASUS server.

I've not yet looked at the keyboard problem.

debrouxl commented 2 years ago

Test results for a build of ab278566e4bc62215606b3654ca8708d77d9947d :

on the ASUS RS904A in BIOS-configured 32 core, no SMT mode, your latest commits have reduced the start-up time (pcmemtest SMP mode, as usual) to less than 30s... though it's not great yet (over an order of magnitude longer than memtest86+), it's much better indeed :) I let pcmemtest proceed for several minutes, with both single-core and all core tests seemingly working.
on the Dell Poweredge R815 in standard 64-thread SMT mode:
- pcmemtest started with the "keyboard=legacy smp" options displayed "CPU cores: 64 available, 64 enabled" for a split second... then the screen goes blank, and the server reboots. IOW, pcmemtest crashed. The behaviour is reproducible.
- pcmemtest started with the "keyboard=legacy" option started normally and survived until I pressed the computer's power several dozen seconds later.

EDIT: oh, and I forgot to state that

raising the core limit from 32 to 256 is a good result. Sure, 512 would be better (there are 8S-capable, 56-thread models in the Xeon Cascade Lake series, and AMD's 192-thread EPYCs shall become available this year - I still hope that AMD will decide to make 4S-capable Zen4+ models, even), but I see that there are stack space constraints...
the new way to display core activity, besides being quite future-proof, makes sense to me.

Sorry.

martinwhitaker commented 2 years ago

Did the Dell R815 run OK with 64 threads before I made the recent changes? If so, could you try doubling the BSP_STACK_SIZE and AP_STACK_SIZE values (near the top of boot/boot.h) to see if that makes a difference. I did measure the stack usage on my machine before setting those values, and it ought to be deterministic, but that's the only change I made that I think would lead to a crash.

I'm working on speeding up the AP startup time. But the big problem with your ASUS machine is in the spinlocks used to synchronise the threads, which will be down to inter-socket cache coherency.

debrouxl commented 2 years ago

On the Dell R815, yesterday was the first time I ran a SMP test with pcmemtest > 1.5 beyond the USB enumeration screen, by passing "keyboard=legacy smp" to pcmemtest, so I have no comparison against 8509e3320aa5f4b9770edae88691c20b4957bc50 or older.

Shouldn't app/test.h::MAX_VCPUS be >= system/smp.h::MAX_PCPUS ?

I'll make fresh tests tomorrow evening, or later.

martinwhitaker commented 2 years ago

Ah, good catch! That's left over from the old way of handling more cores than could be displayed. I'll fix that.

debrouxl commented 2 years ago

These test results apply to a modified build of ab278566e4bc62215606b3654ca8708d77d9947d , tests made before 10 p.m. CET (I posted this message over 2 hours later).

On the PowerEdge R815, MAX_VCPUS = 256 and MAX_PCPUS = 256 crashes all the same. So do { MAX_VCPUS = 64, MAX_PCPUS = 64, AP_STACK_SIZE = 2048 } and { MAX_VCPUS = 32, MAX_PCPUS = 32 and AP_STACK_SIZE = 2048 }. I forgot to increase BSP_STACK_SIZE before packing up for tonight.

However, since I wanted to test MAX_VCPUS = 512, MAX_PCPUS = 512 and AP_STACK_SIZE=512, all of my builds from tonight also contain a non-default value for the config popup's POP_W, raised from 36 to 68, which impacted static uint16_t popup_save_buffer[POP_W * POP_H]. I'll cancel that later as well, but given that this change doesn't blow up the size of the BSS, I feel that it's unlikely to be an auxiliary crash cause.

debrouxl commented 2 years ago

Test results for a clean build of current HEAD 16d55b7dad4dc5c2483ded30d51d8b2f792a4d08 (no more config popup size change):

on the R815 in 64 core (SMT) mode, pcmemtest "keyboard=legacy smp" displays 3.48 GB of memory instead of the usual 255 GB, and the computer spontaneously reboots before displaying the physical and available core counts. Test repeated twice;
on the RS904 in 32 core (no SMT) mode, pcmemtest "smp" with a PS/2 keyboard displays the usual 255 GB of memory, and the tests start normally, a bit less slowly than before (still around 20s). F1 F5 F10 F1 F5 F10 to skip the two first tests works, the parallel test survives for several minutes - I'll update this comment after test number 4 is reached, which should take over a dozen more minutes, if progress is linear. [EDIT: eh, test number 3 reached the end of the RAM at "Test 50%" and switched to test number 4 immediately ?]

Could cache line alignment for each of start_barrier, run_barrier, start_mutex and error_mutex help, or does the code depend on these being very close to each other (IIUC smp.c) ? [EDIT: nope, doesn't seem to help start up time, or have obvious ill effects, either.]

martinwhitaker commented 2 years ago

Are you testing a 32-bit or a 64-bit build? Is this legacy or UEFI boot?

FWIW, I'm not seeing any problems when using qemu to emulate 64 cores.

There's another potential optimisation to come. The code does a dummy run through the tests to calculate the number of "ticks" a test produces (which is used to scale the progress bars). I'll try to do that before starting the APs.

Changing the barrier implementation so each core waits on a flag in a different cache line might also help.

Note that 256 cores is the limit when using the generic algorithm for starting the APs described in the Intel Multiprocessor specification - the APIC ID is only 8 bits. I've not researched how you are supposed to handle machines with more than 256 cores, and given that I'm never likely to get my hands on such a machine, it's not high on my priority list.

debrouxl commented 2 years ago

The Dell R815 uses UEFI boot and a 64-bit binary, while the RS904A only supports legacy boot and therefore uses a 32-bit pcmemtest binary.

Understood about the 256 core limit :) By the end of 2023, it should be possible to rent dedicated 2S EPYC Zen 4 computers with 96 cores / 192 threads each for a hourly or daily fee, and some of the companies offering such services should even provide KVM services, but that won't be particularly cheap. Looks like the solution to the 256 core limit issue is available from Nehalem onwards in Intel processors ("x2APIC", CPUID 1, ecx register, bit 21 & CPUID B & CPUID 1F preferred to B if available), and more or less differently on AMD processors (at least extended APIC space, with two CPUID-based methods depending on the generation, at least one of which is already available in my Opteron 6272; a Zen 3 processor I have access to supports both x2APIC and AMD's method).

martinwhitaker commented 2 years ago

The latest code should give a significant improvement in startup time on your ASUS machine. I've also made some changes that might fix the crash on the Dell machine. Needless to say, it works fine on my machines, but YMMV.

debrouxl commented 2 years ago

Test result for dcac5270687b5cd991f69ac44b9fbc78d6a5fbca :

on the R815, pcmemtest still crashes, but at a later stage than before: 64 / 64 cores is displayed again, and about a dozen seconds later, the server reboots. So that's progress :) I've been trying to add traces into the code ( 7acb81e302f932760ccdbccb20b570fbb1891249 , I see that you made a new commit since I fetched that) and filming the output to help debugging the issue:
```
diff --git a/app/config.c b/app/config.c
index 2052876..f522384 100644
--- a/app/config.c
+++ b/app/config.c
@@ -115,6 +115,10 @@ static void parse_option(const char *option, const char *params)
     pause_at_start = false;
 } else if (strncmp(option, "smp", 4) == 0) {
     smp_enabled = true;
```
} else if (strncmp(option, "trace", 5) == 0) {
enable_trace = true;
} else if (strncmp(option, "temperature", 11) == 0) {
enable_temperature = true; } }

diff --git a/app/display.c b/app/display.c index 7fe82f4..90ad9b8 100644 --- a/app/display.c +++ b/app/display.c @@ -20,6 +20,7 @@

include "error.h"

include "tests.h"

+#include "unistd.h"

include "display.h"

@@ -305,4 +306,5 @@ void do_trace(int my_cpu, const char *fmt, ...) vprintf(scroll_message_row, 4, fmt, args); spin_unlock(error_mutex); va_end(args);

usleep(200000); } diff --git a/app/main.c b/app/main.c index 1e63e85..bc7de95 100644 --- a/app/main.c +++ b/app/main.c @@ -220,7 +220,7 @@ static void global_init(void) if (enable_trace) { display_pinned_message(0, 0,"CPU Trace"); display_pinned_message(1, 0,"--- ----------------------------------------------------------------------------");
set_scroll_lock(true);
set_scroll_lock(false); }

size_t program_size = (_stacks - _start) + BSP_STACK_SIZE + (num_enabled_cpus - 1) * AP_STACK_SIZE; @@ -351,7 +351,7 @@ static void test_all_windows(int my_cpu) }
```
 if (i_am_master) {
```
trace(my_cpu, "start window %i", window_num);
//trace(my_cpu, "start window %i", window_num); switch (window_num) { case 0: window_start = 0; @@ -441,6 +441,7 @@ void main(void)

while (1) { BARRIER;
trace(my_cpu, "Main loop iteration"); if (my_cpu == 0) { if (start_run) { pass_num = 0; @@ -482,10 +483,12 @@ void main(void) rerun_test = false; } BARRIER;
trace(my_cpu, "Before test_all_windows"); if (test_list[test_num].enabled) { test_all_windows(my_cpu); } BARRIER;
trace(my_cpu, "After test_all_windows"); if (my_cpu != 0) { continue; } @@ -500,14 +503,17 @@ void main(void) continue; } error_update();
trace(my_cpu, "After error_update");
```
 if (test_list[test_num].enabled) {
```
trace(my_cpu, "Test enabled"); if (++test_stage < test_list[test_num].stages) { rerun_test = true; continue; } test_stage = 0;
trace(my_cpu, "Before switch"); switch (cpu_mode) { case PAR: if (test_list[test_num].cpu_mode == SEQ) { @@ -531,35 +537,43 @@ void main(void) default: break; }

trace(my_cpu, "After switch"); }

 if (dummy_run) {
     ticks_per_pass[pass_num] += ticks_per_test[pass_num][test_num];
 }

trace(my_cpu, "start_test"); start_test = true; test_num++; if (test_num < NUM_TEST_PATTERNS) { continue; }
trace(my_cpu, "pass_num"); pass_num++; if (dummy_run && pass_num == NUM_PASS_TYPES) { start_run = true; dummy_run = false;
trace(my_cpu, "Before barrier_init"); barrier_init(start_barrier, num_enabled_cpus);
trace(my_cpu, "After barrier_init"); int failed = smp_start(cpu_state);
trace(my_cpu, "After smp_start"); if (failed) { const char *message = "Failed to start CPU core %i. Press any key to reboot..."; display_notice_with_args(strlen(message), message, failed); while (get_key() == 0) { } reboot(); }
trace(my_cpu, "Not failed"); init_state = 2; BARRIER; continue; }
trace(my_cpu, "start_pass"); start_pass = true; if (!dummy_run) { display_pass_count(pass_num);
```
but this hasn't produced a result yet...
```
I probably can't test on the RS904A today, it's still running memtes86+ for testing a batch of RAM sticks.

debrouxl commented 2 years ago

+            trace(my_cpu, "Before barrier_init");
             barrier_init(start_barrier, num_enabled_cpus);
+            trace(my_cpu, "After barrier_init");
             int failed = smp_start(cpu_state);
+            trace(my_cpu, "After smp_start");

Before the reboot, the video capture shows the two first added traces, but not the third one. This probably confirms that something crashes when enabling the SMP mode. No more tests until tomorrow evening at best.

martinwhitaker commented 2 years ago

Maybe the BIOS has programmed the local APIC to operate in x2APIC mode

debrouxl commented 2 years ago

ISTR that on both the R815 and RS904A, /proc/cpuinfo shows APIC IDs below 256 for all processors. The 6272 Bulldozer and 6378 Piledriver Opterons have support for AMD Extended APIC space but do not advertise the x2APIC bit in CPUID 1 ecx. AMD processors from Bulldozer onwards are supposed to set CPUID 80000001 ecx bit 22 (topology extensions) and provide the 32-bit extended APIC ID in CPUID 8000001E eax.

debrouxl commented 2 years ago

R815 test result for 13efd906f101a5758235145c46ab445a3656d7c3 modified as diffed below to add traces and align allocations in the SMP page (the externally observable behaviour without this alignment code is the same with or without the alignment anyway): the computer crashes when starting the first AP, after sending the first startup IPI. https://user-images.githubusercontent.com/8375/152431250-983e7afb-b2a1-4eac-8a56-105a1e15c72d.mp4 .

diff --git a/app/config.c b/app/config.c
index 2052876..f522384 100644
--- a/app/config.c
+++ b/app/config.c
@@ -115,6 +115,10 @@ static void parse_option(const char *option, const char *params)
         pause_at_start = false;
     } else if (strncmp(option, "smp", 4) == 0) {
         smp_enabled = true;
+    } else if (strncmp(option, "trace", 5) == 0) {
+        enable_trace = true;
+    } else if (strncmp(option, "temperature", 11) == 0) {
+        enable_temperature = true;
     }
 }

diff --git a/app/display.c b/app/display.c
index 7fe82f4..94ac744 100644
--- a/app/display.c
+++ b/app/display.c
@@ -20,6 +20,7 @@
 #include "error.h"

 #include "tests.h"
+#include "unistd.h"

 #include "display.h"

@@ -85,7 +86,7 @@ void display_init(void)
     prints(5, 0, "Memory  : N/A               | Pattern: ");
     prints(6, 0, "-------------------------------------------------------------------------------");
     prints(7, 0, "CPU cores:     available,     enabled  | Time:            Temperature: N/A ");
-    prints(8, 0, "Run mode : PAR    Using:               | Pass:            Errors: ");
+    prints(8, 0, "Run mode : GAR    Using:               | Pass:            Errors: ");
     prints(9, 0, "-------------------------------------------------------------------------------");

     // Redraw lines using box drawing characters.
@@ -305,4 +306,5 @@ void do_trace(int my_cpu, const char *fmt, ...)
     vprintf(scroll_message_row, 4, fmt, args);
     spin_unlock(error_mutex);
     va_end(args);
+    usleep(200000);
 }
diff --git a/app/main.c b/app/main.c
index cb1263d..9f76029 100644
--- a/app/main.c
+++ b/app/main.c
@@ -220,7 +220,7 @@ static void global_init(void)
     if (enable_trace) {
         display_pinned_message(0, 0,"CPU Trace");
         display_pinned_message(1, 0,"--- ----------------------------------------------------------------------------");
-        set_scroll_lock(true);
+        set_scroll_lock(false);
     }

     size_t program_size = (_stacks - _start) + BSP_STACK_SIZE + (num_enabled_cpus - 1) * AP_STACK_SIZE;
@@ -351,7 +351,7 @@ static void test_all_windows(int my_cpu)
         }

         if (i_am_master) {
-            trace(my_cpu, "start window %i", window_num);
+            //trace(my_cpu, "start window %i", window_num);
             switch (window_num) {
               case 0:
                 window_start = 0;
@@ -448,6 +448,7 @@ void main(void)

     while (1) {
         BARRIER;
+        //trace(my_cpu, "Main loop iteration");
         if (my_cpu == 0) {
             if (start_run) {
                 pass_num = 0;
@@ -489,10 +490,12 @@ void main(void)
             rerun_test = false;
         }
         BARRIER;
+        //trace(my_cpu, "Before test_all_windows");
         if (test_list[test_num].enabled) {
             test_all_windows(my_cpu);
         }
         BARRIER;
+        //trace(my_cpu, "After test_all_windows");
         if (my_cpu != 0) {
             continue;
         }
@@ -507,14 +510,17 @@ void main(void)
             continue;
         }
         error_update();
+        //trace(my_cpu, "After error_update");

         if (test_list[test_num].enabled) {
+            //trace(my_cpu, "Test enabled");
             if (++test_stage < test_list[test_num].stages) {
                 rerun_test = true;
                 continue;
             }
             test_stage = 0;

+            //trace(my_cpu, "Before switch");
             switch (cpu_mode) {
               case PAR:
                 if (test_list[test_num].cpu_mode == SEQ) {
@@ -538,35 +544,43 @@ void main(void)
               default:
                 break;
             }
+            //trace(my_cpu, "After switch");
         }

         if (dummy_run) {
             ticks_per_pass[pass_num] += ticks_per_test[pass_num][test_num];
         }

+        trace(my_cpu, "start_test");
         start_test = true;
         test_num++;
         if (test_num < NUM_TEST_PATTERNS) {
             continue;
         }

+        trace(my_cpu, "pass_num");
         pass_num++;
         if (dummy_run && pass_num == NUM_PASS_TYPES) {
             start_run = true;
             dummy_run = false;
+            trace(my_cpu, "Before barrier_init");
             barrier_init(start_barrier, num_enabled_cpus);
+            trace(my_cpu, "After barrier_init");
             int failed = smp_start(cpu_state);
+            trace(my_cpu, "After smp_start");
             if (failed) {
                 const char *message = "Failed to start CPU core %i. Press any key to reboot...";
                 display_notice_with_args(strlen(message), message, failed);
                 while (get_key() == 0) { }
                 reboot();
             }
+            trace(my_cpu, "Not failed");
             init_state = 2;
             BARRIER;
             continue;
         }

+        trace(my_cpu, "start_pass");
         start_pass = true;
         if (!dummy_run) {
             display_pass_count(pass_num);
diff --git a/system/smp.c b/system/smp.c
index c9aa869..0d4723f 100644
--- a/system/smp.c
+++ b/system/smp.c
@@ -25,6 +25,8 @@
 #include "unistd.h"

 #include "smp.h"
+#include "config.h"
+#include "display.h"

 //------------------------------------------------------------------------------
 // Constants
@@ -122,6 +124,8 @@

 #define AP_TRAMPOLINE_PAGE          (smp_heap_page)

+#define CACHE_LINE_SIZE             (2 * 64)
+
 //------------------------------------------------------------------------------
 // Types
 //------------------------------------------------------------------------------
@@ -618,24 +622,32 @@ static bool start_cpu(int cpu_num)
     int apic_id = cpu_num_to_apic_id[cpu_num];

     // Pulse the INIT IPI.
+    trace(0, "Will send INIT IPI for CPU %i apic_id %i", cpu_num, apic_id);
     apic_write(APICR_ESR, 0);
     if (!send_ipi(apic_id, APIC_TRIGGER_LEVEL, 1, APIC_DELMODE_INIT, 0)) {
+        trace(0, "Failed to send INIT IPI 1 for CPU %i", cpu_num);
         return false;
     }
     if (!send_ipi(apic_id, APIC_TRIGGER_LEVEL, 0, APIC_DELMODE_INIT, 0)) {
+        trace(0, "Failed to send INIT IPI 0 for CPU %i", cpu_num);
         return false;
     }
+    trace(0, "Sent INIT IPIs for CPU %i", cpu_num);

     usleep(10000);

     // Send two STARTUP_IPIs.
     for (int num_sipi = 0; num_sipi < 2; num_sipi++) {
+        trace(0, "Sending STARTUP IPI %i for CPU %i", num_sipi, cpu_num);
         apic_write(APICR_ESR, 0);
         if (!send_ipi(apic_id, 0, 0, APIC_DELMODE_STARTUP, AP_TRAMPOLINE_PAGE)) {
+            trace(0, "Failed to send STARTUP IPI %i for CPU %i", num_sipi, cpu_num);
             return false;
         }
+        trace(0, "Sent STARTUP IPI %i for CPU %i", num_sipi, cpu_num);
         usleep(200);
     }
+    trace(0, "Has started CPU %i", cpu_num);

     return true;
 }
@@ -673,7 +685,7 @@ void smp_init(bool smp_enable)
     size_t ap_trampoline_size = ap_trampoline_end - ap_trampoline;
     memcpy((uint8_t *)HEAP_BASE_ADDR, ap_trampoline, ap_trampoline_size);

-    alloc_addr = HEAP_BASE_ADDR + ap_trampoline_size;
+    alloc_addr = (HEAP_BASE_ADDR + ap_trampoline_size + CACHE_LINE_SIZE) & ~((uintptr_t)CACHE_LINE_SIZE - 1);
 }

 int smp_start(cpu_state_t cpu_state[MAX_CPUS])
@@ -684,23 +696,29 @@ int smp_start(cpu_state_t cpu_state[MAX_CPUS])

     for (cpu_num = 1; cpu_num < num_available_cpus; cpu_num++) {
         if (cpu_state[cpu_num] == CPU_STATE_ENABLED) {
+            trace(0, "Will start CPU %i", cpu_num);
             if (!start_cpu(cpu_num)) {
+                trace(0, "Failed to start CPU %i", cpu_num);
                 return cpu_num;
             }
+            trace(0, "Started CPU %i", cpu_num);
         }
     }

+    trace(0, "timeout");
     int timeout = 100000;
     while (timeout > 0) {
         for (cpu_num = 1; cpu_num < num_available_cpus; cpu_num++) {
             if (cpu_state[cpu_num] == CPU_STATE_ENABLED) break;
         }
         if (cpu_num == num_available_cpus) {
+            trace(0, "return 0");
             return 0;
         }
         usleep(10);
         timeout--;
     }
+    trace(0, "return cpu_num %i", cpu_num);
     return cpu_num;
 }

@@ -712,7 +730,7 @@ int smp_my_cpu_num(void)
 barrier_t *smp_alloc_barrier(int num_threads)
 {
     barrier_t *barrier = (barrier_t  *)(alloc_addr);
-    alloc_addr += sizeof(barrier_t);
+    alloc_addr += CACHE_LINE_SIZE;
     barrier_init(barrier, num_threads);
     return barrier;
 }
@@ -720,7 +738,7 @@ barrier_t *smp_alloc_barrier(int num_threads)
 spinlock_t *smp_alloc_mutex()
 {
     spinlock_t *mutex = (spinlock_t *)(alloc_addr);
-    alloc_addr += sizeof(spinlock_t);
+    alloc_addr += CACHE_LINE_SIZE;
     spin_unlock(mutex);
     return mutex;
 }

The "PAR" -> "*AR" string change, with first letter incremented every time, is an easy way for me to check that I installed the latest version before rebooting the computer...

martinwhitaker commented 2 years ago

I've been looking at the Linux code that starts the APs, and that does a bit more than the "universal algorithm". I'll try to make PCMemTest do the same, but probably won't have time until the weekend.

debrouxl commented 2 years ago

Alright. I recently saw that the Linux code for starting the APs is currently undergoing optimization work, providing interesting speedups for larger computers: https://lwn.net/Articles/883466/ .

martinwhitaker commented 2 years ago

At a quick glance, that looks to be the same optimisation I did in commit 17093a9.

debrouxl commented 2 years ago

Good, then.

I found that pcmemtest 13efd906f101a5758235145c46ab445a3656d7c3 reliably explodes at start (enters app/interrupt.c::interrupt()) in SMP mode on one of the other computers I have access to, and I didn't notice that behaviour with pcmemtest 1.5. Maybe it was just luck not to hit it with 1.5 (say, if the problem happens randomly), but I'll make sure to test your improved SMP startup code on that computer as well, since it currently exhibits improper behaviour :)

That computer is probably the most powerful 32-bit-only computer I have access to, an IBM xSeries 345 server with 2 x Socket 604 single core HT-capable P4 Xeon "Gallatin" and DDR1 RAM. No recent CPUID features or x2APIC there.

martinwhitaker commented 2 years ago

Please try the latest code. I've made the code that starts the APs match what's done in Linux. I've also fixed another bug introduced in my efforts to eliminate the startup delay.

debrouxl commented 2 years ago

Test results for 3028cd31f99e676f1eba9df50b62e5ed50c4605d + my two-liner for enabling trace at startup, I made sure that I'm using a build from tonight (I temporarily removed the "PAR" -> "*AR" change):

the R815 (UEFI) still reboots a dozen seconds after displaying 64 available / 64 enabled cores;
the RS904A (legacy BIOS) starts the test alright, in 32 / 32 core mode as usual, with a delay of a bit less than 20s. I let the test run until test number 1 (Address test, own address in window) reached the second core, which took 4+ minutes; then, I skipped to test number 3, and let it run for a dozen GBs, without experiencing improper behaviour.

Sorry, it looks like I can only make a single round of tests on that pair of computers tonight. The RAID array of the R815 seems to be re-synchronizing (at the OS level, zero OS disk activity at rest, and slow disk performance when doing a bit of IO, whereas the HDD LEDs have been permanently lit for a couple dozen minutes), otherwise I could have made a second one with my re-enabled tracing code.

I can still try to make a test on the xSeries 345 later tonight, though.

debrouxl commented 2 years ago

A build of 3028cd31f99e676f1eba9df50b62e5ed50c4605d keeps crashing on the xSeries 345 as well, entering interrupt() from a random core at a random interval within several seconds of the memory test starting. I mean that I've seen crashes during test numbers 0, 1 and 3.

The sequence of bytes printed by the interrupt handler is 83 c1 08 eb e2 5b 5e 5f c3 56 31 c0; there's a single occurrence of that sequence in the binary, which immediately follows a cc byte - an unmistakable instruction. Examining the ELF version of the code shows that this sequence of bytes belongs to the end of do_relocations() and the beginning of reloc(); the source code shows that this comes from the assert in do_relocation(), which is getting inlined into do_relocations().

I could buy that GCC 11 has decided to produce a new relocation type whose handling needs to be added, but I can't buy that relocations magically and randomly appear during execution. I have a hard time convincing myself that the RAM has become faulty: a recent memtest86+ run hasn't detected relevant issues (I've launched a fresh SMP one nevertheless, it will probably hit the usual memtest86+ SMP bugs but that doesn't count), and neither various versions of Linux over time, nor an occasional FreeDOS boot a while ago to install FreeDOS and proceed to upgrade the BIOS, have crashed.

I think that something is corrupting memory somehow. Possibly relatedly, I remember that Linux has recently simply quit using the first megabyte of RAM to avoid some BIOS intereference ( https://www.phoronix.com/scan.php?page=news_item&px=Linux-Always-Reserve-1MB , etc.).

martinwhitaker commented 2 years ago

As you didn't see the problem on the xSeries 345 with pcmemtest 1.5, I would assume it's one of my recent changes that has caused it. Could you try this modification:

--- a/app/main.c
+++ b/app/main.c
@@ -150,7 +150,7 @@ static bool set_load_addr(uintptr_t *load_addr, size_t program_size, uintptr_t l
     for (int i = 0; i < pm_map_size; i++) {
         uintptr_t try_start = pm_map[i].start << PAGE_SHIFT;
         uintptr_t try_limit = pm_map[i].end   << PAGE_SHIFT;
-        if (try_start == 0) try_start = 0x1000;
+        if (try_start < 0x10000) try_start = 0x10000;
         uintptr_t try_end   = try_start + program_size;
         if (try_end > try_limit) continue;

debrouxl commented 2 years ago

Hmm. More testing warranted on the xSeries 345, but I won't be able to do it tonight...

3028cd31f99e676f1eba9df50b62e5ed50c4605d + your suggested fix has survived three times in a row, which is better than I could obtain on 13efd906f101a5758235145c46ab445a3656d7c3. A multi-bit error sprang out at pass 4, after over 2 hours of testing, at an address a bit beyond 0x100000, but that's not very believable given that the previous passes with pcmemtest, as well as similarly long passes with memtest86+ earlier today, didn't produce that;
I can confirm that 13efd906f101a5758235145c46ab445a3656d7c3 explodes (hangs, enters interrupt handler);
I'll need to test 3028cd31f99e676f1eba9df50b62e5ed50c4605d again without your suggested fix.

debrouxl commented 2 years ago

The sole 9769990679ab1200fe7563b066a120187f2fb5da bugfix to 13efd906f101a5758235145c46ab445a3656d7c3 has survived fo minutes 4 times in a row on the xSeries 345. Therefore, it looks like my tests from two days ago weren't actually on 3028cd31f99e676f1eba9df50b62e5ed50c4605d , e.g. because I did not properly update the source tree or something silly along those lines... sorry. I'll test unpatched 3028cd31f99e676f1eba9df50b62e5ed50c4605d on the xSeries 345 another day.

martinwhitaker commented 2 years ago

OK, thanks for continuing to test this. Long term I plan to stop using low memory anyway (other than testing it), as that's one of the blockers for supporting more than 256 cores.

debrouxl commented 2 years ago

A proper build of 3028cd31f99e676f1eba9df50b62e5ed50c4605d has survived boot and a bit of test 4 times in a row on the xSeries 345. I let one of the runs proceed for a while longer, it hung after ~40' during the second pass at test number 10; that's probably not reproducible.

I'll attempt to add another HDD into the PowerEdge R815, switch it to legacy BIOS mode, install a legacy BIOS distro on it, and test pcmemtest in legacy BIOS mode: this way, I'll see whether it crashes the same way as the UEFI mode pcmemtest.

debrouxl commented 2 years ago

Aha... in legacy BIOS mode, with the usual "keyboard=legacy smp" command line parameters, pcmemtest 3028cd31f99e676f1eba9df50b62e5ed50c4605d ("PAR" customized to "AAR") starts properly, about a dozen seconds after the "64 available, 64 enabled" message appears, and survives 5+ minutes (the time to write this followup) on the R815. I can use the keyboard to enter the menu, skip tests, and reboot the computer. I've triggered a reboot, will reconfigure to UEFI mode and test again. EDIT: as expected, the R815 in UEFI mode reboots as usual, ~8s after the "64 available, 64 enabled" message appears.

martinwhitaker commented 2 years ago

As you are passing options on the boot command line, I assume you are using an intermediate bootloader to start pcmemtest. Is that GRUB, and if so, is it an unmodified upstream version, or has it been patched to use the Linux EFI stub loader? (Fedora and Mageia do this, I don't know about other distributions)

debrouxl commented 2 years ago

Yup, that's GRUB, like on nearly any modern Linux distro:

# dpkg -l | grep grub
ii  grub-common                             2.06-2                            amd64        GRand Unified Bootloader (common files)
ii  grub-efi-amd64                          2.06-2                            amd64        GRand Unified Bootloader, version 2 (EFI-AMD64 version)
ii  grub-efi-amd64-bin                      2.06-2                            amd64        GRand Unified Bootloader, version 2 (EFI-AMD64 modules)
ii  grub-efi-amd64-signed                   1+2.06+2                          amd64        GRand Unified Bootloader, version 2 (amd64 UEFI signed by Debian)
ii  grub2-common                            2.06-2                            amd64        GRand Unified Bootloader (common files for version 2)

The corresponding megabyte-sized (!) Debian packaging XZ tarball ( http://deb.debian.org/debian/pool/main/g/grub2/grub2_2.06-2.debian.tar.xz ) contains lots of things, among which debian/patches/linuxefi.patch. And anyway, the GRUB menu entry for launching pcmemtest is as follows:

menuentry "Memory test (pcmemtest)" {
        insmod part_gpt
        insmod ext2
        set root='hd0,gpt2'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt2 --hint-efi=hd0,gpt2 --hint-baremetal=ahci0,gpt2  <UUID>
        else
          search --no-floppy --fs-uuid --set=root <UUID>
        fi
        linuxefi        /pcmemtestx64.efi
}

martinwhitaker commented 2 years ago

Well, with Mageia we have the choice of GRUB or rEFInd (my doing - I consider rEFInd to be far superior).

Thankfully Debian haven't obscured the issue (unlike Fedora, who have removed the upstream "linux" command and made "linux" an alias of "linuxefi"). So this tells me which of the multiple paths through the PCMemTest boot code you are taking. I'm guessing the problem lies in translating the UEFI memory map passed to PCMemTest into the set of memory regions it is allowed to use. I've spotted one thing wrong in that code and pushed a fix for it. It may or may not help - it hasn't made a difference on any of my machines, which are mostly UEFI. If it doesn't help, could you enable tracing and capture the "pm" trace info displayed, and for comparison provide the e820 map info reported by Linux in the system log on that machine.

Other changes I've pushed include tracing for AP startup. On one of your machines where it takes many seconds for PCMemTest to start testing, could you enable tracing, and see where the time is being spent. If you disable scroll lock at the start, it will run through to where all the APs have started and then re-enable scroll lock. At that point it should immediately start testing when you release the scroll lock again.

There's also a change that halts CPU cores when they are not active in a test. This may make the sequential tests run a bit slower, but saves considerable power. It can be disabled by adding the "nohalt" option on the boot command line.

debrouxl commented 2 years ago

I'm sorry, but the latest e820 fix @ f4a413cf6df7dcfcbbee49b7ee8738302ebc3b80 doesn't help, the R815 keeps rebooting at the same spot as before... The pm info as seen by pcmemtest (as usual, when force enabling trace at boot, since they keyboard does not work) is:

pm     10 -      9f
pm    100 -   dae5a
pm  db05a -   db159
pm  db15e -   df222
pm  df252 -   df279
pm  df68f -   df6ce
pm 100000 - 401f000
RSDP db15d000

whereas the Linux kernel log contains:

BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000010000-0x000000000009ffff] usable
BIOS-e820: [mem 0x0000000000100000-0x00000000dae59fff] usable
BIOS-e820: [mem 0x00000000dae5a000-0x00000000daf59fff] type 20
BIOS-e820: [mem 0x00000000daf5a000-0x00000000db059fff] reserved
BIOS-e820: [mem 0x00000000db05a000-0x00000000db158fff] usable
BIOS-e820: [mem 0x00000000db159000-0x00000000db15dfff] ACPI NVS
BIOS-e820: [mem 0x00000000db15e000-0x00000000db1ddfff] ACPI data
BIOS-e820: [mem 0x00000000db1de000-0x00000000df221fff] usable
BIOS-e820: [mem 0x00000000df252000-0x00000000df278fff] usable
BIOS-e820: [mem 0x00000000df679000-0x00000000df68efff] reserved
BIOS-e820: [mem 0x00000000df68f000-0x00000000df6cdfff] ACPI data
BIOS-e820: [mem 0x00000000df6ce000-0x00000000dfffffff] reserved
BIOS-e820: [mem 0x00000000f0000000-0x00000000f3ffffff] reserved
BIOS-e820: [mem 0x00000000fe000000-0x00000000fec8ffff] reserved
BIOS-e820: [mem 0x00000000fec94000-0x00000000feccffff] reserved
BIOS-e820: [mem 0x00000000fecd4000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000401effffff] usable

IOW, pcmemtest considers ACPI data usable, which matches the E820_ACPI = 3, // usable as RAM once ACPI tables have been read comment in boot/bootparams.h.

Now to test AP tracing on a RS904A...

martinwhitaker commented 2 years ago

That's OK, I didn't think it was a likely fix, but it was worth a try.

ACPI data regions should be usable once you no longer need the ACPI tables. I'll have to review the code to see if there's any way it could be used before smp_init() is called, although if you could try the following patch, that ought to show if it is the problem.

--- a/boot/efisetup.c
+++ b/boot/efisetup.c
@@ -490,7 +490,7 @@ static void set_e820_map(boot_params_t *params)
         e820_type_t e820_type = E820_RESERVED;
         switch (mem_desc->type) {
           case EFI_ACPI_RECLAIM_MEMORY:
-            e820_type = E820_ACPI;
+//            e820_type = E820_ACPI;
             break;
           case EFI_LOADER_CODE:
           case EFI_LOADER_DATA:

debrouxl commented 2 years ago

Ah, I see your message too late to be able to perform another test tonight. Before switching to the RS904A, I had made one last test commenting out the E820_ACPI handling in https://github.com/martinwhitaker/pcmemtest/blob/master/system/pmem.c#L198 : it doesn't help.

On the RS904A in 32 core mode, the new tracing code shows that the cause of delay at startup is overwhelmingly test 0. There are delays of multiple seconds after both the first and the second occurrence of the corresponding start trace; when releasing the scroll lock at the end of the AP startup, test 0 does indeed start immediately, with a third occurrence of the start trace, but that one causes no delay.

martinwhitaker commented 2 years ago

Thanks. The test you did would achieve the same result as my patch, so no need for another test, So time to think again.

And I now realise why the startup is slow, Test 0 is a test with the cache disabled - and I've neglected to bypass that when doing the dummy runs!

debrouxl commented 2 years ago

Woo, I can confirm that the startup slowness is now fixed on the RS904A, be it in 32-core or in 64-core mode. Good job :) The "32 available, 32 enabled" / "64 available, 64 enabled" appears ~3s after the rest of the UI appears, i.e. pretty much the initial delay. That's similar to memtest86+.

Given that both of the issues I reported in the first post are fixed, and that there are many comments here, I guess I should close this issue, and create new ones for:

the USB keyboards being unusable on both the RS904A and R815;
pcmemtest rebooting the R815.

WDYT ?

martinwhitaker commented 2 years ago

Thanks for confirming the fix.

I plan to migrate the codebase to https://github.com/memtest86plus this weekend. So, once that's done, I suggest opening new issues there for the remaining problems.

debrouxl commented 2 years ago

Alright, closing this issue then, although I don't see 165df9f8845b421c1e238bdd809036ab869be108 there (yet).

martinwhitaker commented 2 years ago

Good catch. Forgot to sync my desktop computer before doing the merge.

debrouxl commented 2 years ago

When testing the latest USB handover commit on top of pcmemtest on the RS904A, I noticed that the extreme startup slowness is back, in both 32 core and 64 core modes. After double-checking that 165df9f8845b421c1e238bdd809036ab869be108 was alright, I checked 18f27343b3f983e95addc15a6ed94f1566197ba1 , which is still alright, and 3327b70dc1119b25322778a73a5ab6579326fd85 , which shows the slow startup, at high CPU usage (fan rotation rate increasing, hot air flowing from the back of the server).

I understand that 3327b70dc1119b25322778a73a5ab6579326fd85 is a correctness fix, so part of it should be kept, but for me, it definitely is an important optimization making a major difference to the startup delay :)

martinwhitaker commented 2 years ago

If it's slow during startup, it must also be slow when running the tests - there are far more barriers when the tests are run for real. So we should try to fix the general problem rather than concentrate on startup time. It must be possible to make spinlocks and barriers work efficiently on your machine, otherwise Linux would grind to a halt. Maybe there's some extra bit of initialisation required on a multi-socket platform to get the inter-processor links working at full speed.

debrouxl commented 2 years ago

I don't think I ever completed a pcmemtest run in 64 core mode, but in 32 core mode, testing the 16 x 16 GB PC3-12800R sticks takes ~24-25h on the RS904A, though next week-end, I'd have to double-check whether that's the figure for pcmemtest or for memtest86+ 5.01/5.31. I recently tried pcmemtest UEFI on a Coffee Lake laptop with 2 x 8 GB DDR4-2400: the test completed in an hour or so with 6/6 cores (HT disabled from the BIOS). A day to go through 16 times more memory with lower bandwidth didn't seem really bad to me, though there are significantly more cores working on memory, indeed.

I had tried forcing cache line alignment for the barriers and mutexes several weeks ago, but it didn't seem to help startup time, and while I didn't notice a speed improvement during the test, I wasn't looking for it and I aborted the test quickly.

martinwhitaker commented 2 years ago

Could you try the following patch to see if it helps.

--- a/app/main.c
+++ b/app/main.c
@@ -478,6 +478,7 @@ void main(void)
             }
             init_state = 2;
         } else {
+            cache_on();
             trace(my_cpu, "AP started");
             cpu_state[my_cpu] = CPU_STATE_RUNNING;
             while (init_state < 2) {

debrouxl commented 2 years ago

Nope, sorry, it doesn't seem to help: ~60 seconds in 32 core mode, be it with or without that patch.

martinwhitaker / pcmemtest

Display corruption with 64 cores, and very slow SMP test start with 32+ cores #16

include "error.h"

include "tests.h"

include "display.h"