Open mateusz-bloch opened 6 months ago
It may be related to: https://github.com/phoenix-rtos/phoenix-rtos-project/issues/885
Why reopen? It wasn't related; this issue was caused by a kstack overflow on ia32.
-d cpu_reset to qemu arguments
Issue caught another time in the exit test; here is the output with -d cpu_reset enabled:
https://github.com/phoenix-rtos/phoenix-rtos-project/actions/runs/9072269435/job/24927640850
(psh)% /bin/test-libc-exit
Unity test run 1 of 1
Triple fault
CPU Reset (CPU 0)
EAX=00000013 EBX=000000c8 ECX=000000ba EDX=c011340e
ESI=00000000 EDI=00000000 EBP=000000c7 ESP=c02d0000
EIP=c0113409 EFL=00003046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
FS =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
GS =0033 00001004 bfffffff 00cbf300 DPL=3 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0028 c0147800 00000068 00008900 DPL=0 TSS32-avl
GDT= c0001000 000007ff
IDT= c0001800 000007ff
CR0=80000033 CR2=c02cfffc CR3=002e7000 CR4=00000010
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=0000000c CCD=00000000 CCO=LOGICL
EFER=0000000000000000
FCW=ff90 FSW=8184 [ST=0] FTW=ff MXCSR=00001f80
FPR0=00000004c031d354 4f84 FPR1=4f8800000004c02e c02e
FPR2=c01cb65000000004 5000 FPR3=0a0000000234c02e c031
FPR4=00000001c031d300 d0e0 FPR5=0000c01223f9c02d 0000
FPR6=0000329200000000 2390 FPR7=000100000000c012 0000
XMM00=0000000000000000 0000000000000000 XMM01=0000000000000000 0000000000000000
XMM02=0000000000000000 0000000000000000 XMM03=0000000000000000 0000000000000000
XMM04=0000000000000000 0000000000000000 XMM05=0000000000000000 0000000000000000
XMM06=0000000000000000 0000000000000000 XMM07=0000000000000000 0000000000000000
Triple fault
CPU Reset (CPU 0)
EAX=000f6106 EBX=000f3e0a ECX=00000000 EDX=00000cf9
ESI=00000000 EDI=00100000 EBP=00000000 ESP=00000fc8
EIP=000efb0a EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00cf9b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000f6180 00000037
IDT= 000f61be 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=000f61c8 CCD=00009e34 CCO=SUBL
EFER=0000000000000000
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=0000000000000000 0000000000000000 XMM01=0000000000000000 0000000000000000
XMM02=0000000000000000 0000000000000000 XMM03=0000000000000000 0000000000000000
XMM04=0000000000000000 0000000000000000 XMM05=0000000000000000 0000000000000000
XMM06=0000000000000000 0000000000000000 XMM07=0000000000000000 0000000000000000
SeaBIOS (version 1.15.0-1)
iPXE (https://ipxe.org/) 00:03.0 CA00 PCI2.10 PnP PMM+07F8B4A0+07ECB4A0 CA00
Press Ctrl-B to configure iPXE (PCI 00:03.0)...
Booting from Hard Disk...
Phoenix-RTOS loader v. 1.21 rev: 2e24763
hal: IA-32 Generic
cmd: Executing pre-init script
console: Setting console to 0.0
Waiting for input, 900 [ms]
Waiting for input, 800 [ms]
Waiting for input, 700 [ms]
Waiting for input, 600 [ms]
Waiting for input, 500 [ms]
Waiting for input, 400 [ms]
Waiting for input, 300 [ms]
Waiting for input, 200 [ms]
Waiting for input, 100 [ms]
Waiting for input, 0 [ms]
Waiting for input, 0 [ms]
Phoenix-RTOS microkernel v. 3.2 rev: 5570581
hal: GenuineIntel Family 6 Model 7 Stepping 3 (3/), cores=1
I've encountered this issue in an automatic test.
phoenix-rtos-tests/libc/exit: FAIL
EXPECTED:
0: ASSERTION (?P<path>[\\S]+):(?P<line>\\d+):(?P<status>FAIL|INFO|IGNORE): (?P<msg>.*?)\\r
1: TEST\\((?P<group>\\w+), (?P<name>\\w+)\\) (?P<status>PASS|IGNORE)
2: TEST\\((?P<group>\\w+), (?P<name>\\w+)\\) (?P<status>FAIL) at (?P<path>.*?):(?P<line>\\d+)\\r
3: (?P<total>\\d+) Tests (?P<fail>\\d+) Failures (?P<ignore>\\d+) Ignored \\r+\\n(?P<result>OK|FAIL)
GOT:
Unity test run 1 of 1
Triple fault
CPU Reset (CPU 0)
EAX=00000008 EBX=bffffe90 ECX=000000f7 EDX=c011340e
ESI=000000d5 EDI=00000000 EBP=000000f7 ESP=c02d0000
EIP=c0113409 EFL=00003046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
FS =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
GS =0033 00001004 bfffffff 00cbf300 DPL=3 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0028 c0147800 00000068 00008900 DPL=0 TSS32-avl
GDT= c0001000 000007ff
IDT= c0001800 000007ff
CR0=80000033 CR2=c02cfffc CR3=07fad000 CR4=00000010
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=0000000c CCD=00000000 CCO=LOGICL
EFER=0000000000000000
FCW=ff90 FSW=8184 [ST=0] FTW=ff MXCSR=00001f80
FPR0=00000004c02e1f54 4f84 FPR1=4f8800000004c02e c02e
FPR2=c01cb65000000004 5000 FPR3=fb0000000234c02e c030
FPR4=00000001c02e1f00 d0e0 FPR5=0000c0122429c02d 0000
FPR6=0000329200000000 23c0 FPR7=000100000000c012 0000
XMM00=0000000000000000 0000000000000000 XMM01=0000000000000000 0000000000000000
XMM02=0000000000000000 0000000000000000 XMM03=0000000000000000 0000000000000000
XMM04=0000000000000000 0000000000000000 XMM05=0000000000000000 0000000000000000
XMM06=0000000000000000 0000000000000000 XMM07=0000000000000000 0000000000000000
Triple fault
CPU Reset (CPU 0)
EAX=000f6106 EBX=000f3e0a ECX=00000000 EDX=00000cf9
ESI=00000000 EDI=00100000 EBP=00000000 ESP=00000fc8
EIP=000efb0a EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00cf9b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000f6180 00000037
IDT= 000f61be 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=000f61c8 CCD=00009e34 CCO=SUBL
EFER=0000000000000000
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=0000000000000000 0000000000000000 XMM01=0000000000000000 0000000000000000
XMM02=0000000000000000 0000000000000000 XMM03=0000000000000000 0000000000000000
XMM04=0000000000000000 0000000000000000 XMM05=0000000000000000 0000000000000000
XMM06=0000000000000000 0000000000000000 XMM07=0000000000000000 0000000000000000
SeaBIOS (version 1.15.0-1)
Issue encountered in the psh-history test: https://github.com/phoenix-rtos/phoenix-rtos-project/actions/runs/9477002770/job/26110890308
(psh)% g,zm_Aux4kpbeNvy
psh: g,zm_Aux4kpbeNvy not found
Triple fault
CPU Reset (CPU 0)
EAX=00000008 EBX=000000f5 ECX=0807b491 EDX=c011340e
ESI=bffffd5c EDI=00000003 EBP=bfffff18 ESP=c02d0000
EIP=c0113409 EFL=00003046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
FS =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
GS =0033 00001004 bfffffff 00cbf300 DPL=3 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0028 c0147800 00000068 00008900 DPL=0 TSS32-avl
GDT= c0001000 000007ff
IDT= c0001800 000007ff
CR0=80000033 CR2=c02cfffc CR3=07fad000 CR4=00000010
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=0000000c CCD=00000000 CCO=LOGICL
EFER=0000000000000000
FCW=ffa0 FSW=88c5 [ST=1] FTW=ff MXCSR=00001f80
FPR0=000100000000c012 0000 FPR1=00000004c03266d4 4f84
FPR2=4f8800000004c02e c02e FPR3=c01cb65000000004 7000
FPR4=3a0000000234c02e c031 FPR5=00000001c0326680 1040
FPR6=0000c0122429c02d 0000 FPR7=0000329200000000 23c0
XMM00=0000000000000000 0000000000000000 XMM01=0000000000000000 0000000000000000
XMM02=0000000000000000 0000000000000000 XMM03=0000000000000000 0000000000000000
XMM04=0000000000000000 0000000000000000 XMM05=0000000000000000 0000000000000000
XMM06=0000000000000000 0000000000000000 XMM07=0000000000000000 0000000000000000
Triple fault
CPU Reset (CPU 0)
EAX=000f6106 EBX=000f3e0a ECX=00000000 EDX=00000cf9
ESI=00000000 EDI=00100000 EBP=00000000 ESP=00000fc8
EIP=000efb0a EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00cf9b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000f6180 00000037
IDT= 000f61be 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=000f61c8 CCD=00009e34 CCO=SUBL
EFER=0000000000000000
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=0000000000000000 0000000000000000 XMM01=0000000000000000 0000000000000000
XMM02=0000000000000000 0000000000000000 XMM03=0000000000000000 0000000000000000
XMM04=0000000000000000 0000000000000000 XMM05=0000000000000000 0000000000000000
XMM06=0000000000000000 0000000000000000 XMM07=0000000000000000 0000000000000000
I've noticed that in every crash report in this thread, there is some garbage data in FPU registers that looks like data from stack. AFAIK it is not supposed to happen, but in theory, it shouldn't damage the stack.
I've decided to look into it. Since there is an issue with vfork() (https://github.com/phoenix-rtos/phoenix-rtos-project/issues/1077), I've decided to check if it works correctly with FPU (I've checked only fork()). It doesn't, and somehow it caused a page fault at exit. I'll fix that and maybe this issue will be fixed?
I've managed to reproduce the issue locally on QEMU 6.2.0 (4096M of RAM allocated for the machine). The reproduction is not very efficient: the additional RAM is required to avoid crashes caused by zombie processes. In my case it crashed in the second execution of this program:
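#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks once more: the grandchild issues x87 instructions in a tight loop,
 * the child touches the FPU once and waits for the grandchild. */
static void func(size_t id)
{
    if (fork() == 0) {
        for (size_t i = 0; i < 10000000; ++i) {
            __asm__ volatile ("fwait");
            __asm__ volatile ("fldz");
            __asm__ volatile ("nop");
        }
    }
    else {
        int status;
        __asm__ volatile ("fwait");
        __asm__ volatile ("fldz");
        __asm__ volatile ("nop");
        wait(&status);
        printf("%zu\n", id);
    }
    exit(0);
}

int main(void)
{
    for (size_t i = 0; i < 12800; ++i) {
        if (fork() == 0) {
            func(i); /* never returns: both branches end in exit(0) */
        }
    }
    /* Reap the direct children. */
    for (size_t i = 0; i < 12800; ++i) {
        int status;
        (void)wait(&status);
    }
    puts("");
    return 0;
}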
Crash register dump:
Triple fault
CPU Reset (CPU 0)
EAX=00000011 EBX=00000025 ECX=bfffff70 EDX=c011340e
ESI=00000000 EDI=c19cf000 EBP=bfffff58 ESP=c19ba000
EIP=c0113409 EFL=00003046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
FS =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
GS =0033 00001004 bfffffff 00cbf300 DPL=3 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0028 c0149800 00000068 00008900 DPL=0 TSS32-avl
GDT= c0001000 000007ff
IDT= c0001800 000007ff
CR0=80000033 CR2=c19b9ffc CR3=bf2c8000 CR4=00000010
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=0000000c CCD=00000000 CCO=LOGICL
EFER=0000000000000000
FCW=ff90 FSW=8187 [ST=0] FTW=ff MXCSR=00001f80
FPR0=00000004c1a0dfd8 ef84 FPR1=ef8800000004c19c c19c
FPR2=c01cd65000000004 1000 FPR3=f90000000234c19d c19f
FPR4=00000001c1a0df80 b040 FPR5=0000c01224c9c19b 0000
FPR6=0000329200000000 2460 FPR7=000100000000c012 0000
XMM00=0000000000000000 0000000000000000 XMM01=0000000000000000 0000000000000000
XMM02=0000000000000000 0000000000000000 XMM03=0000000000000000 0000000000000000
XMM04=0000000000000000 0000000000000000 XMM05=0000000000000000 0000000000000000
XMM06=0000000000000000 0000000000000000 XMM07=0000000000000000 0000000000000000
As you can see, once again there is an issue with garbage data in the FPU. I'll try to reproduce this error again, and then check if my patch works.
EDIT: Another Triple fault.
EAX=00000011 EBX=00000025 ECX=bfffff70 EDX=c011340e
ESI=00000000 EDI=c19cf000 EBP=bfffff58 ESP=c19ba000
EIP=c0113409 EFL=00003046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
FS =0023 00000000 c0000fff 00ccf300 DPL=3 DS [-WA]
GS =0033 00001004 bfffffff 00cbf300 DPL=3 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0028 c0149800 00000068 00008900 DPL=0 TSS32-avl
GDT= c0001000 000007ff
IDT= c0001800 000007ff
CR0=80000033 CR2=c19b9ffc CR3=bf31b000 CR4=00000010
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=0000000c CCD=00000000 CCO=LOGICL
EFER=0000000000000000
FCW=ff90 FSW=8187 [ST=0] FTW=ff MXCSR=00001f80
FPR0=00000004c19d20d8 ef84 FPR1=ef8800000004c19c c19c
FPR2=c01cd65000000004 1000 FPR3=da0000000234c19d c19f
FPR4=00000001c19d2080 b040 FPR5=0000c01224c9c19b 0000
FPR6=0000329200000000 2460 FPR7=000100000000c012 0000
XMM00=0000000000000000 0000000000000000 XMM01=0000000000000000 0000000000000000
XMM02=0000000000000000 0000000000000000 XMM03=0000000000000000 0000000000000000
XMM04=0000000000000000 0000000000000000 XMM05=0000000000000000 0000000000000000
XMM06=0000000000000000 0000000000000000 XMM07=0000000000000000 0000000000000000
I've found the reason for the triple fault. After we execute fsave in the exception handler, the system reports an exception 16 (#MF, the x87 floating-point error), but since there is fsave in the exception handler, we are stuck in an infinite loop until we run out of stack space and triple fault.
I've submitted changes that decrease the likelihood of a crash in this branch: https://github.com/phoenix-rtos/phoenix-rtos-kernel/tree/astalke/RTOS-858 (at least with the test code that I've included in one of the comments above).
Unfortunately, these changes don't fix the issue, and I think the last commit may cause errors in FPU calculations. I don't have enough time to make a proper fix.
Update
The issue has been reopened as it may also be related to #885 and problems with the psh runfile test on ia32-generic-qemu. Currently, I haven't observed it occurring directly in exit tests.
The problem occurs with the merge of 28ab383e627fe1d26df5737b12a938fe5ec473a3 in phoenix-rtos-kernel.
Encountering intermittent system reboots on the ia32-generic-qemu target. Specifically, the issue occurs approximately 5 out of 100 times when executing a test that involves fork() followed by test_common.test_exitPtr(EXIT_SUCCESS);. The expected behavior is for a SIGCHLD signal to be sent after the child process exits, but instead, the system reboots unexpectedly.
Output from CI:
Example workflow from github: https://github.com/phoenix-rtos/phoenix-rtos-ports/actions/runs/7846751690/job/21414331565
Project version: 0f35de229136c44bdf98245beb0d76167be5ea42