php / php-src

The PHP Interpreter
https://www.php.net
Other
37.93k stars 7.72k forks source link

PHP 8.3.7 with JIT encounters infinite loop on specific paths #14475

Closed Appla closed 2 months ago

Appla commented 3 months ago

Description

I noticed that a PHP script was consuming a very high amount of CPU. I used strace to investigate and found that it was repeatedly executing fcntl. I used GDB to trace the process and found that it was continuously looping through a specific code path, as detailed in the following quote.

One scenario is that our script is hosted under a PHP process, where the host process forks a child process to execute the specific code.


* GDB `zbacktrace` shows code in PHP land blocking on ` self::$traceAt = now();`
* The following script just trying to show the core parts of the code which triggered the issue
```PHP
namespace {
    function now()
    {
        return date('Y-m-d H:i:s');
    }
}

namespace core\Context {
    class A {
        private static $traceAt;

        public static function resetTraceCtx()
        {
            self::$traceAt = now();
        }
    }
}

namespace third {
    $max = 10000;
    $i = 0;
    while ($i++ < $max) {
        \core\Context\A::resetTraceCtx();
        // poll info from queue
        usleep(100000);
    }
}
* loop in ASM

0x000000001300042a ? add $0x100,%rsp 0x0000000013000431 ? test %eax,%eax 0x0000000013000433 ? jne 0x13000443 0x0000000013000435 ? mov 0x1a28f28,%r14 0x000000001300043d ? mov (%r14),%r15 0x0000000013000440 ? jmpq (%r15) -> 0x0000000013000443 ? jl 0x13000330 0x0000000013000449 ? mov 0x1a28f28,%r14 0x0000000013000451 ? mov (%r14),%r15 0x0000000013000454 ? cmpb $0x0,0x1a28f56 0x000000001300045c ? jne 0x13000000 0x0000000013000462 ? mov 0x18(%r14),%rax 0x0000000013000466 ? mov 0xc0(%rax),%rax 0x000000001300046d ? mov 0xa0(%rax),%rax 0x0000000013000474 ? jmpq (%r15,%rax,1)

0x00000000130af820 ? movl $0x3ff,0x1a28f38 0x00000000130af82b ? mov 0x40(%r14),%rdx 0x00000000130af82f ? mov 0x8(%rdx),%rax 0x00000000130af833 ? cmp $0x54306e0,%rax 0x00000000130af83a ? jne 0x130af8c0 -> 0x00000000130af840 ? mov 0x1a28f08,%r15 0x00000000130af848 ? mov 0x1a28f10,%rdx 0x00000000130af850 ? sub %r15,%rdx 0x00000000130af853 ? cmp $0x60,%rdx 0x00000000130af857 ? jb 0x13000904 0x00000000130af85d ? addq $0x60,0x1a28f08 0x00000000130af866 ? movl $0x0,0x28(%r15) 0x00000000130af86e ? mov %rax,0x18(%r15)

0x00000000130af8c0 ? mov $0x4fb4318,%rdi 0x00000000130af8c7 ? lea 0x8(%rdx),%rsi 0x00000000130af8cb ? movabs $0x7f3d4aa7e610,%rax // static zend_function ZEND_FASTCALL zend_jit_find_ns_func_helper(zval func_name, void **cache_slot)

0x00000000130af8d5 ? callq *%rax // return back 0x00000000130af8d7 ? cmp $0x54306e0,%rax 0x00000000130af8de ? je 0x130af840 0x00000000130af8e4 ? jmpq 0x13000900 -> 0x00000000130af8e9 ? add %al,(%rax) 0x00000000130af8eb ? add %al,(%rax) 0x00000000130af8ed ? add %al,(%rax) 0x00000000130af8ef ? add %cl,-0x39(%rcx) 0x00000000130af8f2 ? (bad)
0x00000000130af8f3 ? (bad)
0x00000000130af8f4 ? or $0x41,%al

0x0000000013000900 ? pushq $0x0 0x0000000013000902 ? jmp 0x1300097e ->

0x000000001300097e ? addq $0x0,(%rsp) 0x0000000013000983 ? jmpq 0x13000340 -> 0x0000000013000988 ? add %al,(%rax) 0x000000001300098a ? add %al,(%rax) 0x000000001300098c ? add %al,(%rax)

0x0000000013000340 ? sub $0xf8,%rsp 0x0000000013000347 ? mov %r15,0x78(%rsp) 0x000000001300034c ? mov %r11,0x58(%rsp) 0x0000000013000351 ? mov %r10,0x50(%rsp) 0x0000000013000356 ? mov %r9,0x48(%rsp) 0x000000001300035b ? mov %r8,0x40(%rsp) 0x0000000013000360 ? mov %rdi,0x38(%rsp) 0x0000000013000365 ? mov %rsi,0x30(%rsp) 0x000000001300036a ? mov %rdx,0x10(%rsp) 0x000000001300036f ? mov %rcx,0x8(%rsp) 0x0000000013000374 ? mov %rax,(%rsp) 0x0000000013000378 ? mov 0xf8(%rsp),%rdi 0x0000000013000380 ? mov %rsp,%rsi 0x0000000013000383 ? movsd %xmm15,0xf8(%rsp) 0x000000001300038d ? movsd %xmm14,0xf0(%rsp) 0x0000000013000397 ? movsd %xmm13,0xe8(%rsp) 0x00000000130003a1 ? movsd %xmm12,0xe0(%rsp) 0x00000000130003ab ? movsd %xmm11,0xd8(%rsp) 0x00000000130003ab ? movsd %xmm11,0xd8(%rsp) 0x00000000130003b5 ? movsd %xmm10,0xd0(%rsp) 0x00000000130003bf ? movsd %xmm9,0xc8(%rsp) 0x00000000130003c9 ? movsd %xmm8,0xc0(%rsp) 0x00000000130003d3 ? movsd %xmm7,0xb8(%rsp) 0x00000000130003dc ? movsd %xmm6,0xb0(%rsp) 0x00000000130003e5 ? movsd %xmm5,0xa8(%rsp) 0x00000000130003ee ? movsd %xmm4,0xa0(%rsp) 0x00000000130003f7 ? movsd %xmm3,0x98(%rsp) 0x0000000013000400 ? movsd %xmm2,0x90(%rsp) 0x0000000013000409 ? movsd %xmm1,0x88(%rsp) 0x0000000013000412 ? movsd %xmm0,0x80(%rsp) 0x000000001300041b ? mov %r15,(%r14) 0x000000001300041e ? movabs $0x7f3d4ab05090,%rax // int ZEND_FASTCALL zend_jit_trace_exit(uint32_t exit_num, zend_jit_registers_buf regs) 0x0000000013000428 ? callq %rax 0x000000001300042a ? add $0x100,%rsp // loop again 0x0000000013000431 ? test %eax,%eax



### PHP Version

PHP 8.3.7

### Operating System

Rocky Linux 8
Appla commented 3 months ago

After more dig this part may be the root cause https://github.com/php/php-src/blob/PHP-8.3/ext/opcache/jit/zend_jit_trace.c#L8331-L8353 https://github.com/php/php-src/blob/PHP-8.3/ext/opcache/jit/zend_jit_x86.dasc#L9465:L9473

iluuu1994 commented 3 months ago

/cc @dstogov

dstogov commented 3 months ago

I don't see any problems running the provided example. It records and executes just a single trace.

It seems, in your environment the trace is invalidated (for some reason) then recorded and executed again. Do you regenerate code for some PHP script or touch it (update its mtime)? Then this is the expected behavoir.

Try to run the code with -dopcache.jit_debug=0x1ff105 to collect more info...

Appla commented 3 months ago

I don't see any problems running the provided example. It records and executes just a single trace.

It seems, in your environment the trace is invalidated (for some reason) then recorded and executed again. Do you regenerate code for some PHP script or touch it (update its mtime)? Then this is the expected behavoir.

yes, the file may has changed during execution also the file mtime

Try to run the code with -dopcache.jit_debug=0x1ff105 to collect more info...

I was set jit_debug=0x1fffff in GDB and this part are hit

It looks like this code are hit in zend_jit_trace_exit and continue to execution original opline but not switch to interpreter and again and again

 if (ZEND_JIT_TRACE_NUM >= JIT_G(max_root_traces)) {
    /* skip: too many root traces */
}

the process still running with GDB attached, any other info I can provided?

dstogov commented 3 months ago

yes, the file may has changed during execution also the file mtime

In case some of the files is changed on each "request", the trace is going to be invalidated and recompiled on each "request". It's better to add this file in the opcache blacklist.

I understood, the problem occurs only after the "max_root_traces" limit is exceeded. Right? The number of compiled traces is not caused by a bug. It's caused by trace invalidation. Right?

It would be great if you could construct a test case. This way I will able to fix the exact problem.

Appla commented 3 months ago

yes, the file may has changed during execution also the file mtime

In case some of the files is changed on each "request", the trace is going to be invalidated and recompiled on each "request". It's better to add this file in the opcache blacklist.

My usage scenario involves a specific PHP script managing the execution of other PHP scripts, ensuring that the updated script content is used immediately upon file modification. I am unsure whether JIT is necessary for this type of requirement.

I understood, the problem occurs only after the "max_root_traces" limit is exceeded. Right? The number of compiled traces is not caused by a bug. It's caused by trace invalidation. Right?

Please disregard my speculation, I am not familiar with the implementation of this part.

It would be great if you could construct a test case. This way I will able to fix the exact problem.

This is a script running in our test environment. I'm not sure if it can be isolated for reproduction. I'll try to see. btw: this is the only script out of many that consistently trigger this issue.

dstogov commented 3 months ago

My usage scenario involves a specific PHP script managing the execution of other PHP scripts, ensuring that the updated script content is used immediately upon file modification. I am unsure whether JIT is necessary for this type of requirement.

It's better to disable not only JIT, but also opcache, because your invalidated scripts re-cached again and again. Instead of complete disabling you may blacklist specific files.

Please disregard my speculation, I am not familiar with the implementation of this part.

I just trying to interpret your reference to the code: if (ZEND_JIT_TRACE_NUM >= JIT_G(max_root_traces)) {, but without the reproduction I may only guess.

Appla commented 3 months ago

My usage scenario involves a specific PHP script managing the execution of other PHP scripts, ensuring that the updated script content is used immediately upon file modification. I am unsure whether JIT is necessary for this type of requirement.

It's better to disable not only JIT, but also opcache, because your invalidated scripts re-cached again and again. Instead of complete disabling you may blacklist specific files.

I have a large number of scripts to execute, and script updates are relatively infrequent. Would disabling opcache have any performance or anything else implications?

Please disregard my speculation, I am not familiar with the implementation of this part.

I just trying to interpret your reference to the code: if (ZEND_JIT_TRACE_NUM >= JIT_G(max_root_traces)) {, but without the reproduction I may only guess.

I've tried isolating the relevant code, but the issue hasn't been reproduced in a short period. The affected script has been running the management process for over a week. Apart from reproducing the script, is there any information that can be obtained from the running process to help troubleshoot this issue? I can provide this kind information for now.

MrSuddenJoy commented 3 months ago

@dstogov no idea how you came to the conclusion that provided snippet works (with exact clone of user's env) and TBH I dont really wanna know..........

Appla commented 3 months ago

I don't see any problems running the provided example. It records and executes just a single trace.

It seems, in your environment the trace is invalidated (for some reason) then recorded and executed again. Do you regenerate code for some PHP script or touch it (update its mtime)? Then this is the expected behavoir.

Try to run the code with -dopcache.jit_debug=0x1ff105 to collect more info...

The issue was reproduced after running with the -dopcache.jit_debug=0x1ff105 option for a few days.

after TRACE 1023 (reach max_root_traces?), the issue happened as above described

---- TRACE 1020 TSSA start (enter) core\Context::resetTraceCtx() /path/to/class/Context.php:119
0000 INIT_NS_FCALL_BY_NAME 0 string("core\\now")
     >init now
0001 #0.V0 [ref, rc1, rcn, any] = DO_FCALL_BY_NAME
     >enter now
---- TRACE 1020 TSSA stop (link to 515)
TRACE-1020$core\Context::resetTraceCtx$119: ; (unknown)
     mov $0x3fc, EG(jit_trace_num)
     mov 0x40(%r14), %rdx
     mov 0x8(%rdx), %rax
     cmp $0x7b7a960, %rax
     jnz .L2
.L1:
     mov EG(vm_stack_top), %r15
     mov EG(vm_stack_end), %rdx
     sub %r15, %rdx
     cmp $0x60, %rdx
     jb jit$$trace_exit_1
     add $0x60, EG(vm_stack_top)
     mov $0x0, 0x28(%r15)
     mov %rax, 0x18(%r15)
     mov $0x0, 0x20(%r15)
     mov $0x0, 0x2c(%r15)
     mov $0x6f12a08, (%r14)
     mov %r14, 0x30(%r15)
     mov $0x0, 0x8(%r15)
     lea 0x50(%r14), %rdx
     mov %rdx, 0x10(%r15)
     mov CG(map_ptr_base), %rdx
     mov 0x237b1(%rdx), %rdx
     mov %rdx, 0x40(%r15)
     mov %r15, EG(current_execute_data)
     mov %r15, %r14
     jmp 0x13aa66f0
.L2:
     mov $0x6f12958, %rdi
     lea 0x8(%rdx), %rsi
     mov $zend_jit_find_ns_func_helper, %rax
     call *%rax
     cmp $0x7b7a960, %rax
     jz .L1
     jmp jit$$trace_exit_0

---- TRACE 1020 exit info
     exit_0: 0000/----/0
     exit_1: 0000/----/0/VM
---- TRACE 1020 compiled

     TRACE 1020 exit 0 core\Context::resetTraceCtx() /path/to/class/Context.php:119
---- TRACE 1021 start (enter) core\Context::resetTraceCtx() /path/to/class/Context.php:119
0000 INIT_NS_FCALL_BY_NAME 0 string("core\\now")
     >init now
0001 V0 = DO_FCALL_BY_NAME
     >enter now
---- TRACE 1021 stop (link to 396)
---- TRACE 1021 TSSA start (enter) core\Context::resetTraceCtx() /path/to/class/Context.php:119
0000 INIT_NS_FCALL_BY_NAME 0 string("core\\now")
     >init now
0001 #0.V0 [ref, rc1, rcn, any] = DO_FCALL_BY_NAME
     >enter now
---- TRACE 1021 TSSA stop (link to 396)
TRACE-1021$core\Context::resetTraceCtx$119: ; (unknown)
     mov $0x3fd, EG(jit_trace_num)
     mov 0x40(%r14), %rdx
     mov 0x8(%rdx), %rax
     cmp $0x7507420, %rax
     jnz .L2
.L1:
     mov EG(vm_stack_top), %r15
     mov EG(vm_stack_end), %rdx
     sub %r15, %rdx
     cmp $0x60, %rdx
     jb jit$$trace_exit_1
     add $0x60, EG(vm_stack_top)
     mov $0x0, 0x28(%r15)
     mov %rax, 0x18(%r15)
     mov $0x0, 0x20(%r15)
     mov $0x0, 0x2c(%r15)
     mov $0x6f12a08, (%r14)
     mov %r14, 0x30(%r15)
     mov $0x0, 0x8(%r15)
     lea 0x50(%r14), %rdx
     mov %rdx, 0x10(%r15)
     mov CG(map_ptr_base), %rdx
     mov 0x20639(%rdx), %rdx
     mov %rdx, 0x40(%r15)
     mov %r15, EG(current_execute_data)
     mov %r15, %r14
     jmp 0x13a7eab0
.L2:
     mov $0x6f12958, %rdi
     lea 0x8(%rdx), %rsi
     mov $zend_jit_find_ns_func_helper, %rax
     call *%rax
     cmp $0x7507420, %rax
     jz .L1
     jmp jit$$trace_exit_0

---- TRACE 1021 exit info
     exit_0: 0000/----/0
     exit_1: 0000/----/0/VM
---- TRACE 1021 compiled

     TRACE 1021 exit 0 core\Context::resetTraceCtx() /path/to/class/Context.php:119
---- TRACE 1022 start (enter) core\Context::resetTraceCtx() /path/to/class/Context.php:119
0000 INIT_NS_FCALL_BY_NAME 0 string("core\\now")
     >init now
0001 V0 = DO_FCALL_BY_NAME
     >enter now
---- TRACE 1022 stop (link to 515)
---- TRACE 1022 TSSA start (enter) core\Context::resetTraceCtx() /path/to/class/Context.php:119
0000 INIT_NS_FCALL_BY_NAME 0 string("core\\now")
     >init now
0001 #0.V0 [ref, rc1, rcn, any] = DO_FCALL_BY_NAME
     >enter now
---- TRACE 1022 TSSA stop (link to 515)
TRACE-1022$core\Context::resetTraceCtx$119: ; (unknown)
    mov $0x3fe, EG(jit_trace_num)
    mov 0x40(%r14), %rdx
    mov 0x8(%rdx), %rax
    cmp $0x7b7a960, %rax
    jnz .L2
.L1:
    mov EG(vm_stack_top), %r15
    mov EG(vm_stack_end), %rdx
    sub %r15, %rdx
    cmp $0x60, %rdx
    jb jit$$trace_exit_1
    add $0x60, EG(vm_stack_top)
    mov $0x0, 0x28(%r15)
    mov %rax, 0x18(%r15)
    mov $0x0, 0x20(%r15)
    mov $0x0, 0x2c(%r15)
    mov $0x6f12a08, (%r14)
    mov %r14, 0x30(%r15)
    mov $0x0, 0x8(%r15)
    lea 0x50(%r14), %rdx
    mov %rdx, 0x10(%r15)
    mov CG(map_ptr_base), %rdx
    mov 0x237b1(%rdx), %rdx
    mov %rdx, 0x40(%r15)
    mov %r15, EG(current_execute_data)
    mov %r15, %r14
    jmp TRACE-515$now$55
.L2:
    mov $0x6f12958, %rdi
    lea 0x8(%rdx), %rsi
    mov $zend_jit_find_ns_func_helper, %rax
    call *%rax
    cmp $0x7b7a960, %rax
    jz .L1
    jmp jit$$trace_exit_0

---- TRACE 1022 exit info
     exit_0: 0000/----/0
     exit_1: 0000/----/0/VM
---- TRACE 1022 compiled

     TRACE 1022 exit 0 core\Context::resetTraceCtx() /path/to/class/Context.php:119
---- TRACE 1023 start (enter) core\Context::resetTraceCtx() /path/to/class/Context.php:119
0000 INIT_NS_FCALL_BY_NAME 0 string("core\\now")
     >init now
0001 V0 = DO_FCALL_BY_NAME
     >enter now
---- TRACE 1023 stop (link to 396)
---- TRACE 1023 TSSA start (enter) core\Context::resetTraceCtx() /path/to/class/Context.php:119
0000 INIT_NS_FCALL_BY_NAME 0 string("core\\now")
     >init now
0001 #0.V0 [ref, rc1, rcn, any] = DO_FCALL_BY_NAME
     >enter now
---- TRACE 1023 TSSA stop (link to 396)
TRACE-1023$core\Context::resetTraceCtx$119: ; (unknown)
    mov $0x3ff, EG(jit_trace_num)
    mov 0x40(%r14), %rdx
    mov 0x8(%rdx), %rax
    cmp $0x7507420, %rax
    jnz .L2
.L1:
    mov EG(vm_stack_top), %r15
    mov EG(vm_stack_end), %rdx
    sub %r15, %rdx
    cmp $0x60, %rdx
    jb jit$$trace_exit_1
    add $0x60, EG(vm_stack_top)
    mov $0x0, 0x28(%r15)
    mov %rax, 0x18(%r15)
    mov $0x0, 0x20(%r15)
    mov $0x0, 0x2c(%r15)
    mov $0x6f12a08, (%r14)
    mov %r14, 0x30(%r15)
    mov $0x0, 0x8(%r15)
    lea 0x50(%r14), %rdx
    mov %rdx, 0x10(%r15)
    mov CG(map_ptr_base), %rdx
    mov 0x20639(%rdx), %rdx
    mov %rdx, 0x40(%r15)
    mov %r15, EG(current_execute_data)
    mov %r15, %r14
    jmp 0x13a7eab0
.L2:
    mov $0x6f12958, %rdi
    lea 0x8(%rdx), %rsi
    mov $zend_jit_find_ns_func_helper, %rax
    call *%rax
    cmp $0x7507420, %rax
    jz .L1
    jmp jit$$trace_exit_0

---- TRACE 1023 exit info
     exit_0: 0000/----/0
     exit_1: 0000/----/0/VM
---- TRACE 1023 compiled

//  after above, The debug output is basically all the same.
.........
TRACE 1023 exit 0 core\Context::resetTraceCtx() /path/to/class/Context.php:119
TRACE 1023 exit 0 core\Context::resetTraceCtx() /path/to/class/Context.php:119
TRACE 1023 exit 0 core\Context::resetTraceCtx() /path/to/class/Context.php:119
TRACE 1023 exit 0 core\Context::resetTraceCtx() /path/to/class/Context.php:119
.........

Is anything else needed?

dstogov commented 3 months ago

I hope https://github.com/php/php-src/pull/14558 should fix this @Appla it would be great if you could test the patch

MrSuddenJoy commented 3 months ago

@dstogov for me your patch works well.

Appla commented 3 months ago

I hope #14558 should fix this @Appla it would be great if you could test the patch

I will feedback later.

I also have a question: what specific conditions could cause this issue with simple code? I am sure the file and mtime not changed during execution.

dstogov commented 3 months ago

@Appla without a reproduction, I can only guess about your case. The patch fixes a case when a root trace, started from INIT_*FCALL* instruction, is "invalidated" after "jit_max_root_trace" limit is exceeded.