swoole / swoole-src

🚀 Coroutine-based concurrency library for PHP
https://www.swoole.com
Apache License 2.0
18.27k stars 3.16k forks source link

swoole5.0.1 php8.1/8.0.23 使用table incr/set 出现段错误。附有coredump分析 #4954

Open wuyiwai opened 1 year ago

wuyiwai commented 1 year ago

Please answer these questions before submitting your issue.

What did you do? If possible, provide a simple script for reproducing the error.

我写了一个swoole的Server,部署在k8s上,在server里使用了table,总的大概预先分配了20g内存给table,程序的主要逻辑是在table中做set和incr操作,在这个过程中发现会使得程序异常,产生coredump。关键代码如下:

foreach ($params as $item)
{
    $countRet = $this->userMailFolderMap->incr('48142122:5', 'count', $count);
}

以上简单的代码暂时无法稳定复现,但启动服务后,24h内必定产生coredump。只能从coredump日志推出是由于incr导致。由于我incr的入口只有这一处,且上下文没有很特殊的逻辑,所以暂时无法提供稳定复现的php代码。

What did you expect to see?

如果该问题很明确且有解决方案,希望获取解决方案。 如果该问题不清晰或目前没有解决方案,由于目前没有方向进行继续定位,希望可以提供一些方向使得我可以继续进行debug,如果信息不够可以沟通继续补充。

What did you see instead?

目前在程序的stdout日志中,会打印出以下log,属于issue中的段错误类型

WARNING Server::check_worker_exit_status(): worker(pid=31972, id=3) abnormal exit, status=255, signal=0

下面是php 8.0.23版本下的调试分析 对coredump文件进入gdb调试后的堆栈信息如下:

(gdb) bt
#0  0x00007fb4f69f3aff in raise () from /usr/lib64/libc.so.6
#1  0x00007fb4f69c6ea5 in abort () from /usr/lib64/libc.so.6
#2  0x00007fb4f69c6d79 in __assert_fail_base.cold.0 () from /usr/lib64/libc.so.6
#3  0x00007fb4f69ec456 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00007fb4f2fc20c2 in swoole::FixedPool::alloc (this=0x36bf650, size=<optimized out>)
    at /usr/local/src/swoole/src/memory/fixed_pool.cc:149
#5  swoole::FixedPool::alloc (this=0x36bf650, size=<optimized out>)
    at /usr/local/src/swoole/src/memory/fixed_pool.cc:149
#6  0x00007fb4f2fc3c60 in swoole::Table::alloc_row (this=0x7fb4d2c55010)
    at /usr/local/src/swoole/include/swoole_table.h:278
#7  swoole::Table::set (this=this@entry=0x7fb4d2c55010, key=0x7faf725f36f8 "48142122:5", keylen=10,
    rowlock=rowlock@entry=0x39d02f8, out_flags=out_flags@entry=0x39d02cc)
    at /usr/local/src/swoole/src/memory/table.cc:322
#8  0x00007fb4f2f9c711 in zim_swoole_table_incr (execute_data=0x7faf72250600, return_value=0x7faf72250560)
    at /usr/local/src/swoole/ext-src/swoole_table.cc:369
#9  0x00000000007f82cd in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at /usr/local/src/php/Zend/zend_vm_execute.h:1863
#10 execute_ex (ex=0x2) at /usr/local/src/php/Zend/zend_vm_execute.h:55176
#11 0x00007fb4f2f1d2dd in swoole::PHPCoroutine::main_func (arg=0x7fff63934bb0)
    at /usr/local/src/swoole/ext-src/swoole_coroutine.cc:700
#12 0x00007fb4f2fac680 in std::function<void (void*)>::operator()(void*) const (__args#0=<optimized out>,
    this=0x3b60910) at /usr/include/c++/8/bits/std_function.h:682
#13 swoole::coroutine::Context::context_func (arg=0x3b60910)
    at /usr/local/src/swoole/src/coroutine/context.cc:142
#14 0x00007fb4f302fd31 in swoole_make_fcontext ()
    at /usr/local/src/swoole/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S:70
#15 0x0000000000200010 in ?? ()
#16 0x0000000000000021 in ?? ()
#17 0x0000000003a492e0 in ?? ()
#18 0x00007fb4f6d65c30 in main_arena () from /usr/lib64/libc.so.6
#19 0x0000000000000020 in ?? ()
#20 0x00000000000000a0 in ?? ()
#21 0x0000000000000040 in ?? ()
#22 0x0000000000000000 in ?? ()

还有一些其他的core dump,是在php 8.1版本上测的,内容差不多,一起贴上来

#0  swoole::FixedPool::alloc (this=0x2c332b0, size=0) at /usr/local/src/swoole/src/memory/fixed_pool.cc:151
#1  0x00007f85b9e07fb0 in swoole::Table::alloc_row (this=0x7f8599a91010) at /usr/local/src/swoole/include/swoole_table.h:278
#2  swoole::Table::set (this=this@entry=0x7f8599a91010, key=key@entry=0x7f85999ae018 "48380494:1", keylen=keylen@entry=10, rowlock=rowlock@entry=0x5736248, out_flags=out_flags@entry=0x5736244)
    at /usr/local/src/swoole/src/memory/table.cc:322
#3  0x00007f85b9de112f in zim_swoole_table_set (execute_data=0x7f859968ba10, return_value=0x7f859968b990) at /usr/local/src/swoole/ext-src/swoole_table.cc:295
#4  0x00000000007f52c7 in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/local/src/php/Zend/zend_vm_execute.h:1870
#5  execute_ex (ex=0x2c332b0) at /usr/local/src/php/Zend/zend_vm_execute.h:55807
#6  0x00007f85b9d5b8ed in swoole::PHPCoroutine::main_func (arg=0x7ffde52b5910) at /usr/local/src/swoole/ext-src/swoole_coroutine.cc:700
#7  0x00007f85b9df09d0 in std::function<void (void*)>::operator()(void*) const (__args#0=<optimized out>, this=0x4f372b0) at /usr/include/c++/8/bits/std_function.h:682
#8  swoole::coroutine::Context::context_func (arg=0x4f372b0) at /usr/local/src/swoole/src/coroutine/context.cc:142
#9  0x00007f85b9e73bd1 in swoole_make_fcontext () at /usr/local/src/swoole/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S:70
#10 0x0000000000000000 in ?? ()
(gdb) bt
#0  swoole::FixedPool::alloc (this=0x2a6a650, size=0)
    at /usr/local/src/swoole/src/memory/fixed_pool.cc:151
#1  0x00007f0a4efc3c60 in swoole::Table::alloc_row (this=0x7f0a2ec55010)
    at /usr/local/src/swoole/include/swoole_table.h:278
#2  swoole::Table::set (this=this@entry=0x7f0a2ec55010, key=0x7f0a2e893798 "39927417:9", keylen=10,
    rowlock=rowlock@entry=0x2d7b528, out_flags=out_flags@entry=0x2d7b4fc)
    at /usr/local/src/swoole/src/memory/table.cc:322
#3  0x00007f0a4ef9c711 in zim_swoole_table_incr (execute_data=0x7f0a2ebe8600,
    return_value=0x7f0a2ebe8560) at /usr/local/src/swoole/ext-src/swoole_table.cc:369
#4  0x00000000007f82cd in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at /usr/local/src/php/Zend/zend_vm_execute.h:1863
#5  execute_ex (ex=0x2a6a650) at /usr/local/src/php/Zend/zend_vm_execute.h:55176
#6  0x00007f0a4ef1d2dd in swoole::PHPCoroutine::main_func (arg=0x7ffeb60556b0)
    at /usr/local/src/swoole/ext-src/swoole_coroutine.cc:700
#7  0x00007f0a4efac680 in std::function<void (void*)>::operator()(void*) const (
    __args#0=<optimized out>, this=0x2b794a0) at /usr/include/c++/8/bits/std_function.h:682
#8  swoole::coroutine::Context::context_func (arg=0x2b794a0)
    at /usr/local/src/swoole/src/coroutine/context.cc:142
#9  0x00007f0a4f02fd31 in swoole_make_fcontext ()
    at /usr/local/src/swoole/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S:70
#10 0x0000000000000000 in ?? ()
(gdb)
(gdb) bt
#0  swoole::FixedPool::alloc (this=0x2c332b0, size=0) at /usr/local/src/swoole/src/memory/fixed_pool.cc:151
#1  0x00007f85b9e07fb0 in swoole::Table::alloc_row (this=0x7f8599a91010) at /usr/local/src/swoole/include/swoole_table.h:278
#2  swoole::Table::set (this=this@entry=0x7f8599a91010, key=key@entry=0x7f85999af3d8 "48307068:4234896397235", keylen=keylen@entry=22, rowlock=rowlock@entry=0x5531ab8, out_flags=out_flags@entry=0x5531ab4)
    at /usr/local/src/swoole/src/memory/table.cc:322
#3  0x00007f85b9de112f in zim_swoole_table_set (execute_data=0x7f85996b4a10, return_value=0x7f85996b4990) at /usr/local/src/swoole/ext-src/swoole_table.cc:295
#4  0x00000000007f52c7 in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/local/src/php/Zend/zend_vm_execute.h:1870
#5  execute_ex (ex=0x2c332b0) at /usr/local/src/php/Zend/zend_vm_execute.h:55807
#6  0x00007f85b9d5b8ed in swoole::PHPCoroutine::main_func (arg=0x7ffde52b5910) at /usr/local/src/swoole/ext-src/swoole_coroutine.cc:700
#7  0x00007f85b9df09d0 in std::function<void (void*)>::operator()(void*) const (__args#0=<optimized out>, this=0x3b353e0) at /usr/include/c++/8/bits/std_function.h:682
#8  swoole::coroutine::Context::context_func (arg=0x3b353e0) at /usr/local/src/swoole/src/coroutine/context.cc:142
#9  0x00007f85b9e73bd1 in swoole_make_fcontext () at /usr/local/src/swoole/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S:70
#10 0x0000000000000000 in ?? ()

What version of Swoole are you using (show your php --ri swoole)?

Swoole => enabled
Author => Swoole Team <team@swoole.com>
Version => 5.0.1
Built => Jan 10 2023 09:13:22
coroutine => enabled with boost asm context
epoll => enabled
eventfd => enabled
signalfd => enabled
cpu_affinity => enabled
spinlock => enabled
rwlock => enabled
openssl => OpenSSL 1.1.1k  FIPS 25 Mar 2021
dtls => enabled
http2 => enabled
json => enabled
curl-native => enabled
zlib => 1.2.11
mutex_timedlock => enabled
pthread_barrier => enabled
futex => enabled
async_redis => enabled

Directive => Local Value => Master Value
swoole.enable_coroutine => On => On
swoole.enable_library => On => On
swoole.enable_preemptive_scheduler => Off => Off
swoole.display_errors => On => On
swoole.use_shortname => On => On
swoole.unixsock_buffer_size => 8388608 => 8388608

What is your machine environment used (show your uname -a & php -v & gcc -v) ?

linux
Linux crm-mail-unread-server-0 4.19.91-26.1.al7.x86_64 #1 SMP Tue Jul 26 17:52:28 CST 2022 x86_64 x86_64 x86_64GNU/Linux

php
PHP 8.0.23 (cli) (built: Jan 10 2023 09:10:32) ( NTS )Copyright (c) The PHP Group
Zend Engine v4.0.23, Copyright (c) Zend Technologies
    with Zend OPcache v8.0.23, Copyright (c), by Zend Technologies

gcc
Using built-in specs.COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.rockylinux.org/ --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --disable-libmpx --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.5.0 20210514 (Red Hat 8.5.0-15) (GCC)
NathanFreeman commented 1 year ago

这个看起来像是地址问题,gdb的时候输入f 0进入alloc函数,打印一下slice看看。

wuyiwai commented 1 year ago
以下是完整的调试信息
#0  swoole::FixedPool::alloc (this=0x2a6a650, size=0)
    at /usr/local/src/swoole/src/memory/fixed_pool.cc:151
#1  0x00007f0a4efc3c60 in swoole::Table::alloc_row (this=0x7f0a2ec55010)
    at /usr/local/src/swoole/include/swoole_table.h:278
#2  swoole::Table::set (this=this@entry=0x7f0a2ec55010,
    key=0x7f0a2e893798 "39927417:9", keylen=10,
    rowlock=rowlock@entry=0x2d7b528, out_flags=out_flags@entry=0x2d7b4fc)
    at /usr/local/src/swoole/src/memory/table.cc:322
#3  0x00007f0a4ef9c711 in zim_swoole_table_incr (execute_data=0x7f0a2ebe8600,
    return_value=0x7f0a2ebe8560)
    at /usr/local/src/swoole/ext-src/swoole_table.cc:369
#4  0x00000000007f82cd in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at /usr/local/src/php/Zend/zend_vm_execute.h:1863
#5  execute_ex (ex=0x2a6a650)
    at /usr/local/src/php/Zend/zend_vm_execute.h:55176
#6  0x00007f0a4ef1d2dd in swoole::PHPCoroutine::main_func (arg=0x7ffeb60556b0)
    at /usr/local/src/swoole/ext-src/swoole_coroutine.cc:700
#7  0x00007f0a4efac680 in std::function<void (void*)>::operator()(void*) const    (__args#0=<optimized out>, this=0x2b794a0)
    at /usr/include/c++/8/bits/std_function.h:682
#8  swoole::coroutine::Context::context_func (arg=0x2b794a0)
    at /usr/local/src/swoole/src/coroutine/context.cc:142
#9  0x00007f0a4f02fd31 in swoole_make_fcontext ()

接下来一步步打印其中的变量
(gdb) f 0#0  swoole::FixedPool::alloc (this=0x2a6a650, size=0)    at /usr/local/src/swoole/src/memory/fixed_pool.cc:151151     in /usr/local/src/swoole/src/memory/fixed_pool.cc(gdb) p slice
$1 = (swoole::FixedPoolSlice *) 0x0
(gdb) p impl
$2 = (swoole::FixedPoolImpl *) 0x7f0a11b33008

(gdb) f 1
#1  0x00007f0a4efc3c60 in swoole::Table::alloc_row (this=0x7f0a2ec55010)
    at /usr/local/src/swoole/include/swoole_table.h:278
278     in /usr/local/src/swoole/include/swoole_table.h
(gdb) p new_row
$4 = <optimized out>(gdb) p pool
$5 = (swoole::FixedPool *) 0x2a6a650

(gdb) f 2
#2  swoole::Table::set (this=this@entry=0x7f0a2ec55010,    key=0x7f0a2e893798 "39927417:9", keylen=10,
    rowlock=rowlock@entry=0x2d7b528, out_flags=out_flags@entry=0x2d7b4fc)
    at /usr/local/src/swoole/src/memory/table.cc:322
322     /usr/local/src/swoole/src/memory/table.cc: No such file or directory.
(gdb) p key
$1 = 0x7f0a2e893798 "39927417:9"
(gdb) p keylen
$2 = 10
(gdb) p rowlock
$3 = (swoole::TableRow **) 0x2d7b528
(gdb) p out_flags
$4 = (int *) 0x2d7b4fc
(gdb) p keylen
$1 = 10
(gdb) p row
$2 = (swoole::TableRow *) 0x7f09a63ac9f8
(gdb) p _conflict_level
$3 = <optimized out>
(gdb)p row->active
$5 = 1 '\001'
(gdb) p row->key
$6 = "48379855:1797613", '\000' <repeats 47 times>
(gdb) p key
$7 = 0x7f0a2e893798 "39927417:9"
(gdb) p keylen
$8 = 10
(gdb) p row->key_len
$9 = 16 '\020'
(gdb) p row->next
$10 = (swoole::TableRow *) 0x0
(gdb) p conflict_count
$11 = 112868
(gdb) p _conflict_level
$12 = <optimized out>
(gdb) p conflict_max_level
$13 = 4
(gdb) p new_row
$14 = <optimized out>
(gdb) p insert_count
$15 = 1977079
(gdb) p update_count
$16 = 2697062

(gdb)f 3
#3  0x00007f0a4ef9c711 in zim_swoole_table_incr (execute_data=0x7f0a2ebe8600,
    return_value=0x7f0a2ebe8560)
    at /usr/local/src/swoole/ext-src/swoole_table.cc:369
369     /usr/local/src/swoole/ext-src/swoole_table.cc: No such file or directory.
(gdb) p table
$18 = (swoole::Table *) 0x7f0a2ec55010
(gdb) p key_len
$19 = 10
(gdb) p col_len
$20 = 5
(gdb) p incrby
$21 = (zval *) 0x7f0a2ebe8670
(gdb) p key
$22 = 0x7f0a2e893798 "39927417:9"
(gdb) p col
$23 = 0x7f0a2e913c08 "count"
(gdb) p incrby
$24 = (zval *) 0x7f0a2ebe8670
(gdb) p out_flags
$25 = 0
(gdb) p _rowlock
$26 = (swoole::TableRow *) 0x7f09a63ac9f8
(gdb) p row
$27 = <optimized out>

(gdb) f 6
#6  0x00007f0a4ef1d2dd in swoole::PHPCoroutine::main_func (arg=0x7ffeb60556b0)
    at /usr/local/src/swoole/ext-src/swoole_coroutine.cc:700
700     /usr/local/src/swoole/ext-src/swoole_coroutine.cc: No such file or directory.
(gdb) p current_execute_data
No symbol "current_execute_data" in current context.
(gdbp call
$1 = (zend_execute_data *) 0x7f0a2ebe80e0
(gdb) p func
$2 = (zend_function *) 0x7f0a2e92eed0
(gdb) p func->op_array
$3 = {type = 13 '\r', arg_flags = "/\177", fn_flags = 0, function_name = 0x0,
  scope = 0x1b0, prototype = 0x400003c0000050b, num_args = 8327985,
  required_num_args = 0, arg_info = 0x1b000000160, attributes = 0x34,
  cache_size = 1291, last_var = 264214, T = 8324372, last = 0,
  opcodes = 0xfffff0c800779f70, run_time_cache__ptr = 0xd0000001b0,
  static_variables_ptr__ptr = 0x20100520000050c,
  static_variables = 0x7f1b9d <execute_ex+6333>, vars = 0xfffff0b8000001b0,
  refcount = 0x3000000f8, last_live_range = 1292, last_try_catch = 66160,
  live_range = 0x7f3237 <execute_ex+12119>, try_catch_array = 0x100000150,
  filename = 0x50, line_start = 1292, line_end = 2114,
  doc_comment = 0x7f468a <execute_ex+17322>, last_literal = -4072,
  literals = 0x60, reserved = {0x1740000050c, 0x7f3237 <execute_ex+12119>,
    0x300000120, 0x70, 0x8420000050c, 0x7f2f0d <execute_ex+11309>}}
(gdb) p retval
$4 = (zval *) 0x2d7b650
(gdb) p func->type
$5 = 13 '\r'

(gdb) f 8
#8  swoole::coroutine::Context::context_func (arg=0x2b794a0)
    at /usr/local/src/swoole/src/coroutine/context.cc:142
142     /usr/local/src/swoole/src/coroutine/context.cc: No such file or directory.
(gdb) p arg
$6 = (void *) 0x2b794a0
(gdb) p _this
$7 = (swoole::coroutine::Context *) 0x2b794a0
(gdb) p _this->private_data_
$8 = (void *) 0x7ffeb60556b0
(gdb) p _this->fn_
$9 = {<std::_Maybe_unary_or_binary_function<void, void*>> = {<std::unary_function<void*, void>> = {<No d
ata fields>}, <No data fields>}, <std::_Function_base> = {static _M_max_size = 16, static _M_max_align =
 8, _M_functor = {_M_unused = {
        _M_object = 0x7f0a4ef1cee0 <swoole::PHPCoroutine::main_func(void*)>,
        _M_const_object = 0x7f0a4ef1cee0 <swoole::PHPCoroutine::main_func(void*)>,
        _M_function_pointer = 0x7f0a4ef1cee0 <swoole::PHPCoroutine::main_func(void*)>,
        _M_member_pointer = (void (std::_Undefined_class::*)(std::_Undefined_class * const)) 0x7f0a4ef1c
ee0 <swoole::PHPCoroutine::main_func(void*)>, this adjustment 139682314584409},
      _M_pod_data = "\340\316\361N\n\177\000\000Y\001\276R\n\177\000"},
    _M_manager = 0x7f0a4ef1fcd0 <std::_Function_base::_Base_manager<void (*)(void*)>::_M_manager(std::_A
ny_data&, std::_Any_data const&, std::_Manager_operation)>},
  _M_invoker = 0x7f0a4ef1fcc0 <std::_Function_handler<void (void*), void (*)(void*)>::_M_invoke(std::_An
y_data const&, void*&&)>}
(gdb) p _this->end_
$10 = false

还需要更详细的信息吗,我会及时补充上来 @NathanFreeman

NathanFreeman commented 1 year ago

slice在这里是个空指针,所以段错误了。我看看是为啥。

wuyiwai commented 1 year ago

slice在这里是个空指针,所以段错误了。我看看是为啥。 感谢,如果还需要更多关于程序逻辑或者相关信息可以直接@我,我会及时补充相关

NathanFreeman commented 1 year ago

可以gdb之后, f 0 进去alloc看看impl->head是不是空指针吗

wuyiwai commented 1 year ago

可以gdb之后, f 0 进去alloc看看impl->head是不是空指针吗

(gdb) p impl $1 = (swoole::FixedPoolImpl ) 0x7fb4b5b33008 (gdb) p impl->head $2 = (swoole::FixedPoolSlice ) 0x7fb4d0b68348

NathanFreeman commented 1 year ago

谢谢

wuyiwai commented 1 year ago

应该的。目前的现状是。我曾经部署过两个版本的swoole服务在k8s上。swoole的版本都是5.0.1,但php的版本曾经打过8.1和8.0.23的镜像。

8.1的gdb分析是这样的:

(gdb) bt
#0  swoole::FixedPool::alloc (this=0x2a6a650, size=0)
    at /usr/local/src/swoole/src/memory/fixed_pool.cc:151
#1  0x00007f0a4efc3c60 in swoole::Table::alloc_row (this=0x7f0a2ec55010)
    at /usr/local/src/swoole/include/swoole_table.h:278
#2  swoole::Table::set (this=this@entry=0x7f0a2ec55010, key=0x7f0a2e893798 "39927417:9", keylen=10,
    rowlock=rowlock@entry=0x2d7b528, out_flags=out_flags@entry=0x2d7b4fc)
    at /usr/local/src/swoole/src/memory/table.cc:322
#3  0x00007f0a4ef9c711 in zim_swoole_table_incr (execute_data=0x7f0a2ebe8600,
    return_value=0x7f0a2ebe8560) at /usr/local/src/swoole/ext-src/swoole_table.cc:369
#4  0x00000000007f82cd in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at /usr/local/src/php/Zend/zend_vm_execute.h:1863
#5  execute_ex (ex=0x2a6a650) at /usr/local/src/php/Zend/zend_vm_execute.h:55176
#6  0x00007f0a4ef1d2dd in swoole::PHPCoroutine::main_func (arg=0x7ffeb60556b0)
    at /usr/local/src/swoole/ext-src/swoole_coroutine.cc:700
#7  0x00007f0a4efac680 in std::function<void (void*)>::operator()(void*) const (
    __args#0=<optimized out>, this=0x2b794a0) at /usr/include/c++/8/bits/std_function.h:682
#8  swoole::coroutine::Context::context_func (arg=0x2b794a0)
    at /usr/local/src/swoole/src/coroutine/context.cc:142
#9  0x00007f0a4f02fd31 in swoole_make_fcontext ()
    at /usr/local/src/swoole/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S:70
#10 0x0000000000000000 in ?? ()
(gdb)

8.0.23 (当前部署的版本)

(gdb) bt
#0  0x00007fb4f69f3aff in raise () from /usr/lib64/libc.so.6
#1  0x00007fb4f69c6ea5 in abort () from /usr/lib64/libc.so.6
#2  0x00007fb4f69c6d79 in __assert_fail_base.cold.0 () from /usr/lib64/libc.so.6
#3  0x00007fb4f69ec456 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00007fb4f2fc20c2 in swoole::FixedPool::alloc (this=0x36bf650, size=<optimized out>)
    at /usr/local/src/swoole/src/memory/fixed_pool.cc:149
#5  swoole::FixedPool::alloc (this=0x36bf650, size=<optimized out>)
    at /usr/local/src/swoole/src/memory/fixed_pool.cc:149
#6  0x00007fb4f2fc3c60 in swoole::Table::alloc_row (this=0x7fb4d2c55010)
    at /usr/local/src/swoole/include/swoole_table.h:278
#7  swoole::Table::set (this=this@entry=0x7fb4d2c55010, key=0x7faf725f36f8 "48142122:5", keylen=10,
    rowlock=rowlock@entry=0x39d02f8, out_flags=out_flags@entry=0x39d02cc)
    at /usr/local/src/swoole/src/memory/table.cc:322
#8  0x00007fb4f2f9c711 in zim_swoole_table_incr (execute_data=0x7faf72250600, return_value=0x7faf72250560)
    at /usr/local/src/swoole/ext-src/swoole_table.cc:369
#9  0x00000000007f82cd in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at /usr/local/src/php/Zend/zend_vm_execute.h:1863
#10 execute_ex (ex=0x2) at /usr/local/src/php/Zend/zend_vm_execute.h:55176
#11 0x00007fb4f2f1d2dd in swoole::PHPCoroutine::main_func (arg=0x7fff63934bb0)
    at /usr/local/src/swoole/ext-src/swoole_coroutine.cc:700
#12 0x00007fb4f2fac680 in std::function<void (void*)>::operator()(void*) const (__args#0=<optimized out>,
    this=0x3b60910) at /usr/include/c++/8/bits/std_function.h:682
#13 swoole::coroutine::Context::context_func (arg=0x3b60910)
    at /usr/local/src/swoole/src/coroutine/context.cc:142
#14 0x00007fb4f302fd31 in swoole_make_fcontext ()
    at /usr/local/src/swoole/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S:70
#15 0x0000000000200010 in ?? ()
#16 0x0000000000000021 in ?? ()
#17 0x0000000003a492e0 in ?? ()
#18 0x00007fb4f6d65c30 in main_arena () from /usr/lib64/libc.so.6
#19 0x0000000000000020 in ?? ()
#20 0x00000000000000a0 in ?? ()
#21 0x0000000000000040 in ?? ()
#22 0x0000000000000000 in ?? ()

两者的报错还不太一样。 @ @NathanFreeman 请问这两者我需要部署哪个版本更利于你们分析?

NathanFreeman commented 1 year ago

还要看看gdb之后, f 0 进入alloc函数,执行完FixedPoolSlice *slice = impl->head;这行代码之后的slice和impl->head的具体值。 p *slice p *impl->head

wuyiwai commented 1 year ago

p *impl->head

(gdb) p *slice
Cannot access memory at address 0x0
(gdb) p *impl->head
Cannot access memory at address 0x0
NathanFreeman commented 1 year ago

谢谢

matyhtf commented 1 year ago

@Wuyiwai 是否有其他进程执行了 Table::destroy() ?是否使用了一些多线程方面的扩展,比如 pthreads 、grpc、parallels

另外日志中有 status=255 可能是存在 PHP 致命错误,需要检查一下 php 的错误日志。

wuyiwai commented 1 year ago
  1. 是否有其他进程执行了 Table::destroy() ?> 目前代码没有显示调用的逻辑
  2. grpc的扩展有装,但代码里对这三个拓展都没有使用到比较原生的用法?用的都是swoole比较常见的一些东西:协程、table、Timer这些。如果我这样描述不够清晰的话,方便提供一些应用相关的思路吗?
  3. php_errors.log 的确有一些日志。不过感觉可能相关性不是特别大,我贴上来看看:
    1. Waring错误:
      [14-Dec-2022 20:51:26 Asia/Shanghai] PHP Warning:  Invalid "opcache.jit" setting. Should be "disable", "on", "off", "tracing", "function" or 4-digit number in Unknown on line 0
    2. Waring错误:该错误是由于之前在k8s上面部署服务时,由于之前踩过grpc扩展跟swoole冲突的坑,所以在k8s的yaml配置上grpc相关的扩展配置成''导致只要运行swoole就会报这个warnnig的错误,但不是fatal error
      [16-Jan-2023 10:09:26 UTC] PHP Warning:  PHP Startup: Unable to load dynamic library '' (tried: /usr/local/php/lib/php/extensions/no-debug-non-zts-20210902/ (/usr/local/php/lib/php/extensions/no-debug-non-zts-20210902/: cannot read file data: Is a directory), /usr/local/php/lib/php/extensions/no-debug-non-zts-20210902/.so (/usr/local/php/lib/php/extensions/no-debug-non-zts-20210902/.so: cannot open shared object file: No such file or directory)) in Unknown on line 0

      目前只记录到这两种错误。如果后续还需要什么信息我会继续跟进提供 @matyhtf

wuyiwai commented 1 year ago

还发现了一些别的coredump分析

warning: Can't open file (null) during file-backed mapping note processing
[New LWP 81667]
[New LWP 81668]
[New LWP 81669]
[New LWP 81670]
[New LWP 81671]
[New LWP 81672]
[New LWP 81673]
[New LWP 81674]
[New LWP 81675]
[New LWP 81676]
[New LWP 81677]
[New LWP 81678]
[New LWP 81679]
[New LWP 81680]
[New LWP 81681]
[New LWP 81682]
[New LWP 81683]
[New LWP 81684]
[New LWP 81685]
[New LWP 81686]
[New LWP 81687]
[New LWP 81688]
[New LWP 81689]
[New LWP 81690]
[New LWP 81691]
[New LWP 81692]
[New LWP 81693]
[New LWP 81694]
[New LWP 81695]
[New LWP 81696]
[New LWP 81697]
[New LWP 81698]
[New LWP 81699]

warning: Error reading shared library list entry at 0x294e0

warning: Error reading shared library list entry at 0x2528323340382d00
Failed to read a valid object file image from memory.
Core was generated by `mail unread server 3 event worker                                             '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f9b35406378 in ?? ()
[Current thread is 1 (LWP 81667)]
(gdb) bt
#0  0x00007f9b35406378 in ?? ()
#1  0x00007f9b35407fb0 in ?? ()
#2  0x00000000027eae58 in ?? ()
#3  0x0000000a00000000 in ?? ()
#4  0x00007f9b14d456d8 in ?? ()
#5  0x00007f9b14d94ac0 in ?? ()
#6  0x00007f9b14ff0010 in ?? ()
#7  0x00007f9b14d5f678 in ?? ()
#8  0x0000001a00000002 in ?? ()
#9  0x00007f9b14d94420 in ?? ()
#10 0x00007f9b14d94900 in ?? ()
#11 0x00007f9b353e0a21 in ?? ()
#12 0x0000000000000001 in ?? ()
#13 0x00007f9b14d44298 in ?? ()
#14 0x0000000000000000 in ?? ()
matyhtf commented 1 year ago

@Wuyiwai 上面的 core 信息无效了,堆栈已破坏,无法分析问题。可以关闭 grpc 试试,另外在 gdb core 中可以使用 info threads 看看是否有其他的线程正在执行什么操作。

status 255 比较可以,包括 php error_log 、swoole log_file 都需要看一下,是否存在 fatal error

另外对 table 的操作只有 set、incr、del 这些吗?key 就是类似于 '48142122:5' 这样的格式,请提供一下 table create 时的参数

wuyiwai commented 1 year ago

@matyhtf

  1. 创建table时的参数
    $table = new Table((int)(256 * 65536));
    $table->column('count', 1 4);
    $table->column('company_count',1, 4);
    $table->column('colleague_count', 1, 4);
    $table->column('contact_count', 1, 4);
    $table->create();
    $this->userMailFolderMap = $table;
  2. key是类似"48142122:5",用":"做连接符
  3. 目前只有incr和se和del操作,set一般是初始化的时候会用,incr用的比较多,有些场景是可能不存在该key就incr了,但我查了文档和自测,跟文档描述的一样,incr key不存在时默认0,这个没报错。
  4. 目前php error_log 只有
    
    [14-Dec-2022 20:51:26 Asia/Shanghai] PHP Warning:  Invalid "opcache.jit" setting. Should be "disable", "on", "off", "tracing", "function" or 4-digit number in Unknown on line 0

[16-Jan-2023 10:09:26 UTC] PHP Warning: PHP Startup: Unable to load dynamic library '' (tried: /usr/local/php/lib/php/extensions/no-debug-non-zts-20210902/ (/usr/local/php/lib/php/extensions/no-debug-non-zts-20210902/: cannot read file data: Is a directory), /usr/local/php/lib/php/extensions/no-debug-non-zts-20210902/.so (/usr/local/php/lib/php/extensions/no-debug-non-zts-20210902/.so: cannot open shared object file: No such file or directory)) in Unknown on line 0

swoole log_file 目前在k8s上设置为false所以没有收集到文件,是配置了日志收集stdout输出的日志,不过去搜了发现有一些新的结果,可以留意下,大概有一下几种:

[2023-01-17 02:58:11 *37353.3] WARNING TableRow::lock(): lock process[36787] not exists, force unlock

[2023-01-17 02:51:13 *36961.0] WARNING TableRow::lock(): timeout, force unlock

[2023-01-17 02:59:32 $13.0] WARNING Server::check_worker_exit_status(): worker(pid=37127, id=2) abnormal exit, status=0, signal=11

WARNING Server::check_worker_exit_status(): worker(pid=31972, id=3) abnormal exit, status=255, signal=0

6. info threads

(gdb) info threads Id Target Id Frame

wuyiwai commented 1 year ago

我去看了下 源码里面的 table.cc的lock方法的描述:

lock process[%d] not exists, force unlock 对应
/**
 * The process occupied by the resource no longer exists,
 * indicating that OOM occurred during the locking process,
 * forced to unlock
 */

timeout, force unlock 对应:
/**
 * The deadlock time exceeds 2 seconds (SW_TABLE_FORCE_UNLOCK_TIME),
 * indicating that the lock process has OOM,
 * and the PID has been reused, forcing the unlock
 */

不过我没能想到我的代码里有哪些应用场景有可能会触发这些。

  1. 目前swoole应用层没有能调用lock和unlock的地方,之前移除了。我看了下fixed_pool.cc,感觉可能是在set或者incr时会先获取内存,但pool满了,拿不到就报错了。我基于这个猜测展开下面的定位:
  2. 我目前的业务代码有两处调用set和incr。
    1. 初始化业务数据时
      $key = $this->buildUserMailFolderKey($userMailId, $folderId);
      $countData = $countMap[$userMailId][$folderId] ?? $defaultCountData;
      $setRet = $this->userMailFolderMap->set($key, $countData);
    2. 对业务数据进行更新时
      foreach ($params as $item)
      {
      $key = $this->buildUserMailFolderKey($userMailId, $folderId);
      $countRet = $this->userMailFolderMap->incr($key, 'count', $count);
      $companyCountRet = $this->userMailFolderMap->incr($key, 'company_count', $companyCount);
      $colleagueCountRet = $this->userMailFolderMap->incr($key, 'colleague_count', $colleagueCount);
      $contactCountRet = $this->userMailFolderMap->incr($key, 'contact_count', $contactCount);
      }
  3. 业务代码只有上面两处是有做set和incr的操作的。我预先分配了 256 * 65536 的size,按理说也不会不够用。目前的思路是以上这些 @matyhtf @NathanFreeman
wuyiwai commented 1 year ago

@matyhtf @NathanFreeman 我还需要提供什么信息以推进该问题的定位吗

wuyiwai commented 1 year ago

@matyhtf @NathanFreeman 刚刚我去对比了swoole的版本更新日志,涉及table的有多个迭代,其中一些bug修复提到了修复相关的内存错误,我想到之前应该是有issue或者有什么方式是可以正确的反馈能帮助定位table的问题的。我额外想到一些不知道有没有关系的信息:我之前的服务在4.4.16版本上运行了2年多没有报错,在5.0.1、4.8版本上运行后的core dump里有关于table的错误。以下是关于table的更新的一些列举:

matyhtf commented 1 year ago

@Wuyiwai 可能是在hash冲突分配动态内存时发生了 OOM,查看 dmesg -T 日志,是否有 OOM 信息,错误日志中 lock pid 检测说明有一个 worker 进程在 lock 阶段异常退出了。不清楚 crash 的顺序:

  1. 是先发生 crash ,然后出现 lock pid not exists 错误
  2. 还是先有 OOM ,然后出现 lock pid not exists 错误,最后第二个 worker 进程发生 crash
wuyiwai commented 1 year ago
$ dmesg -T | grep --color -E 'kill|oom|out of memory'
$ 无日志

$ dmesg -T | grep --color -E 'error|warn|fatal|fail|oom'
没有发现oom的信息,我将其他日志贴上来,信息有些多,我大概省略一些用...表示相同部分

[Tue Jan 17 05:05:18 2023] php[811442]: segfault at 0 ip 00007fe44fe06378 sp 00000000026e86f8 error 4 in swoole.so[7fe44fc92000+291000]
...
[Tue Jan 17 20:06:49 2023] php[923999]: segfault at 10 ip 00007f3faf606390 sp 0000000002dc6cb8 error 6 in swoole.so[7f3faf492000+291000]
[Tue Jan 17 20:46:06 2023] php[2141929]: segfault at 0 ip 00007f3faf606378 sp 00000000023c19a8 error 4 in swoole.so[7f3faf492000+291000]
...

其中error 6比较少,error 4比较多

 * Page fault error code bits:
 *
 *   bit 0 == 0: no page found1: protection fault
 *   bit 1 == 0: read access1: write access
 *   bit 2 == 0: kernel-mode access1: user-mode access
 *   bit 3 == 1: use of reserved bit detected
 *   bit 4 == 1: fault was an instruction fetch

先针对error 4进行分析
error 4(100 = user-mode access & read access): IP 指令指针寄存器 地址为 00007f3faf606378 偏移量 7f3faf492000 => 0x174378
error 6(110 = user-mode access & write access): IP 指令指针寄存器 地址为 00007f3faf606390 偏移量 7f3faf492000 => 0x174390

$ addr2line -e swoole.so -fCi 0x174378
swoole::FixedPool::alloc(unsigned int)
/usr/local/src/swoole/src/memory/fixed_pool.cc:151

$ addr2line -e swoole.so -fCi 0x174390
swoole::FixedPool::alloc(unsigned int)
/usr/local/src/swoole/src/memory/fixed_pool.cc:162

相关`fixed_pool.cc`的5.0.1版本代码
void *FixedPool::alloc(uint32_t size) {
    FixedPoolSlice *slice = impl->head;
    if (slice->lock) { // 第151行
        swoole_set_last_error(SW_ERROR_MALLOC_FAIL);
        assert(get_number_of_spare_slice() == 0);
        return nullptr;
    }

    slice->lock = 1;
    impl->slice_use++;

    // move next slice to head (idle list)
    impl->head = slice->next;
    impl->head->prev = nullptr; // 第162行

    // move this slice to tail (busy list)
    impl->tail->next = slice;
    slice->next = nullptr;
    slice->prev = impl->tail;
    impl->tail = slice;

    return slice->data;
}

@matyhtf

matyhtf commented 1 year ago

@Wuyiwai 加一下我的微信,19921030512 方便沟通,需要本地压测试试看能不能重现

wuyiwai commented 11 months ago

重新用php 8.1.20 和swoole 5.0.3 部署了还是会有这个情况

php --ri swoole

swoole

Swoole => enabled
Author => Swoole Team <team@swoole.com>
Version => 5.1.0-dev
Built => Jun 20 2023 04:02:10
coroutine => enabled with boost asm context
epoll => enabled
eventfd => enabled
signalfd => enabled
cpu_affinity => enabled
spinlock => enabled
rwlock => enabled
openssl => OpenSSL 1.1.1k  FIPS 25 Mar 2021
dtls => enabled
http2 => enabled
json => enabled
curl-native => enabled
zlib => 1.2.11
mutex_timedlock => enabled
pthread_barrier => enabled
futex => enabled
async_redis => enabled
coroutine_pgsql => enabled

Directive => Local Value => Master Value
swoole.enable_coroutine => On => On
swoole.enable_library => On => On
swoole.enable_fiber_mock => Off => Off
swoole.enable_preemptive_scheduler => Off => Off
swoole.display_errors => On => On
swoole.use_shortname => On => On
swoole.unixsock_buffer_size => 8388608 => 8388608

sh-4.4# php -v
PHP 8.1.20 (cli) (built: Jun 20 2023 03:59:52) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.1.20, Copyright (c) Zend Technologies
    with Zend OPcache v8.1.20, Copyright (c), by Zend Technologies

最新捕捉的上下文: coredump相关

(gdb) bt
#0  0x00007fdaf4ad3acf in raise () from /usr/lib64/libc.so.6
#1  0x00007fdaf4aa6ea5 in abort () from /usr/lib64/libc.so.6
#2  0x00007fdaf4aa6d79 in __assert_fail_base.cold.0 () from /usr/lib64/libc.so.6
#3  0x00007fdaf4acc426 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00007fdaf0fef262 in swoole::FixedPool::alloc (this=0x32d0630, size=<optimized out>)
    at /usr/local/src/swoole-src-master/src/memory/fixed_pool.cc:149
#5  swoole::FixedPool::alloc (this=0x32d0630, size=<optimized out>) at /usr/local/src/swoole-src-master/src/memory/fixed_pool.cc:149
#6  0x00007fdaf0ff0dc0 in swoole::Table::alloc_row (this=0x7fdad0c6c010) at /usr/local/src/swoole-src-master/include/swoole_table.h:278
#7  swoole::Table::set (this=this@entry=0x7fdad0c6c010, key=key@entry=0x7fd56fe97d08 "48221197:2520849", keylen=keylen@entry=16,
    rowlock=rowlock@entry=0x35c55f8, out_flags=out_flags@entry=0x35c55f4) at /usr/local/src/swoole-src-master/src/memory/table.cc:318
#8  0x00007fdaf0fc784f in zim_swoole_table_set (execute_data=0x7fd56fc40a20, return_value=0x7fd56fc409a0)
    at /usr/local/src/swoole-src-master/ext-src/swoole_table.cc:295
#9  0x00000000007f7744 in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/local/src/php/Zend/zend_vm_execute.h:1870
#10 execute_ex (ex=0x2) at /usr/local/src/php/Zend/zend_vm_execute.h:55811
#11 0x0000000000783927 in zend_call_function (fci=fci@entry=0x7fd5ed0e3a68, fci_cache=fci_cache@entry=0x7fd5ed0e3aa8)
    at /usr/local/src/php/Zend/zend_execute_API.c:929
#12 0x00007fdaf0f3c8b1 in swoole::PHPCoroutine::main_func (_args=<optimized out>)
    at /usr/local/src/swoole-src-master/ext-src/swoole_coroutine.cc:694
#13 0x00007fdaf0fd7480 in std::function<void (void*)>::operator()(void*) const (__args#0=<optimized out>, this=0x36530d0)
    at /usr/include/c++/8/bits/std_function.h:682
#14 swoole::coroutine::Context::context_func (arg=0x36530d0) at /usr/local/src/swoole-src-master/src/coroutine/context.cc:142
#15 0x00007fdaf1061991 in swoole_make_fcontext ()
    at /usr/local/src/swoole-src-master/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S:70
#16 0x0000000000200010 in ?? ()
#17 0x0000000000000111 in ?? ()
#18 0x000000060000011a in ?? ()
#19 0x0000000000000001 in ?? ()
#20 0x0000000002000d00 in ?? ()
#21 0x0000000000000000 in ?? ()

9点54分:同时间的业务日志:

event worker: /usr/local/src/swoole-src-master/src/memory/fixed_pool.cc:153: virtual void* swoole::FixedPool::alloc(uint32_t): Assertion `get_number_of_spare_slice() == 0' failed.

相关代码如下:

foreach ($map as $key => $item)
{
    $table->set($key, $value);
}
也没复杂调用,可以确认没有并发 @matyhtf @NathanFreeman 

下面是我的一些额外的猜想: 程序初始化table的时候代码如下:

$table = new Table((int)(256 * 65536));  也就是size = 16777216 table默认的$conflict_proportion是0.2
$table->column('count', Table::TYPE_INT, 4);
$table->column('company_count',Table::TYPE_INT, 4);
$table->column('colleague_count', Table::TYPE_INT, 4);
$table->column('contact_count', Table::TYPE_INT, 4);
$table->create();
$this->userMailFolderMap = $table;

为了尽可能捕捉报错的信息,我额外打印了一些日志:

              [tablename] [size]                          [memory]                              [percent]
9.52分:[    map    ] [         3512268         ] [   0.45000015199184 G    ] [          20.93          ]

更早一些的日志:

9:42 [    userMailFolderMap    ] [         3500278         ] [   0.45000015199184 G    ] [          20.86          ]
9:32 [    userMailFolderMap    ] [         3487470         ] [   0.45000015199184 G    ] [          20.79          ]
9:22 [    userMailFolderMap    ] [         3472632         ] [   0.45000015199184 G    ] [          20.7           ]

这个记录是十分钟记录一次memory信息 .不清楚这个conflict_proportion = 0.2 和table的设置有没有关系