miaozhiqiang / memcached

Automatically exported from code.google.com/p/memcached
0 stars 0 forks source link

Memcached 1.4.23 crashes on stats command #403

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. "memcached -d"
2. telnet localhost 11211
3. stats
-> CRASH (100% reproducible)

Regression over 1.4.22
Tested on FreeBSD 10.1

Original issue reported on code.google.com by arne.cl...@gmail.com on 20 Apr 2015 at 8:07

GoogleCodeExporter commented 8 years ago
Can you please provide a backtrace? This definitely doesn't happen on my 
platforms.

Original comment by dorma...@rydia.net on 20 Apr 2015 at 7:00

GoogleCodeExporter commented 8 years ago
Built with ./configure -O0 -g
FreeBSD 10.1-RELEASE-p9 #0
Memcached 1.4.23
libevent2-2.0.22_1

GDB output

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 801806800 (LWP 100454/memcached)]
0x0000000800854631 in pthread_mutex_destroy () from /lib/libthr.so.3
(gdb) bt
#0  0x0000000800854631 in pthread_mutex_destroy () from /lib/libthr.so.3
#1  0x0000000000418aef in item_stats_totals (add_stats=0x40e5a0 <append_stats>, 
c=0x80195a200) at items.c:506
#2  0x00000000004145a6 in get_stats (stat_type=0x0, nkey=0, add_stats=0x40e5a0 
<append_stats>, c=0x80195a200) at slabs.c:309
#3  0x0000000000412026 in process_stat (c=0x80195a200, tokens=0x7fffffbfdbc0, 
ntokens=2) at memcached.c:2808
#4  0x00000000004100a8 in process_command (c=0x80195a200, command=0x801967800 
"stats") at memcached.c:3491
#5  0x0000000000409c95 in try_read_command (c=0x80195a200) at memcached.c:3796
#6  0x00000000004086a8 in drive_machine (c=0x80195a200) at memcached.c:4141
#7  0x000000000040380e in event_handler (fd=26, which=2, arg=0x80195a200) at 
memcached.c:4386
#8  0x0000000800c7b3be in event_base_loop () from 
/usr/local/lib/libevent-2.0.so.5
#9  0x000000000041edcb in worker_libevent (arg=0x801924000) at thread.c:379
#10 0x000000080084e4f5 in pthread_create () from /lib/libthr.so.3
#11 0x0000000000000000 in ?? ()
(gdb) info threads
  6 Thread 801806400 (LWP 100441/memcached)  0x0000000800fd204a in kevent () from /lib/libc.so.7
* 5 Thread 801806800 (LWP 100454/memcached)  0x0000000800854631 in 
pthread_mutex_destroy () from /lib/libthr.so.3
  4 Thread 801806c00 (LWP 100455/memcached)  0x0000000800fd204a in kevent () from /lib/libc.so.7
  3 Thread 801807000 (LWP 100456/memcached)  0x0000000800fd204a in kevent () from /lib/libc.so.7
  2 Thread 801807400 (LWP 100457/memcached)  0x0000000800fd204a in kevent () from /lib/libc.so.7
  1 Thread 801807800 (LWP 100458/memcached)  0x00000008008598cc in __error () from /lib/libthr.so.3
(gdb) thread 1
[Switching to thread 1 (Thread 801807800 (LWP 100458/memcached))]#0  
0x00000008008598cc in __error () from /lib/libthr.so.3
(gdb) bt
#0  0x00000008008598cc in __error () from /lib/libthr.so.3
#1  0x0000000800857d8e in _pthread_cond_wait () from /lib/libthr.so.3
#2  0x000000000041cd2d in assoc_maintenance_thread (arg=0x0) at assoc.c:261
#3  0x000000080084e4f5 in pthread_create () from /lib/libthr.so.3
#4  0x0000000000000000 in ?? ()
(gdb) thread 6
[Switching to thread 6 (Thread 801806400 (LWP 100441/memcached))]#0  
0x0000000800fd204a in kevent () from /lib/libc.so.7
(gdb) bt
#0  0x0000000800fd204a in kevent () from /lib/libc.so.7
#1  0x0000000800c8de58 in evutil_secure_rng_add_bytes () from 
/usr/local/lib/libevent-2.0.so.5
#2  0x0000000800c7afa3 in event_base_loop () from 
/usr/local/lib/libevent-2.0.so.5
#3  0x00000000004067e9 in main (argc=2, argv=0x7fffffffeba0) at memcached.c:5724
(gdb)

Original comment by arne.cl...@gmail.com on 21 Apr 2015 at 6:49

GoogleCodeExporter commented 8 years ago
Some more info on Stack Frame 1 (item_stats_totals)

(gdb) frame 1
#1  0x0000000000418aef in item_stats_totals (add_stats=0x40e5a0 <append_stats>, 
c=0x80195a200) at items.c:506
506             pthread_mutex_lock(&lru_locks[i]);
Current language:  auto; currently minimal
(gdb) list
501     for (n = 0; n < MAX_NUMBER_OF_SLAB_CLASSES; n++) {
502         int x;
503         int i;
504         for (x = 0; x < 4; x++) {
505             i = n | lru_type_map[x];
506             pthread_mutex_lock(&lru_locks[i]);
507             totals.expired_unfetched += itemstats[i].expired_unfetched;
508             totals.evicted_unfetched += itemstats[i].evicted_unfetched;
509             totals.evicted += itemstats[i].evicted;
510             totals.reclaimed += itemstats[i].reclaimed;
(gdb) print i
$1 = 255
(gdb) print n
$2 = 63
(gdb) print lru_locks
$3 = {0x801818150, 0x8018181a0, 0x8018181f0, 0x801818240, 0x801818290, 
0x8018182e0, 0x801818330, 0x801818380, 0x8018183d0, 0x801818420, 0x801818470,
  0x8018184c0, 0x801818510, 0x801818560, 0x8018185b0, 0x801818600, 0x801818650, 0x8018186a0, 0x8018186f0, 0x801818740, 0x801818790, 0x8018187e0, 0x801818830,
  0x801818880, 0x8018188d0, 0x801818920, 0x801818970, 0x8018189c0, 0x801818a10, 0x801818a60, 0x801818ab0, 0x801818b00, 0x801818b50, 0x801818ba0, 0x801818bf0,
  0x801818c40, 0x801818c90, 0x801818ce0, 0x801818d30, 0x801818d80, 0x801818dd0, 0x801818e20, 0x801818e70, 0x801818ec0, 0x801818f10, 0x801818f60, 0x801818fb0,
  0x8018c5060, 0x8018c50b0, 0x8018c5100, 0x8018c5150, 0x8018c51a0, 0x8018c51f0, 0x8018c5240, 0x8018c5290, 0x8018c52e0, 0x8018c5330, 0x8018c5380, 0x8018c53d0,
  0x8018c5420, 0x8018c5470, 0x8018c54c0, 0x8018c5510, 0x8018c5560, 0x8018c55b0, 0x8018c5600, 0x8018c5650, 0x8018c56a0, 0x8018c56f0, 0x8018c5740, 0x8018c5790,
  0x8018c57e0, 0x8018c5830, 0x8018c5880, 0x8018c58d0, 0x8018c5920, 0x8018c5970, 0x8018c59c0, 0x8018c5a10, 0x8018c5a60, 0x8018c5ab0, 0x8018c5b00, 0x8018c5b50,
  0x8018c5ba0, 0x8018c5bf0, 0x8018c5c40, 0x8018c5c90, 0x8018c5ce0, 0x8018c5d30, 0x8018c5d80, 0x8018c5dd0, 0x8018c5e20, 0x8018c5e70, 0x8018c5ec0, 0x8018c5f10,
  0x8018c5f60, 0x8018c5fb0, 0x8018c6060, 0x8018c60b0, 0x8018c6100, 0x8018c6150, 0x8018c61a0, 0x8018c61f0, 0x8018c6240, 0x8018c6290, 0x8018c62e0, 0x8018c6330,
  0x8018c6380, 0x8018c63d0, 0x8018c6420, 0x8018c6470, 0x8018c64c0, 0x8018c6510, 0x8018c6560, 0x8018c65b0, 0x8018c6600, 0x8018c6650, 0x8018c66a0, 0x8018c66f0,
  0x8018c6740, 0x8018c6790, 0x8018c67e0, 0x8018c6830, 0x8018c6880, 0x8018c68d0, 0x8018c6920, 0x8018c6970, 0x8018c69c0, 0x8018c6a10, 0x8018c6a60, 0x8018c6ab0,
  0x8018c6b00, 0x8018c6b50, 0x8018c6ba0, 0x8018c6bf0, 0x8018c6c40, 0x8018c6c90, 0x8018c6ce0, 0x8018c6d30, 0x8018c6d80, 0x8018c6dd0, 0x8018c6e20, 0x8018c6e70,
  0x8018c6ec0, 0x8018c6f10, 0x8018c6f60, 0x8018c6fb0, 0x8018c7060, 0x8018c70b0, 0x8018c7100, 0x8018c7150, 0x8018c71a0, 0x8018c71f0, 0x8018c7240, 0x8018c7290,
  0x8018c72e0, 0x8018c7330, 0x8018c7380, 0x8018c73d0, 0x8018c7420, 0x8018c7470, 0x8018c74c0, 0x8018c7510, 0x8018c7560, 0x8018c75b0, 0x8018c7600, 0x8018c7650,
  0x8018c76a0, 0x8018c76f0, 0x8018c7740, 0x8018c7790, 0x8018c77e0, 0x8018c7830, 0x8018c7880, 0x8018c78d0, 0x8018c7920, 0x8018c7970, 0x8018c79c0, 0x8018c7a10,
  0x8018c7a60, 0x8018c7ab0, 0x8018c7b00, 0x8018c7b50, 0x8018c7ba0, 0x8018c7bf0, 0x8018c7c40, 0x8018c7c90, 0x8018c7ce0, 0x8018c7d30, 0x8018c7d80, 0x8018c7dd0,
  0x8018c7e20, 0x8018c7e70, 0x8018c7ec0, 0x8018c7f10, 0x8018c7f60, 0x8018c7fb0, 0x8018c8060, 0x8018c80b0, 0x8018c8100...}
(gdb) print lru_locks[i]
$4 = 0xc
(gdb) print x
$5 = 3
(gdb) print lru_type_map[x]
$6 = 192
(gdb)

Original comment by arne.cl...@gmail.com on 21 Apr 2015 at 6:58

GoogleCodeExporter commented 8 years ago
#define POWER_LARGEST  255
pthread_mutex_t lru_locks[POWER_LARGEST];

I guess there is a +1 missing for the second line as
- MAX_NUMBER_OF_SLAB_CLASSES = 64
- n = [0..63]
- lru_type_map[3] == NOEXP_LRU == 192 == 0b11000000
- 63 = 0b00111111
=> 255 is totally reasonable

Original comment by arne.cl...@gmail.com on 21 Apr 2015 at 7:14

GoogleCodeExporter commented 8 years ago
items.h:27 -> extern pthread_mutex_t lru_locks[POWER_LARGEST+1];
thread.c:40 -> pthread_mutex_t lru_locks[POWER_LARGEST+1];
thread.c:728 -> for (i = 0; i <= POWER_LARGEST; i++) {

Fixes the problem for me but I did not check if it does make sense with all the 
arrays depending on POWER_LARGEST.

Original comment by arne.cl...@gmail.com on 21 Apr 2015 at 7:32

GoogleCodeExporter commented 8 years ago
Shit. I fixed an off-by-one with MAX_SLAB_ETC as one of the last commits. Guess 
that pushed the POWER_LARGEST value over the limit :(

Wonder why that doesn't crash for me. I'll take a more careful look at this as 
soon as I can.

Original comment by dorma...@rydia.net on 21 Apr 2015 at 7:46

GoogleCodeExporter commented 8 years ago
I don't get a hard crash here, but I do get a hard lockup. This looks like a 
similar situation to the comment I just made on #404- things get weird when 
hardware lock elision comes into play. It works fine, no lockups, on my older 
x86_64 machine.

0x00007f9c3d9b2883 in epoll_wait () from /usr/lib/libc.so.6
(gdb) thread apply all backtrace

Thread 6 (Thread 0x7f9c3d8c9700 (LWP 12250)):
#0  0x00007f9c3d9b2883 in epoll_wait () from /usr/lib/libc.so.6
#1  0x00007f9c3deaf4d8 in ?? () from /usr/lib/libevent-2.0.so.5
#2  0x00007f9c3de9a61a in event_base_loop () from /usr/lib/libevent-2.0.so.5
#3  0x000000000041add4 in worker_libevent (arg=0x817480) at thread.c:379
#4  0x00007f9c3dc74374 in start_thread () from /usr/lib/libpthread.so.0
#5  0x00007f9c3d9b227d in clone () from /usr/lib/libc.so.6

Thread 5 (Thread 0x7f9c3d0c8700 (LWP 12251)):
#0  0x00007f9c3dc7c64c in __lll_lock_wait () from /usr/lib/libpthread.so.0
#1  0x00007f9c3dc7f090 in __lll_lock_elision () from /usr/lib/libpthread.so.0
#2  0x00000000004183c9 in item_stats_totals (add_stats=add_stats@entry=0x4067d0 
<append_stats>, c=c@entry=0x7f9c34026200) at items.c:506
#3  0x0000000000414c57 in get_stats (stat_type=stat_type@entry=0x0, 
nkey=nkey@entry=0, add_stats=add_stats@entry=0x4067d0 <append_stats>, 
c=c@entry=0x7f9c34026200)
    at slabs.c:309
#4  0x000000000040ef1d in process_stat (ntokens=<optimized out>, 
tokens=0x7f9c3d0c7cd0, c=0x7f9c34026200) at memcached.c:2808
#5  process_command (c=c@entry=0x7f9c34026200, command=<optimized out>) at 
memcached.c:3491
#6  0x000000000040fc39 in try_read_command (c=0x7f9c34026200) at 
memcached.c:3796
#7  drive_machine (c=0x7f9c34026200) at memcached.c:4141
#8  event_handler (fd=<optimized out>, which=<optimized out>, 
arg=0x7f9c34026200) at memcached.c:4386
#9  0x00007f9c3de9aca6 in event_base_loop () from /usr/lib/libevent-2.0.so.5
#10 0x000000000041add4 in worker_libevent (arg=0x8185c8) at thread.c:379
#11 0x00007f9c3dc74374 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007f9c3d9b227d in clone () from /usr/lib/libc.so.6

Thread 4 (Thread 0x7f9c3c8c7700 (LWP 12252)):
#0  0x00007f9c3d9b2883 in epoll_wait () from /usr/lib/libc.so.6
#1  0x00007f9c3deaf4d8 in ?? () from /usr/lib/libevent-2.0.so.5
#2  0x00007f9c3de9a61a in event_base_loop () from /usr/lib/libevent-2.0.so.5
#3  0x000000000041add4 in worker_libevent (arg=0x819710) at thread.c:379
#4  0x00007f9c3dc74374 in start_thread () from /usr/lib/libpthread.so.0
#5  0x00007f9c3d9b227d in clone () from /usr/lib/libc.so.6

Thread 3 (Thread 0x7f9c3c0c6700 (LWP 12253)):
#0  0x00007f9c3d9b2883 in epoll_wait () from /usr/lib/libc.so.6
#1  0x00007f9c3deaf4d8 in ?? () from /usr/lib/libevent-2.0.so.5
#2  0x00007f9c3de9a61a in event_base_loop () from /usr/lib/libevent-2.0.so.5
#3  0x000000000041add4 in worker_libevent (arg=0x81a858) at thread.c:379
#4  0x00007f9c3dc74374 in start_thread () from /usr/lib/libpthread.so.0
#5  0x00007f9c3d9b227d in clone () from /usr/lib/libc.so.6

Thread 2 (Thread 0x7f9c3b8c5700 (LWP 12254)):
#0  0x00007f9c3dc799af in pthread_cond_wait@@GLIBC_2.3.2 () from 
/usr/lib/libpthread.so.0
#1  0x000000000041a510 in assoc_maintenance_thread (arg=<optimized out>) at 
assoc.c:261
#2  0x00007f9c3dc74374 in start_thread () from /usr/lib/libpthread.so.0
#3  0x00007f9c3d9b227d in clone () from /usr/lib/libc.so.6

Thread 1 (Thread 0x7f9c3e2de700 (LWP 12249)):
#0  0x00007f9c3d9b2883 in epoll_wait () from /usr/lib/libc.so.6
#1  0x00007f9c3deaf4d8 in ?? () from /usr/lib/libevent-2.0.so.5
#2  0x00007f9c3de9a61a in event_base_loop () from /usr/lib/libevent-2.0.so.5
#3  0x0000000000404eea in main (argc=0, argv=0x0) at memcached.c:5724

works:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
stepping        : 10
microcode       : 0xa0b
cpu MHz         : 2000.000
cache size      : 3072 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor 
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm 
tpr_shadow vnmi flexpriority

broken:

vendor_id   : GenuineIntel
cpu family  : 6
model       : 60
model name  : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
stepping    : 3
microcode   : 0x12
cpu MHz     : 3874.539
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 7
initial apicid  : 7
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat 
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp 
lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc 
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 
fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer 
aes xsave avx f16c rdrand lahf_lm abm ida arat pln pts dtherm tpr_shadow vnmi 
flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid 
rtm xsaveopt

Original comment by dpmc...@gmail.com on 23 Apr 2015 at 2:01

GoogleCodeExporter commented 8 years ago
I suppose I forgot to say that the above GDB backtrace was captured after 
running the steps to reproduce. The telnet session returned no response to the 
stats command for me, and then I captured the above backtrace.

Original comment by dpmc...@gmail.com on 23 Apr 2015 at 2:11

GoogleCodeExporter commented 8 years ago
Seems like everything using LARGEST_ID and POWER_LARGEST were comparing via "<" 
already, so upping LARGEST_ID by one should cover all of the other off-by-one 
cases I introduced earlier.

Can you folks please try https://github.com/memcached/memcached (master branch) 
and ping this issue on if it repairs your problem completey? All tests should 
run.

Original comment by dorma...@rydia.net on 24 Apr 2015 at 7:05

GoogleCodeExporter commented 8 years ago
Does not compile with clang:

items.c:274:28: error: comparison of constant 256 with expression of type 
'uint8_t' (aka 'unsigned char') is always true
      [-Werror,-Wtautological-constant-out-of-range-compare]
    assert(it->slabs_clsid < LARGEST_ID);
           ~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~
/usr/include/assert.h:54:21: note: expanded from macro 'assert'
#define assert(e)       ((e) ? (void)0 : __assert(__func__, __FILE__, \
                          ^
items.c:298:28: error: comparison of constant 256 with expression of type 
'uint8_t' (aka 'unsigned char') is always true
      [-Werror,-Wtautological-constant-out-of-range-compare]
    assert(it->slabs_clsid < LARGEST_ID);
           ~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~
/usr/include/assert.h:54:21: note: expanded from macro 'assert'
#define assert(e)       ((e) ? (void)0 : __assert(__func__, __FILE__, \
                          ^
items.c:1097:28: error: comparison of constant 256 with expression of type 
'uint8_t' (aka 'unsigned char') is always true
      [-Werror,-Wtautological-constant-out-of-range-compare]
    assert(it->slabs_clsid < LARGEST_ID);
           ~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~
/usr/include/assert.h:54:21: note: expanded from macro 'assert'
#define assert(e)       ((e) ? (void)0 : __assert(__func__, __FILE__, \
                          ^
items.c:1119:28: error: comparison of constant 256 with expression of type 
'uint8_t' (aka 'unsigned char') is always true
      [-Werror,-Wtautological-constant-out-of-range-compare]
    assert(it->slabs_clsid < LARGEST_ID);
           ~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~
/usr/include/assert.h:54:21: note: expanded from macro 'assert'
#define assert(e)       ((e) ? (void)0 : __assert(__func__, __FILE__, \
                          ^
items.c:1145:28: error: comparison of constant 256 with expression of type 
'uint8_t' (aka 'unsigned char') is always true
      [-Werror,-Wtautological-constant-out-of-range-compare]
    assert(it->slabs_clsid < LARGEST_ID);
           ~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~

Original comment by arne.cl...@gmail.com on 24 Apr 2015 at 7:20

GoogleCodeExporter commented 8 years ago
seriously, clang? I have to go swap all of the <'s and <='s? :/

Well if you're still reading and aren't building with clang I'd like to know if 
this fixes the issue still.

Original comment by dorma...@rydia.net on 24 Apr 2015 at 7:31

GoogleCodeExporter commented 8 years ago
This fixes it for me, all tests now pass.  Great stuff, I shall go ahead and 
test this release hard on a real machine now.

Original comment by markhag...@gmail.com on 24 Apr 2015 at 8:36

GoogleCodeExporter commented 8 years ago
Well it does have a point because that assert does not really make much sense 
as a byte is always < 256.

Original comment by arne.cl...@gmail.com on 24 Apr 2015 at 8:37

GoogleCodeExporter commented 8 years ago
Removed the asserts -> compiles -> stats do now work again.

Original comment by arne.cl...@gmail.com on 24 Apr 2015 at 8:39

GoogleCodeExporter commented 8 years ago
Stats works for me now as well on the box that was unresponsive, once the patch 
is applied. Thanks!

Original comment by dpmc...@gmail.com on 24 Apr 2015 at 2:47

GoogleCodeExporter commented 8 years ago
1.4.24 is out, please try if you haven't already

Original comment by dorma...@rydia.net on 25 Apr 2015 at 9:01