sianla / memcached

Automatically exported from code.google.com/p/memcached

memcached 1.4.23 fails test binary_prependq on Solaris 10 Sparc 64 Bit #404

Closed: GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

Building memcached 1.4.23 and running the testsuite

What is the expected output? What do you see instead?

These are the last lines from the testsuite:
ok 46 - binary_prepend
ok 47 - binary_prependq
Timeout.. killing the process
Assertion failed: nr != 0, file testapp.c, line 725
Makefile:1482: recipe for target 'test' failed
gmake: *** [test] Abort (core dumped)
gmake: Leaving directory '/home/dam/mgar/pkg/memcached/trunk/work/solaris10-sparc/build-isa-sparcv9/memcached-1.4.23'

What version of the product are you using? On what operating system?
1.4.23, Solaris 10 Sparc with Sun Studio 12

Please provide any additional information below.

Original issue reported on code.google.com by honkma...@googlemail.com on 20 Apr 2015 at 12:53

GoogleCodeExporter commented 8 years ago
There are also some different tests failing on Solaris 10 amd64:

slab class  42: chunk size   1048576 perslab       1
<24 server listening (binary)
<27 server listening (binary)
<28 send buffer was 57344, now 2097152
<32 send buffer was 57344, now 2097152
<31 server listening (udp)
<28 server listening (udp)
<30 server listening (udp)
<29 server listening (udp)
<35 server listening (udp)
<32 server listening (udp)
<34 server listening (udp)
<33 server listening (udp)
<36 new binary client connection.
<36 connection closed.
Invalid value for binding protocol: http
 -- should be one of auto, binary, or ascii
t/00-startup.t ....... 17/18 Number of threads must be greater than 0
t/00-startup.t ....... ok     
t/64bit.t ............ ok   
t/binary-get.t ....... ok   
t/binary-sasl.t ...... This server is not built with SASL support.
t/binary-sasl.t ...... ok   
t/binary.t ........... 1371/3639 
#   Failed test at t/binary.t line 422.
#          got: '1024'
#     expected: '0'
t/binary.t ........... 3621/3639 # Looks like you failed 1 test of 3639.
t/binary.t ........... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/3639 subtests 
t/bogus-commands.t ... ok   
t/cas.t .............. ok     
t/daemonize.t ........ ok   
t/dash-M.t ........... ok       
t/evictions.t ........ ok     
t/expirations.t ...... ok     
t/flags.t ............ ok   
t/flush-all.t ........ ok     
t/getset.t ........... ok       
t/incrdecr.t ......... ok     
t/issue_104.t ........ ok   
t/issue_108.t ........ ok   
t/issue_14.t ......... ok     
t/issue_140.t ........ skipped: Fix for Issue 140 was only an illusion
t/issue_152.t ........ ok   
t/issue_163.t ........ ok   
t/issue_183.t ........ ok   
t/issue_192.t ........ ok   
t/issue_22.t ......... ok     
t/issue_260.t ........ skipped: Only possible to test #260 under artificial conditions
t/issue_29.t ......... ok   
t/issue_3.t .......... ok   
t/issue_41.t ......... ok       
t/issue_42.t ......... ok     
t/issue_50.t ......... ok   
t/issue_61.t ......... ok   
t/issue_67.t ......... ok     
t/issue_68.t ......... ok       
t/issue_70.t ......... ok   
t/item_size_max.t .... 1/7 Item max size cannot be less than 1024 bytes.
t/item_size_max.t .... 2/7 Cannot set item size limit higher than 128 mb.
t/item_size_max.t .... 3/7 WARNING: Setting item max size above 1MB is not 
recommended!
 Raising this limit increases the minimum memory requirements
 and will decrease your memory efficiency.
t/item_size_max.t .... ok   
t/line-lengths.t ..... ok   
t/lru-crawler.t ...... ok       
t/lru-maintainer.t ... 11/224 
#   Failed test 'moved some items to cold'
#   at t/lru-maintainer.t line 33.
#          got: '0'
#     expected: anything else
# Looks like you failed 1 test of 224.
t/lru-maintainer.t ... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/224 subtests 
t/lru.t .............. ok       
t/maxconns.t ......... ok     
t/multiversioning.t .. ok     
t/noreply.t .......... ok   
t/refhang.t .......... ok       
t/slabs_reassign.t ... ok       
t/stats-conns.t ...... ok     
t/stats-detail.t ..... ok     
t/stats.t ............ 1/97 
#   Failed test at t/stats.t line 178.
#          got: '1024'
#     expected: '0'
# Looks like you failed 1 test of 97.
t/stats.t ............ Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/97 subtests 
t/touch.t ............ ok   
t/udp.t .............. ok     
t/unixsocket.t ....... ok   
t/whitespace.t ....... fatal: Not a git repository (or any parent up to mount point /home/dam)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
t/whitespace.t ....... skipped: Skipping tests probably because you don't have git.

Test Summary Report
-------------------
t/binary.t         (Wstat: 256 Tests: 3639 Failed: 1)
  Failed test:  1625
  Non-zero exit status: 1
t/lru-maintainer.t (Wstat: 256 Tests: 224 Failed: 1)
  Failed test:  43
  Non-zero exit status: 1
t/stats.t          (Wstat: 256 Tests: 97 Failed: 1)
  Failed test:  74
  Non-zero exit status: 1
Files=51, Tests=7496, 192 wallclock secs ( 3.70 usr  0.61 sys + 14.37 cusr  6.69 csys = 25.37 CPU)
Result: FAIL

Original comment by honkma...@googlemail.com on 20 Apr 2015 at 1:26

GoogleCodeExporter commented 8 years ago
I'm seeing this on Arch Linux as well.

This is similar to an earlier issue I have seen, in that I think it might have to do with hardware lock elision:
https://groups.google.com/d/msg/memcached/Tw6t_W-a6Xc/lXgz8LQ_vS0J

It doesn't lock up on my much older x86_64 desktop machine, but it does lock up on our build box. gdb output of the hung process during the binary_prependq test is below. Notice Thread 3 especially.

$ sudo gdb -p 10999
GNU gdb (GDB) 7.9
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 10999
Reading symbols from /home/dan/svn-packages/memcached/trunk/src/memcached-1.4.23/memcached-debug...done.
Reading symbols from /usr/lib/libevent-2.0.so.5...(no debugging symbols found)...done.
Reading symbols from /usr/lib/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 11004]
[New LWP 11003]
[New LWP 11002]
[New LWP 11001]
[New LWP 11000]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Reading symbols from /usr/lib/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
0x00007fad9dec6883 in epoll_wait () from /usr/lib/libc.so.6
(gdb) thread apply all backtrace

Thread 6 (Thread 0x7fad9dddd700 (LWP 11000)):
#0  0x00007fad9dec6883 in epoll_wait () from /usr/lib/libc.so.6
#1  0x00007fad9e3c34d8 in ?? () from /usr/lib/libevent-2.0.so.5
#2  0x00007fad9e3ae61a in event_base_loop () from /usr/lib/libevent-2.0.so.5
#3  0x000000000041add4 in worker_libevent (arg=0x1f99480) at thread.c:379
#4  0x00007fad9e188374 in start_thread () from /usr/lib/libpthread.so.0
#5  0x00007fad9dec627d in clone () from /usr/lib/libc.so.6

Thread 5 (Thread 0x7fad9d5dc700 (LWP 11001)):
#0  0x00007fad9dec6883 in epoll_wait () from /usr/lib/libc.so.6
#1  0x00007fad9e3c34d8 in ?? () from /usr/lib/libevent-2.0.so.5
#2  0x00007fad9e3ae61a in event_base_loop () from /usr/lib/libevent-2.0.so.5
#3  0x000000000041add4 in worker_libevent (arg=0x1f9a5c8) at thread.c:379
#4  0x00007fad9e188374 in start_thread () from /usr/lib/libpthread.so.0
#5  0x00007fad9dec627d in clone () from /usr/lib/libc.so.6

Thread 4 (Thread 0x7fad9cddb700 (LWP 11002)):
#0  0x00007fad9dec6883 in epoll_wait () from /usr/lib/libc.so.6
#1  0x00007fad9e3c34d8 in ?? () from /usr/lib/libevent-2.0.so.5
#2  0x00007fad9e3ae61a in event_base_loop () from /usr/lib/libevent-2.0.so.5
#3  0x000000000041add4 in worker_libevent (arg=0x1f9b710) at thread.c:379
#4  0x00007fad9e188374 in start_thread () from /usr/lib/libpthread.so.0
#5  0x00007fad9dec627d in clone () from /usr/lib/libc.so.6

Thread 3 (Thread 0x7fad9c5da700 (LWP 11003)):
#0  0x00007fad9e19064c in __lll_lock_wait () from /usr/lib/libpthread.so.0
#1  0x00007fad9e193090 in __lll_lock_elision () from /usr/lib/libpthread.so.0
#2  0x00000000004183c9 in item_stats_totals (add_stats=add_stats@entry=0x4067d0 <append_stats>, c=c@entry=0x7fad940008c0) at items.c:506
#3  0x0000000000414c57 in get_stats (stat_type=stat_type@entry=0x0, nkey=nkey@entry=0, add_stats=add_stats@entry=0x4067d0 <append_stats>, c=c@entry=0x7fad940008c0) at slabs.c:309
#4  0x000000000040cee0 in process_bin_stat (c=0x7fad940008c0) at memcached.c:1514
#5  complete_nread_binary (c=0x7fad940008c0) at memcached.c:2247
#6  complete_nread (c=c@entry=0x7fad940008c0) at memcached.c:2293
#7  0x00000000004101d0 in drive_machine (c=0x7fad940008c0) at memcached.c:4179
#8  event_handler (fd=<optimized out>, which=<optimized out>, arg=0x7fad940008c0) at memcached.c:4386
#9  0x00007fad9e3aeca6 in event_base_loop () from /usr/lib/libevent-2.0.so.5
#10 0x000000000041add4 in worker_libevent (arg=0x1f9c858) at thread.c:379
#11 0x00007fad9e188374 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007fad9dec627d in clone () from /usr/lib/libc.so.6

Thread 2 (Thread 0x7fad9bdd9700 (LWP 11004)):
#0  0x00007fad9e18d9af in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x000000000041a510 in assoc_maintenance_thread (arg=<optimized out>) at assoc.c:261
#2  0x00007fad9e188374 in start_thread () from /usr/lib/libpthread.so.0
#3  0x00007fad9dec627d in clone () from /usr/lib/libc.so.6

Thread 1 (Thread 0x7fad9e7f2700 (LWP 10999)):
#0  0x00007fad9dec6883 in epoll_wait () from /usr/lib/libc.so.6
#1  0x00007fad9e3c34d8 in ?? () from /usr/lib/libevent-2.0.so.5
#2  0x00007fad9e3ae61a in event_base_loop () from /usr/lib/libevent-2.0.so.5
#3  0x0000000000404eea in main (argc=0, argv=0x7fff82d49fd5) at memcached.c:5724

ok 47 - binary_prependq
Timeout.. killing the process
testapp: testapp.c:725: safe_recv: Assertion `nr != 0' failed.
Makefile:1482: recipe for target 'test' failed
make: *** [test] Aborted

processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model       : 60
model name  : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
stepping    : 3
microcode   : 0x12
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat 
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp 
lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc 
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 
fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer 
aes xsave avx f16c rdrand lahf_lm abm ida arat pln pts dtherm tpr_shadow vnmi 
flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid 
rtm xsaveopt

Original comment by dpmc...@gmail.com on 23 Apr 2015 at 1:54

GoogleCodeExporter commented 8 years ago
Can you folks please try https://github.com/memcached/memcached (master branch) and ping this issue if it repairs your problem completely? All tests should run.

Original comment by dorma...@rydia.net on 24 Apr 2015 at 7:05

GoogleCodeExporter commented 8 years ago
I ran the same build/test/package steps that triggered the above, pulling the one patch (369845f086) from git and applying it, and we seem to be in business. All tests pass, no hangups or timeouts anymore.

Original comment by dpmc...@gmail.com on 24 Apr 2015 at 2:41

GoogleCodeExporter commented 8 years ago
I also just patched the latest commit (369845f086) into the tarball.

The test suite now passes on sparcv9 cleanly, the only issue on amd64 left is

t/lru-maintainer.t ... 11/224 
#   Failed test 'moved some items to cold'
#   at t/lru-maintainer.t line 33.
#          got: '0'
#     expected: anything else
t/lru-maintainer.t ... 195/224 # Looks like you failed 1 test of 224.
t/lru-maintainer.t ... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/224 subtests 

Original comment by honkma...@googlemail.com on 24 Apr 2015 at 3:55

GoogleCodeExporter commented 8 years ago
BTW, what happened to the buildfarm?
  https://code.google.com/p/memcached/wiki/BuildFarm
The waterfall display is stuck for me.

As I already run a buildbot instance on our farm, I can gladly provide some Solaris slaves:
  http://buildfarm.opencsw.org/buildbot/waterfall

Original comment by honkma...@googlemail.com on 24 Apr 2015 at 3:57

GoogleCodeExporter commented 8 years ago
The buildfarm's been dead a long while... it requires quite a lot of time to keep maintained.

honkman42: Is that failing a test under Solaris amd64? Can you attach the output of "prove -v t/lru-maintainer.t", please? Does it fail every time or just sometimes?

Original comment by dorma...@rydia.net on 24 Apr 2015 at 6:06

GoogleCodeExporter commented 8 years ago
Another push to master for a workaround in the test. Solaris seems to not handle the background juggler's sleeps in the same way. I'll look into it more, but I need to cut another release since the off-by-one is pretty serious.

Original comment by dorma...@rydia.net on 25 Apr 2015 at 8:03

GoogleCodeExporter commented 8 years ago
1.4.24 is out. Please try it if you haven't already.

Original comment by dorma...@rydia.net on 25 Apr 2015 at 9:02

GoogleCodeExporter commented 8 years ago
Yes, 1.4.24 looks good; all tests pass on Solaris 10 Sparc + x86, both in 32 and 64 bit.

I also added memcached 'master' to my Solaris buildbot, with some more options enabled:
  https://buildfarm.opencsw.org/buildbot/waterfall?category=memcached
Now t/whitespace.t is failing sporadically, as you can see from the above logs. Should I open a separate bug report for that?

Also, I noticed that some tests failed when I ran the 32- and 64-bit tests in parallel; I guess I need to serialize them per host?

If considered useful, I could also enable notifications on build failure via mail or IRC.

Thanks! -- Dago

Original comment by honkma...@googlemail.com on 27 Apr 2015 at 8:39

GoogleCodeExporter commented 8 years ago
Please open other issues for those things. You can have it e-mail me, I guess? I don't want it to spam a public place just yet.

Original comment by dorma...@rydia.net on 27 Apr 2015 at 8:54

GoogleCodeExporter commented 8 years ago
Sure, I added your email to the list of notifiers. Just let me know if you need 
adjusted settings.

Original comment by honkma...@googlemail.com on 27 Apr 2015 at 8:59