Stuck test suite & Jenkins' inability to recover

dumbbell commented 9 years ago

Note: I'm filing the issue here because the Java client is involved in the stuck test. I don't know yet what's going on, and I don't have the time to study this right now.

The culprit is a timed-out or aborted Jenkins build: Jenkins is unable to kill all the processes involved (and doesn't notice the problem). It then starts new builds, which try to "lock" the node, fail to do so, and retry forever, eventually exhausting their stack and segfaulting.
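
To illustrate the "unable to kill all involved processes" part, here is a minimal Python sketch of one common way to make a timeout take down a whole process tree: run the build step in its own session and signal the process group rather than only the direct child. This is not how Jenkins or the scripts in this thread work; the function name `run_with_group_timeout` and its parameters are made up for illustration.

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd, timeout_s):
    """Run cmd in its own session; on timeout, kill the whole process group.

    Returns the exit code, or None if the command timed out and was killed.
    (Hypothetical helper, not part of Jenkins or the RabbitMQ test scripts.)
    """
    # start_new_session=True makes the child call setsid(), so it becomes the
    # leader of a new process group that its descendants inherit by default.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # Signal the group (pgid == pid of the leader), not just the child,
            # so grandchildren such as `make` or `sh -c` are killed as well.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already disappeared
        proc.wait()  # reap the direct child
        return None
```

One caveat: a process that starts its own session or process group (a daemonized process, for instance) is no longer in the group and would survive this kill, so a cleanup step for such processes would still be needed.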

An example is this aborted build: http://rabbit-ci.lon.pivotallabs.com:8080/job/RabbitMQ%20Server/3734/

Followed by this build which segfaults: http://rabbit-ci.lon.pivotallabs.com:8080/job/xref%20%28plugins%20individually%29/4019/

Here are the stuck processes still running on the Jenkins slave:

jenkins  20497  0.0  0.0  12880  2688 ?        S    May19   1:46 /opt/erlang/r16b03/lib/erlang/erts-5.10.4/bin/epmd -daemon
jenkins  23906  0.0  0.0   4176   580 ?        S    05:44   0:00 sh -c make run-background-node > /tmp/rabbitmq-hare-mnesia/startup_log 2> /tmp/rabbitmq-hare-mnesia/startup_err
jenkins  23909  0.0  0.0  10640  1824 ?        S    05:44   0:00 make run-background-node
jenkins  24097  0.0  0.0   4176   580 ?        S    05:44   0:00 /bin/sh -c RABBITMQ_NODE_IP_ADDRESS="0.0.0.0" RABBITMQ_NODE_PORT="5673" RABBITMQ_LOG_BASE="/tmp" RABBITMQ_MNESIA_DIR="/tmp/rabbitmq-hare-mnesia" RABBITMQ_PLUGINS_EXPAND_DIR="/tmp/rabbitmq-hare-plugins-scratch" \ ?RABBITMQ_NODE_ONLY=true \ ?RABBITMQ_SERVER_START_ARGS="-rabbit ssl_listeners [{\"0.0.0.0\",5670}] -rabbit ssl_options [{cacertfile,\"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/testca/cacert.pem\"},{certfile,\"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/server/cert.pem\"},{keyfile,\"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/server/key.pem\"},{verify_code,1}] -rabbit auth_mechanisms ['PLAIN','AMQPLAIN','EXTERNAL','RABBIT-CR-DEMO']" \ ?./scripts/rabbitmq-server
jenkins  24098  0.6  1.0 192300 41704 ?        Sl   05:44   2:16 /opt/erlang/r16b03/lib/erlang/erts-5.10.4/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /opt/erlang/r16b03/lib/erlang -progname erl -- -home /var/lib/jenkins -- -pa ./scripts/../ebin -noshell -noinput -sname hare -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -rabbit tcp_listeners [{"0.0.0.0",5673}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/tmp/hare.log"} -rabbit sasl_error_logger {file,"/tmp/hare-sasl.log"} -rabbit enabled_plugins_file "/does-not-exist" -rabbit plugins_dir "./scripts/../plugins" -rabbit plugins_expand_dir "/tmp/rabbitmq-hare-plugins-scratch" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/tmp/rabbitmq-hare-mnesia" -rabbit ssl_listeners [{"0.0.0.0",5670}] -rabbit ssl_options [{cacertfile,"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/testca/cacert.pem"},{certfile,"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/server/cert.pem"},{keyfile,"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/server/key.pem"},{verify_code,1}] -rabbit auth_mechanisms ['PLAIN','AMQPLAIN','EXTERNAL','RABBIT-CR-DEMO'] -kernel inet_dist_listen_min 25673 -kernel inet_dist_listen_max 25673
root     24250  0.1  2.0 178356 82980 ?        Ss   02:30   0:44 /usr/bin/ruby1.8 /usr/bin/puppet agent
jenkins  24337  0.0  0.0  10796   368 ?        Ss   05:44   0:00 inet_gethost 4
jenkins  24338  0.0  0.0  17100   804 ?        S    05:44   0:00 inet_gethost 4
root     24363  0.0  0.0  71260  3528 ?        Ss   05:45   0:00 sshd: jenkins [priv]
jenkins  24365  0.0  0.0  71916  2232 ?        S    05:45   0:01 sshd: jenkins@notty
jenkins  24387  0.0  0.0  10752  1212 ?        Ss   05:45   0:00 bash -c cd "/var/lib/jenkins" && java  -jar slave.jar
jenkins  24388  0.0  2.0 1568768 83180 ?       Sl   05:45   0:15 java -jar slave.jar
jenkins  25757  0.0  0.0   4180   576 ?        S    05:37   0:00 /bin/sh -xe /tmp/hudson8511670185586124855.sh
jenkins  25759  0.0  0.0   4180   684 ?        S    05:37   0:00 /bin/sh -e /usr/local/bin/run-server-tests
jenkins  25766  0.0  0.0   4180   432 ?        S    05:37   0:00 /bin/sh -e /usr/local/bin/run-server-tests
jenkins  26320  0.0  0.0  10368  1336 ?        S    05:39   0:00 make COVER=false all
ntp      26452  0.0  0.0  34864  2100 ?        Ss   Apr13   2:49 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -c /var/lib/ntp/ntp.conf.dhcp -u 107:114
jenkins  26537  0.0  0.0   4176   576 ?        S    05:39   0:00 /bin/sh -c OK=true && \ make prepare && \ { make -C ../rabbitmq-server run-tests || { OK=false; echo '\n============' '\nTESTS FAILED' '\n============\n'; } } && \ { make run-qpid-testsuite || { OK=false; echo '\n============' '\nTESTS FAILED' '\n============\n'; } } && \ { ( cd ../rabbitmq-java-client && MAKE=make ant test-suite ) || { OK=false; echo '\n============' '\nTESTS FAILED' '\n============\n'; } } && \ make cleanup && { $OK || echo '\n============' '\nTESTS FAILED' '\n============\n'; } && $OK
daemon   26683  0.0  0.0  16672   152 ?        Ss   Mar31   0:00 /usr/sbin/atd
root     26796  0.0  0.0  49932  1236 ?        Ss   Mar31   0:00 /usr/sbin/sshd
root     26798  0.0  0.0      0     0 ?        S    May18   2:21 [kworker/1:1]
jenkins  29828  0.0  0.0   4176   572 ?        S    05:40   0:00 sh -c make run-background-node > /tmp/rabbitmq-rabbit-mnesia/startup_log 2> /tmp/rabbitmq-rabbit-mnesia/startup_err
jenkins  29831  0.0  0.0  10640  1824 ?        S    05:40   0:00 make run-background-node
jenkins  30019  0.0  0.0   4176   576 ?        S    05:40   0:00 /bin/sh -c RABBITMQ_NODE_IP_ADDRESS="0.0.0.0" RABBITMQ_NODE_PORT="5672" RABBITMQ_LOG_BASE="/tmp" RABBITMQ_MNESIA_DIR="/tmp/rabbitmq-rabbit-mnesia" RABBITMQ_PLUGINS_EXPAND_DIR="/tmp/rabbitmq-rabbit-plugins-scratch" \ ?RABBITMQ_NODE_ONLY=true \ ?RABBITMQ_SERVER_START_ARGS="-rabbit ssl_listeners [{\"0.0.0.0\",5671}] -rabbit ssl_options [{cacertfile,\"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/testca/cacert.pem\"},{certfile,\"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/server/cert.pem\"},{keyfile,\"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/server/key.pem\"},{verify_code,1}] -rabbit auth_mechanisms ['PLAIN','AMQPLAIN','EXTERNAL','RABBIT-CR-DEMO']" \ ?./scripts/rabbitmq-server
jenkins  30020  0.7  1.1 261436 45724 ?        Sl   05:40   2:34 /opt/erlang/r16b03/lib/erlang/erts-5.10.4/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /opt/erlang/r16b03/lib/erlang -progname erl -- -home /var/lib/jenkins -- -pa ./scripts/../ebin -noshell -noinput -sname rabbit@rabbit-ci-slave1 -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -rabbit tcp_listeners [{"0.0.0.0",5672}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/tmp/rabbit@rabbit-ci-slave1.log"} -rabbit sasl_error_logger {file,"/tmp/rabbit@rabbit-ci-slave1-sasl.log"} -rabbit enabled_plugins_file "/does-not-exist" -rabbit plugins_dir "./scripts/../plugins" -rabbit plugins_expand_dir "/tmp/rabbitmq-rabbit-plugins-scratch" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/tmp/rabbitmq-rabbit-mnesia" -rabbit ssl_listeners [{"0.0.0.0",5671}] -rabbit ssl_options [{cacertfile,"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/testca/cacert.pem"},{certfile,"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/server/cert.pem"},{keyfile,"/tmp/test/rabbitmq-public-umbrella/rabbitmq-test/certs/server/key.pem"},{verify_code,1}] -rabbit auth_mechanisms ['PLAIN','AMQPLAIN','EXTERNAL','RABBIT-CR-DEMO'] -kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672
jenkins  30179  0.0  0.0  10796   364 ?        Ss   05:40   0:00 inet_gethost 4
jenkins  30180  0.0  0.0  17100   812 ?        S    05:40   0:00 inet_gethost 4
jenkins  30942  0.1  3.6 1318924 147060 ?      Sl   05:42   0:34 /opt/java/latest/bin/java -classpath /usr/share/ant/lib/ant-launcher.jar:/usr/share/java/xmlParserAPIs.jar:/usr/share/java/xercesImpl.jar -Dant.home=/usr/share/ant -Dant.library.dir=/usr/share/ant/lib org.apache.tools.ant.launch.Launcher -cp  test-suite
jenkins  31312  0.0  0.8 1311756 35004 ?       Sl   05:42   0:10 /opt/java/jdk1.6.0_26/jre/bin/java -classpath /tmp/test/rabbitmq-public-umbrella/rabbitmq-java-client/lib/commons-cli-1.1.jar:/tmp/test/rabbitmq-public-umbrella/rabbitmq-java-client/lib/commons-io-1.2.jar:/tmp/test/rabbitmq-public-umbrella/rabbitmq-java-client/lib/junit.jar:/tmp/test/rabbitmq-public-umbrella/rabbitmq-java-client/build/classes:/tmp/test/rabbitmq-public-umbrella/rabbitmq-java-client/build/test/classes:/usr/share/ant/lib/junit.jar:/usr/share/java/ant-launcher-1.8.2.jar:/usr/share/ant/lib/ant.jar:/usr/share/ant/lib/ant-junit.jar:/usr/share/ant/lib/ant-junit4.jar org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner com.rabbitmq.client.test.server.ServerTests filtertrace=true haltOnError=false haltOnFailure=false formatter=org.apache.tools.ant.taskdefs.optional.junit.OutErrSummaryJUnitResultFormatter showoutput=false outputtoformatters=true logfailedtests=true logtestlistenerevents=false formatter=org.apache.tools.ant.taskdefs.optional.junit.PlainJUnitResultFormatter,/tmp/test/rabbitmq-public-umbrella/rabbitmq-java-client/build/TEST-com.rabbitmq.client.test.server.ServerTests.txt formatter=org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter,/tmp/test/rabbitmq-public-umbrella/rabbitmq-java-client/build/TEST-com.rabbitmq.client.test.server.ServerTests.xml crashfile=/tmp/test/rabbitmq-public-umbrella/rabbitmq-java-client/junitvmwatcher7096656096857623053.properties propsfile=/tmp/test/rabbitmq-public-umbrella/rabbitmq-java-client/junit6938551034434136649.properties

michaelklishin commented 9 years ago

All Java tests should have a timeout, which may be a good reason to upgrade to JUnit 4 soon.

michaelklishin commented 8 years ago

I haven't seen this in a while. Is this still relevant? Now that we are on JUnit 4, we should have more options with respect to how we enforce timeouts.

dumbbell commented 8 years ago

Our test suite can still leave running nodes behind after a failure. However, Jenkins no longer uses the looping wrapper script, except for the stable branch of the broker. So the root cause (nodes left running) is still relevant, even if the segfaults are rare now.
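
The wrapper script itself is not shown in this thread, so the following is only a sketch of the general idea of bounding a lock retry instead of looping forever. It uses a directory as the lock because `os.mkdir` fails atomically if the directory already exists; the name `acquire_lock`, the lock path, and the default values are assumptions for illustration.

```python
import os
import time


def acquire_lock(lock_dir, attempts=5, delay_s=0.1):
    """Try to take a directory-based lock, giving up after a fixed number of tries.

    Returns True if the lock was acquired, False otherwise.
    (Hypothetical helper; the real wrapper script may work differently.)
    """
    for _ in range(attempts):
        try:
            os.mkdir(lock_dir)  # atomic: raises if the directory already exists
            return True
        except FileExistsError:
            time.sleep(delay_s)  # someone else holds the lock; wait and retry
    return False
```

With a bounded loop like this, a build that cannot lock the node fails with a clear error instead of retrying until it crashes.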

michaelklishin commented 8 years ago

Chances are this was https://github.com/rabbitmq/rabbitmq-server/issues/465, so I'm closing this until we discover something specific to the Java test suites.

rabbitmq / rabbitmq-java-client

Stuck test suite & Jenkins' inability to recover #67