jepsen-io / jepsen

A framework for distributed systems verification, with fault injection

Test failures: is this expected? #128

Closed garyxia closed 6 years ago

garyxia commented 8 years ago

After switching to Oracle Java 8, Jepsen starts to run! However, it seems to fail eventually. The ZooKeeper test either runs overnight without stopping or dies after running out of memory.

The ZooKeeper test produced no report.


clojure.main.main (main.java:37)

Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
 at java.util.concurrent.FutureTask.report (FutureTask.java:122)
    java.util.concurrent.FutureTask.get (FutureTask.java:192)
    clojure.core$deref_future.invoke (core.clj:2186)
    clojure.core$future_call$reify6736.deref (core.clj:6683)
    clojure.core$deref.invoke (core.clj:2206)
    clojure.core$pmap$step6749$fn6751.invoke (core.clj:6733)
    clojure.lang.LazySeq.sval (LazySeq.java:40)
    clojure.lang.LazySeq.seq (LazySeq.java:49)
    clojure.lang.Cons.next (Cons.java:39)
    clojure.lang.RT.next (RT.java:674)
    clojure.core/next (core.clj:64)
    clojure.core$concat$cat4217$fn4218.invoke (core.clj:707)
    clojure.lang.LazySeq.sval (LazySeq.java:40)
    clojure.lang.LazySeq.seq (LazySeq.java:49)
    clojure.lang.ChunkedCons.chunkedNext (ChunkedCons.java:59)
    clojure.lang.ChunkedCons.next (ChunkedCons.java:43)
    clojure.lang.RT.next (RT.java:674)
    clojure.core/next (core.clj:64)
    clojure.core.protocols$naive_seq_reduce.invoke (protocols.clj:65)
    clojure.core.protocols$interface_or_naive_reduce.invoke (protocols.clj:73)
    clojure.core.protocols/fn (protocols.clj:171)
    clojure.core.protocols$fn6478$G6473__6487.invoke (protocols.clj:19)
    clojure.core.protocols$seq_reduce.invoke (protocols.clj:31)
    clojure.core.protocols/fn (protocols.clj:101)
    clojure.core.protocols$fn6452$G64476465.invoke (protocols.clj:13)
    clojure.core$reduce.invoke (core.clj:6519)
    knossos.linear$step.invoke (linear.clj:251)
    clojure.core$partial$fn4529.invoke (core.clj:2501)
    clojure.lang.PersistentVector.reduce (PersistentVector.java:333)
    clojure.core$reduce.invoke (core.clj:6518)
    knossos.linear$analysis.invoke (linear.clj:312)
    jepsen.checker$reify6560.check (checker.clj:53)
    jepsen.checker$compose$reify6600$fn6602.invoke (checker.clj:256)
    clojure.core$pmap$fn6744$fn6745.invoke (core.clj:6729)
    clojure.core$binding_conveyor_fn$fn__4444.invoke (core.clj:1916)
    clojure.lang.AFn.call (AFn.java:18)
    java.util.concurrent.FutureTask.run (FutureTask.java:266)
    java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
    java.lang.Thread.run (Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
 at [empty stack trace]

lein test :only jepsen.zookeeper-test/zk-test

FAIL in (zk-test) (zookeeper_test.clj:7)
expected: (:valid? (:results (jepsen/run! (zk/zk-test "3.4.5+dfsg-2"))))
  actual: false

Ran 1 tests containing 1 assertions.
1 failures, 0 errors.
Tests failed.

garyxia commented 8 years ago

The Aerospike test ran better but still failed. It generated a report directory.

I've attached the report files here. GitHub won't accept history.edn, so I renamed it to history.txt.

My question is: is the test supposed to 'fail', meaning that it found problems in the database? Or did the test itself really fail? The printout is very long, so I'll paste it in the next message.

counter.txt linearizability.txt history.txt

garyxia commented 8 years ago

The output is too long, so I saved it as a text file.

My take is that the test found some inconsistency and that the test itself is OK. Can anyone confirm? Thanks!

output.txt

aphyr commented 8 years ago

Linearizability verification is computationally expensive. I run these tests on a 48-way Xeon with 128GB of RAM. However, you can usually tune tests to be less expensive by shortening the test's time limit or increasing the delay between operations.
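
To see why those two knobs help, here is a rough back-of-envelope sketch in plain Clojure. It is not Jepsen code: `estimated-ops`, the map keys, and the numbers in `baseline` are all hypothetical, and it assumes each worker issues one operation, waits for it to complete, then sleeps for the configured delay. The point is only that the history handed to the checker shrinks roughly in proportion to the time limit and to the per-operation delay.

```clojure
;; Illustrative estimate only; every name and number here is an assumption.
(defn estimated-ops
  "Approximate number of operations recorded in a history, assuming each of
  `concurrency` workers repeatedly performs one op (taking `latency-ms`) and
  then sleeps `delay-ms` until `time-limit-s` seconds have elapsed."
  [{:keys [time-limit-s concurrency delay-ms latency-ms]}]
  (quot (* time-limit-s 1000 concurrency)
        (+ delay-ms latency-ms)))

;; Hypothetical starting point: 5 workers, 5 minutes, 100 ms between ops.
(def baseline
  {:time-limit-s 300 :concurrency 5 :delay-ms 100 :latency-ms 10})

(estimated-ops baseline)                          ;; => 13636
(estimated-ops (assoc baseline :time-limit-s 60)) ;; => 2727
(estimated-ops (assoc baseline :delay-ms 1000))   ;; => 1485
```

In a real test these settings would live in the test's generator, typically via `jepsen.generator` functions such as `time-limit` and `stagger` or `delay`; check the names against the Jepsen version you have installed.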

aphyr commented 8 years ago

Tests which include {:valid? false} indicate that Jepsen found an invalid behavior in the system. In one sense, these tests are successful because they found a bug. In another sense, they fail because the system is not correct.
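
As a minimal sketch of that distinction, the helper below reads the `:valid?` key from a results map, the same key the `zk-test` assertion above checks. The function name `verdict` and its messages are hypothetical and not part of Jepsen; the `nil` branch is an assumption covering runs like the ZooKeeper one, where the checker crashed before producing a result.

```clojure
;; Hypothetical helper, not part of Jepsen: turns a results map into a
;; human-readable verdict.
(defn verdict
  [results]
  (case (:valid? results)
    true  "checker found no invalid behavior in this run"
    false "Jepsen found invalid behavior in the system under test"
    ;; Anything else (e.g. nil when the checker ran out of memory).
    "no verdict: the checker did not produce a result"))

(verdict {:valid? false})
;; => "Jepsen found invalid behavior in the system under test"
(verdict {})
;; => "no verdict: the checker did not produce a result"
```

Read this way, the Aerospike run is a working test reporting a problem in the database, while the ZooKeeper run never reached a verdict at all.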