jepsen-io / knossos

Verifies the linearizability of experimentally accessible histories.
398 stars 31 forks source link

Assertion errors from Java #27

Open itamararjuan opened 5 years ago

itamararjuan commented 5 years ago

Hi @aphyr - Thank you for the amazing framework you've supplied us with to test out our databases I have a working repo with Jepsen testing out our Blockchain Go Implementation at Orbs and have encountered something weird with Knossos verifying history and firing many Java Assertion Errors like the following attached here

The weird thing is, the history checks appears to be valid and is quite confusing for me to know whether the test has passed or not due to these errors. jepsen.log.zip

Would love your feedback. I am attaching here a small fragment of what comes out from stdout:

WARN [2018-11-06 13:46:47,969] clojure-agent-send-pool-2 - jepsen.checker Error while checking history: java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value nil, :process 9, :time 682238600, :index 10} i at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.8.0_181] at java.util.concurrent.FutureTask.get(FutureTask.java:192) [na:1.8.0_181] at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na] at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na] at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na] at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na] at clojure.core$map$fn__4785.invoke(core.clj:2646) ~[clojure-1.8.0.jar:na] at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na] at clojure.lang.LazySeq.seq(LazySeq.java:56) ~[clojure-1.8.0.jar:na] at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na] at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na] at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na] at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na] at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na] at jepsen.checker$compose$reify__3952.check(checker.clj:88) ~[jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.independent$checker$reify__2302$fn__2303.invoke(independent.clj:272) [na:na] at dom_top.core$bounded_pmap$launcher__218$fn__219.invoke(core.clj:129) [jepsen-0.1.9-SNAPSHOT.jar:na] at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) [clojure-1.8.0.jar:na] at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_181] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_181] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181] Caused by: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value nil, :process 9, :time 682238600, :index 10} i at knossos.history$complete_fold_op.invokeStatic(history.clj:187) ~[knossos-0.3.3.jar:na] at knossos.history$complete_fold_op.invoke(history.clj:160) ~[knossos-0.3.3.jar:na] at clojure.lang.ArrayChunk.reduce(ArrayChunk.java:58) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6750.invokeStatic(protocols.clj:136) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6750.invoke(protocols.clj:124) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6710$G__6705__6719.invoke(protocols.clj:19) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:31) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na] at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na] at clojure.core$reduce.invoke(core.clj:6527) ~[clojure-1.8.0.jar:na] at knossos.history$complete.invokeStatic(history.clj:241) ~[knossos-0.3.3.jar:na] at knossos.history$complete.invoke(history.clj:221) ~[knossos-0.3.3.jar:na] at jepsen.checker.timeline$html$reify__2386.check(timeline.clj:163) ~[na:na] at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$compose$reify__3952$fn__3954.invoke(checker.clj:87) ~[jepsen-0.1.9-SNAPSHOT.jar:na] at clojure.core$pmap$fn__6970$fn__6971.invoke(core.clj:6736) ~[clojure-1.8.0.jar:na] ... 6 common frames omitted

I also am attaching the full log of the running

Thanks and hope to hear from you soon! -- Itamar Arjuan

aphyr commented 5 years ago

Hi itamar, and thank you. It sounds like your history might be malformed; a process should never complete an operation without having invoked it first. I think knossos probably should abort in this scenario... Might be a bug if it doesn't. I'd take a careful look at the history to see what process 9 was up to before that op.--KyleOn Nov 7, 2018 03:29, Itamar Arjuan notifications@github.com wrote:Hi @aphyr - Thank you for the amazing framework you've supplied us with to test out our databases I have a working repo with Jepsen testing out our Blockchain Go Implementation at Orbs and have encountered something weird with Knossos verifying history and firing many Java Assertion Errors like the following attached here The weird thing is, the history checks appears to be valid and is quite confusing for me to know whether the test has passed or not due to these errors. jepsen.log.zip Would love your feedback. I am attaching here a small fragment of what comes out from stdout: WARN [2018-11-06 13:46:47,969] clojure-agent-send-pool-2 - jepsen.checker Error while checking history: java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value nil, :process 9, :time 682238600, :index 10} i at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.8.0_181] at java.util.concurrent.FutureTask.get(FutureTask.java:192) [na:1.8.0_181] at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na] at clojure.core$future_call$reify6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na] at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na] at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na] at clojure.core$map$fn4785.invoke(core.clj:2646) ~[clojure-1.8.0.jar:na] at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na] at clojure.lang.LazySeq.seq(LazySeq.java:56) ~[clojure-1.8.0.jar:na] at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na] at clojure.core$seq4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn6684$G66796697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na] at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na] at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na] at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na] at jepsen.checker$compose$reify3952.check(checker.clj:88) ~[jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.independent$checker$reify2302$fn2303.invoke(independent.clj:272) [na:na] at dom_top.core$bounded_pmap$launcher218$fn219.invoke(core.clj:129) [jepsen-0.1.9-SNAPSHOT.jar:na] at clojure.core$binding_conveyor_fn$fn4676.invoke(core.clj:1938) [clojure-1.8.0.jar:na] at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_181] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_181] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181] Caused by: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value nil, :process 9, :time 682238600, :index 10} i at knossos.history$complete_fold_op.invokeStatic(history.clj:187) ~[knossos-0.3.3.jar:na] at knossos.history$complete_fold_op.invoke(history.clj:160) ~[knossos-0.3.3.jar:na] at clojure.lang.ArrayChunk.reduce(ArrayChunk.java:58) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn6750.invokeStatic(protocols.clj:136) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn6750.invoke(protocols.clj:124) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn6710$G67056719.invoke(protocols.clj:19) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:31) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn6684$G66796697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na] at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na] at clojure.core$reduce.invoke(core.clj:6527) ~[clojure-1.8.0.jar:na] at knossos.history$complete.invokeStatic(history.clj:241) ~[knossos-0.3.3.jar:na] at knossos.history$complete.invoke(history.clj:221) ~[knossos-0.3.3.jar:na] at jepsen.checker.timeline$html$reify2386.check(timeline.clj:163) ~[na:na] at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$compose$reify3952$fn3954.invoke(checker.clj:87) ~[jepsen-0.1.9-SNAPSHOT.jar:na] at clojure.core$pmap$fn6970$fn__6971.invoke(core.clj:6736) ~[clojure-1.8.0.jar:na] ... 6 common frames omitted I also am attaching the full log of the running Thanks and hope to hear from you soon! -- Itamar Arjuan

—You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or mute the thread.

itamararjuan commented 5 years ago

Hi @aphyr , I recorded a smaller running, and staggered the requests a bit less dense and wound up with 2 such history assertion fails. but here's the thing, the process that knossos claimed was providing :ok op without an :invoke was perfect when following the log line by line.

Each :invoke had a corresponding :ok or :fail and still it said the same thing. Also something here is quite bizarre in my eyes since the :invoke op is not something that my code adds into knossos , it's added by your code and is implicit to me.

The only part which is up to the code that is written into the test is the reify implementation of the client 's invoke! method which is responsible for the response part of the history record.

So I can't really think of a way that my code would of generated such a scenario where the implicit insertion to history of the :invoke part would be missing from the process.

Regardless of this: following the claimed process's history came up with it being perfectly symetric in terms of the ratio between invokes and ok(s)/fail(s)

Is there any reason in your eyes as to why knossoswould flag the history analysis as valid after these violations? which is also a weird part of this whole situation.

Thanks for feedback! -- Itamar

itamararjuan commented 5 years ago

For example, this is what the history.edn file includes at the complaint-about index:

{:type :invoke, :f :read, :value [1 nil], :process 1, :time 47689039400, :index 622} {:type :ok, :f :read, :value 0, :process 1, :time 47716834600, :index 623}

while the error that the checker throw is the following one:

INFO [2018-11-07 12:23:54,831] main - jepsen.core {:perf {:latency-graph {:valid? true}, :rate-graph {:valid? true}, :valid? true}, :indep {:valid? :unknown, :results {0 {:valid? :unknown, :error "java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value 0, :process 1, :time 47716834600, :index 623}\ni\n at java.util.concurrent.FutureTask.report (FutureTask.java:122)\n java.util.concurrent.FutureTask.get (FutureTask.java:192)\n clojure.core$deref_future.invokeStatic (core.clj:2208)\n clojure.core$future_call$reify__6962.deref (core.clj:6688)\n clojure.core$deref.invokeStatic (core.clj:2228)\n clojure.core$deref.invoke (core.clj:2214)\n clojure.core$map$fn__4785.invoke (core.clj:2646)\n clojure.lang.LazySeq.sval (LazySeq.java:40)\n clojure.lang.LazySeq.seq (LazySeq.java:56)\n clojure.lang.RT.seq (RT.java:521)\n clojure.core$seq__4357.invokeStatic (core.clj:137)\n clojure.core.protocols$seq_reduce.invokeStatic (protocols.clj:24)\n clojure.core.protocols$fn__6738.invokeStatic (protocols.clj:75)\n clojure.core.protocols/fn (protocols.clj:75)\n clojure.core.protocols$fn__6684$G__6679__6697.invoke (protocols.clj:13)\n clojure.core$reduce.invokeStatic (core.clj:6545)\n clojure.core$into.invokeStatic (core.clj:6610)\n clojure.core$into.invoke (core.clj:6604)\n jepsen.checker$compose$reify__3952.check (checker.clj:88)\n jepsen.checker$check_safe.invokeStatic (checker.clj:71)\n jepsen.checker$check_safe.invoke (checker.clj:64)\n jepsen.independent$checker$reify__2302$fn__2303.invoke (independent.clj:272)\n dom_top.core$bounded_pmap$launcher__218$fn__219.invoke (core.clj:129)\n clojure.core$binding_conveyor_fn$fn__4676.invoke (core.clj:1938)\n clojure.lang.AFn.call (AFn.java:18)\n java.util.concurrent.FutureTask.run (FutureTask.java:266)\n java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)\n java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)\n java.lang.Thread.run (Thread.java:748)\nCaused by: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value 0, :process 1, :time 47716834600, :index 623}\ni\n at knossos.history$complete_fold_op.invokeStatic (history.clj:187)\n knossos.history$complete_fold_op.invoke (history.clj:160)\n clojure.lang.ArrayChunk.reduce (ArrayChunk.java:63)\n clojure.core.protocols$fn__6750.invokeStatic (protocols.clj:136)\n clojure.core.protocols/fn (protocols.clj:124)\n clojure.core.protocols$fn__6710$G__6705__6719.invoke (protocols.clj:19)\n clojure.core.protocols$seq_reduce.invokeStatic (protocols.clj:31)\n clojure.core.protocols$fn__6738.invokeStatic (protocols.clj:75)\n clojure.core.protocols/fn (protocols.clj:75)\n clojure.core.protocols$fn__6684$G__6679__6697.invoke (protocols.clj:13)\n clojure.core$reduce.invokeStatic (core.clj:6545)\n clojure.core$reduce.invoke (core.clj:6527)\n knossos.history$complete.invokeStatic (history.clj:241)\n knossos.history$complete.invoke (history.clj:221)\n jepsen.checker.timeline$html$reify__2386.check (timeline.clj:163)\n jepsen.checker$check_safe.invokeStatic (checker.clj:71)\n jepsen.checker$check_safe.invoke (checker.clj:64)\n jepsen.checker$compose$reify__3952$fn__3954.invoke (checker.clj:87)\n clojure.core$pmap$fn__6970$fn__6971.invoke (core.clj:6736)\n clojure.core$binding_conveyor_fn$fn__4676.invoke (core.clj:1938)\n clojure.lang.AFn.call (AFn.java:18)\n java.util.concurrent.FutureTask.run (FutureTask.java:266)\n java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)\n java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)\n java.lang.Thread.run (Thread.java:748)\n"},

itamararjuan commented 5 years ago

it appears that the expected history value is 1 while 0 was returned and that is what bothers knossos rather than the :invoke missing... or am I missing something ? :-)

-- Itamar

aphyr commented 5 years ago

I'm really not sure! This could be a bug in complete-fold-op.... Might be worth dropping some logging statements in there and watching the state machine evolve.Also note that this comes from the timeline checker, which happens to use some knossos code for history preprocessing--just in case you're trying to trace where the inputs happen here.

aphyr commented 5 years ago

Hmmm. This still feels like a bug in knossos, but now that I'm not on my phone and can read properly, I notice that your invoke and complete ops have different shapes--the value for the invoke is [1 nil], but the returned value is 0. That feels like you might have a client that isn't constructing appropriate responses--like maybe it's taking [key value] pairs and forgetting to construct [key value] completions, instead just returning values. I don't thiiink that should break this particular knossos codepath, though.

itamararjuan commented 5 years ago

Indeed my client invoke! method returns the register's value as is after casting to int of course (I based my test on your etcd example from scratch)

I wasn't aware of the fact I need to return an array or anything else So I guess I'll have a look at what your etcd Clojure client code returns in the original test to see I'm aligned with that.

My client code looks like so

`(defn client "A client for a single compare-and-set register" [conn] (reify client/Client (open! [_ test node] (client node))

(invoke! [this test op]
  (case (:f op)
    :read (assoc op :type :ok, :value (gamma-cli-read-singular conn))
    :write (do (gamma-cli-write-singular-value (:value op) conn (get op :process))
               (assoc op :type, :ok))

    :cas (try+
          (let [[old new] (:value op)]
            (assoc op :type (if (gamma-cli-singular-cas old new conn (get op :process))
                              :ok
                              :fail)))
          (catch [:errorCode 100] ex
            (assoc op :type :fail, :error :not-found)))))

; If our connection were stateful, we'd close it here.
(close! [_ _])

(setup! [_ _])
(teardown! [_ _])))`
craigpastro commented 5 years ago

Hi @aphyr and @itamararjuan. I am having the exact same problem and was wondering if what I have helps. I get the following exception

java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: #knossos.op.Op{:process 6, :type :ok, :f :read, :value nil, :index 2, :time 3420418100}

corresponding to this clipped history

:type :invoke, :f :read, :value [6 nil], :process 6, :time 3159622400, :index 0}
{:type :invoke, :f :read, :value [3 nil], :process 3, :time 3168123700, :index 1}
{:type :ok, :f :read, :value nil, :process 6, :time 3420418100, :index 2}
{:type :ok, :f :read, :value nil, :process 3, :time 3420418100, :index 3}
...

My read function is

(defn r [_ _] {:type :invoke, :f :read, :value nil})

and in my client if I try to read a key which doesn't exist I do a (assoc op :type :ok, :value nil) which seems to be the cause of the exception. If I turn them into (assoc op :type :fail, :error :nil) or something, then everything seems to be okay.

I felt that (assoc op :type :ok, :value nil) should be okay based on the Jepsen tutorial, but should I be turning these into fail type?

Thanks for any help!

Oh, by the way, I am using the following checker:

:checker    (checker/linearizable {:model     (model/cas-register)
                                   :algorithm :linear})
aphyr commented 5 years ago

Hi @siyopao. This error message means that process 6's completion didn't have a matching prior invocation. If you look at process 6's operations in that history, they are:

{:type :invoke, :f :read, :value [6 nil], :process 6, :time 3159622400, :index 0}
{:type :ok, :f :read, :value nil, :process 6, :time 3420418100, :index 2}

In this history, process 6 invoked a read with value [6 nil], and completed that read with value nil. Those values don't match, which is why Knossos is confused. You haven't included your generator or client here, so I can't say for sure, but I suspect you may be using an independent generator, which emits operations of the form [key value], but you haven't updated your client or checker to return and analyze [key value] tuples yet.

craigpastro commented 5 years ago

Hello @aphyr! Thank you for the super fast reply. Well now I feel silly. That is exactly what I missed doing in my client. Thank you for your help!