jepsen-io / jepsen

A framework for distributed systems verification, with fault injection
6.81k stars 719 forks source link

Generators executing sequentially #197

Closed korry8911 closed 7 years ago

korry8911 commented 7 years ago

Hi,

I am trying to run a test with the following generator:

:generator (->> (independent/concurrent-generator
                                          (:threads-per-doc opts)
                                          (range 1 (+ (:num-docs opts) 1))
                                          (fn [k]
                                              (->> (gen/mix [r w cas])
                                                   (gen/stagger 1/10)
                                                   (gen/limit 100))))
                                        (gen/nemesis
                                          (gen/seq (cycle [(gen/sleep (:nem-stop-time opts))
                                                           {:type :info, :f :start}
                                                           (gen/sleep (:nem-start-time opts))
                                                           {:type :info, :f :stop}])))
                                        (gen/time-limit (:time-limit opts)))

The idea is to have multiple threads r w and cas against multiple independent registers while the nemesis is doing its thing concurrently. The problem I am running into is that with this setup, the r w and cas ops are all done before the nemesis runs:

INFO [2017-09-12 18:27:47,272] jepsen worker 0 - jepsen.util 0  :invoke :read   [1 nil]
INFO [2017-09-12 18:27:47,274] jepsen worker 0 - jepsen.util 0  :ok :read   [1 4]
INFO [2017-09-12 18:27:47,274] jepsen worker 0 - jepsen.core Worker 0 done
INFO [2017-09-12 18:27:47,274] jepsen test runner - jepsen.core Waiting for nemesis to complete
INFO [2017-09-12 18:27:51,643] jepsen nemesis - jepsen.util :nemesis    :info   :start  nil
INFO [2017-09-12 18:27:53,149] jepsen nemesis - jepsen.util :nemesis    :info   :start  "Cut off {\"n5\" #{\"n2\" \"n1\" \"n3\"}, \"n6\" #{\"n2\" \"n1\" \"n3\"}, \"n4\" #{\"n2\" \"n1\" \"n3\"}, \"n2\" #{\"n5\" \"n6\" \"n4\"}, \"n1\" #{\"n5\" \"n6\" \"n4\"}, \"n3\" #{\"n5\" \"n6\" \"n4\"}}"

INFO [2017-09-12 18:28:03,150] jepsen nemesis - jepsen.util :nemesis    :info   :stop   nil
INFO [2017-09-12 18:28:03,365] jepsen nemesis - jepsen.util :nemesis    :info   :stop   "fully connected"
INFO [2017-09-12 18:28:13,366] jepsen nemesis - jepsen.util :nemesis    :info   :start  nil
INFO [2017-09-12 18:28:14,797] jepsen nemesis - jepsen.util :nemesis    :info   :start  "Cut off {\"n2\" #{\"n5\" \"n6\" \"n4\"}, \"n1\" #{\"n5\" \"n6\" \"n4\"}, \"n3\" #{\"n5\" \"n6\" \"n4\"}, \"n5\" #{\"n2\" \"n1\" \"n3\"}, \"n6\" #{\"n2\" \"n1\" \"n3\"}, \"n4\" #{\"n2\" \"n1\" \"n3\"}}"
INFO [2017-09-12 18:28:24,797] jepsen nemesis - jepsen.util :nemesis    :info   :stop   nil
INFO [2017-09-12 18:28:25,012] jepsen nemesis - jepsen.util :nemesis    :info   :stop   "fully connected"
INFO [2017-09-12 18:28:35,012] jepsen nemesis - jepsen.util :nemesis    :info   :start  nil
INFO [2017-09-12 18:28:36,359] jepsen nemesis - jepsen.util :nemesis    :info   :start  "Cut off {\"n2\" #{\"n6\" \"n1\" \"n3\"}, \"n5\" #{\"n6\" \"n1\" \"n3\"}, \"n4\" #{\"n6\" \"n1\" \"n3\"}, \"n6\" #{\"n2\" \"n5\" \"n4\"}, \"n1\" #{\"n2\" \"n5\" \"n4\"}, \"n3\" #{\"n2\" \"n5\" \"n4\"}}"
INFO [2017-09-12 18:28:46,360] jepsen nemesis - jepsen.util :nemesis    :info   :stop   nil
INFO [2017-09-12 18:28:46,572] jepsen nemesis - jepsen.util :nemesis    :info   :stop   "fully connected"
INFO [2017-09-12 18:28:46,572] jepsen nemesis - jepsen.core nemesis done
INFO [2017-09-12 18:28:46,572] jepsen test runner - jepsen.core nemesis done.
INFO [2017-09-12 18:28:46,572] jepsen test runner - jepsen.core Tearing down nemesis
INFO [2017-09-12 18:28:46,785] jepsen test runner - jepsen.core Nemesis torn down

I'm probably creating the generators in the wrong way, but interested to know if you see what the issue is.

aphyr commented 7 years ago

This... looks ok to me. It's a bit weird that you're using both a time limit and a fixed number of keys for concurrent-generator though. Did you want to stop running tests before the time limit expires? Possible you're running out of ops and keys from concurrent-generator before the nemesis starts?On Sep 12, 2017 19:56, Korrigan Clark notifications@github.com wrote:Hi, I am trying to run a test with the following generator: :generator (->> (independent/concurrent-generator (:threads-per-doc opts) (range 1 (+ (:num-docs opts) 1)) (fn [k] (->> (gen/mix [r w cas]) (gen/stagger 1/10) (gen/limit 100)))) (gen/nemesis (gen/seq (cycle [(gen/sleep (:nem-stop-time opts)) {:type :info, :f :start} (gen/sleep (:nem-start-time opts)) {:type :info, :f :stop}]))) (gen/time-limit (:time-limit opts)))

The idea is to have multiple threads r w and cas against multiple independent registers while the nemesis is doing its thing concurrently. The problem I am running into is that with this setup, the r w and cas ops are all done before the nemesis runs: INFO [2017-09-12 18:27:47,272] jepsen worker 0 - jepsen.util 0 :invoke :read [1 nil] INFO [2017-09-12 18:27:47,274] jepsen worker 0 - jepsen.util 0 :ok :read [1 4] INFO [2017-09-12 18:27:47,274] jepsen worker 0 - jepsen.core Worker 0 done INFO [2017-09-12 18:27:47,274] jepsen test runner - jepsen.core Waiting for nemesis to complete INFO [2017-09-12 18:27:51,643] jepsen nemesis - jepsen.util :nemesis :info :start nil INFO [2017-09-12 18:27:53,149] jepsen nemesis - jepsen.util :nemesis :info :start "Cut off {\"n5\" #{\"n2\" \"n1\" \"n3\"}, \"n6\" #{\"n2\" \"n1\" \"n3\"}, \"n4\" #{\"n2\" \"n1\" \"n3\"}, \"n2\" #{\"n5\" \"n6\" \"n4\"}, \"n1\" #{\"n5\" \"n6\" \"n4\"}, \"n3\" #{\"n5\" \"n6\" \"n4\"}}"

INFO [2017-09-12 18:28:03,150] jepsen nemesis - jepsen.util :nemesis :info :stop nil INFO [2017-09-12 18:28:03,365] jepsen nemesis - jepsen.util :nemesis :info :stop "fully connected" INFO [2017-09-12 18:28:13,366] jepsen nemesis - jepsen.util :nemesis :info :start nil INFO [2017-09-12 18:28:14,797] jepsen nemesis - jepsen.util :nemesis :info :start "Cut off {\"n2\" #{\"n5\" \"n6\" \"n4\"}, \"n1\" #{\"n5\" \"n6\" \"n4\"}, \"n3\" #{\"n5\" \"n6\" \"n4\"}, \"n5\" #{\"n2\" \"n1\" \"n3\"}, \"n6\" #{\"n2\" \"n1\" \"n3\"}, \"n4\" #{\"n2\" \"n1\" \"n3\"}}" INFO [2017-09-12 18:28:24,797] jepsen nemesis - jepsen.util :nemesis :info :stop nil INFO [2017-09-12 18:28:25,012] jepsen nemesis - jepsen.util :nemesis :info :stop "fully connected" INFO [2017-09-12 18:28:35,012] jepsen nemesis - jepsen.util :nemesis :info :start nil INFO [2017-09-12 18:28:36,359] jepsen nemesis - jepsen.util :nemesis :info :start "Cut off {\"n2\" #{\"n6\" \"n1\" \"n3\"}, \"n5\" #{\"n6\" \"n1\" \"n3\"}, \"n4\" #{\"n6\" \"n1\" \"n3\"}, \"n6\" #{\"n2\" \"n5\" \"n4\"}, \"n1\" #{\"n2\" \"n5\" \"n4\"}, \"n3\" #{\"n2\" \"n5\" \"n4\"}}" INFO [2017-09-12 18:28:46,360] jepsen nemesis - jepsen.util :nemesis :info :stop nil INFO [2017-09-12 18:28:46,572] jepsen nemesis - jepsen.util :nemesis :info :stop "fully connected" INFO [2017-09-12 18:28:46,572] jepsen nemesis - jepsen.core nemesis done INFO [2017-09-12 18:28:46,572] jepsen test runner - jepsen.core nemesis done. INFO [2017-09-12 18:28:46,572] jepsen test runner - jepsen.core Tearing down nemesis INFO [2017-09-12 18:28:46,785] jepsen test runner - jepsen.core Nemesis torn down

I'm probably creating the generators is the wrong way, but interested to know if you see what the issue is.

—You are receiving this because you are subscribed to this thread.Reply to this email directly, view it on GitHub, or mute the thread.

korry8911 commented 7 years ago

I want the test to run for the whole time-limit, but I see that gen/limit might deplete before the time-limit is reached and before the nemesis actually starts. I will try without the gen/limit and a shorter nem-start/stop-time. I want to keep the keys finite since having a infinite sequence of keys isn't really necessary for me (at least as far as I can tell).

"It's a bit weird that you're using both a time limit and a fixed number of keys " : Why would this been weird together? I don't really see the relation between the two. Do you mean fixed number of ops?

Thanks for the quick response, going to try this, will follow up shortly...

korry8911 commented 7 years ago

So I got the desired behaviour by remove the gen/limit, silly mistake by me, the generators were being depleted before the time-limit was up and before the nemesis' first invocation was happening:

:generator (->> (independent/concurrent-generator
                                        (:threads-per-doc opts)
                                        (range 1 (+ (:num-docs opts) 1))
                                        (fn [k]
                                            (->> (gen/mix [r w cas])
                                                 (gen/stagger 1/10))))
                                      (gen/nemesis
                                        (gen/seq (cycle [(gen/sleep (:nem-stop-time opts))
                                                         {:type :info, :f :start}
                                                         (gen/sleep (:nem-start-time opts))
                                                         {:type :info, :f :stop}])))
                                      (gen/time-limit (:time-limit opts)))

But now my issue is this:

ERROR [2017-09-12 21:58:47,499] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Cannot write (AWrapper #knossos.model.CASRegister{:value 4}) as tag null
    at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.8.0_131]
    at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[na:1.8.0_131]
    at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na]
    at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na]
    at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na]

I remember seeing this once before, but can't remember how I fixed it. This is my checker and model:

:model     (model/cas-register)
                      :checker   (checker/compose
                                   {:perf     (checker/perf)
                                    :timeline (timeline/html)
                                    :indep (independent/checker
                                             (checker/compose
                                               {:timeline (timeline/html)
                                                :linear   (checker/linearizable :competition)}))})

Does this have to do with the model not having a start value?

Thanks again for the help.

korry8911 commented 7 years ago

OK so that last error I got was caused by the perf checker being within the top-level checker/compose. I also tried to move the perf check to within the indepent/checker's checker/compose and got the following error:

WARN [2017-09-13 19:56:55,736] clojure-agent-send-off-pool-4 - jepsen.checker Error while checking history:
java.lang.ClassCastException: clojure.lang.LazySeq cannot be cast to clojure.lang.Reversible
    at clojure.core$rseq.invokeStatic(core.clj:1532) ~[clojure-1.8.0.jar:na]
    at clojure.core$rseq.invoke(core.clj:1532) ~[clojure-1.8.0.jar:na]
    at jepsen.checker.perf$nemesis_intervals.invokeStatic(perf.clj:175) ~[jepsen-0.1.6-SNAPSHOT.jar:na]
    at jepsen.checker.perf$nemesis_intervals.invoke(perf.clj:169) ~[jepsen-0.1.6-SNAPSHOT.jar:na]
    at jepsen.checker.perf$nemesis_regions.invokeStatic(perf.clj:196) ~[jepsen-0.1.6-SNAPSHOT.jar:na]
    at jepsen.checker.perf$nemesis_regions.invoke(perf.clj:190) ~[jepsen-0.1.6-SNAPSHOT.jar:na]
    at jepsen.checker.perf$rate_graph_BANG_.invokeStatic(perf.clj:319) ~[jepsen-0.1.6-SNAPSHOT.jar:na]
    at jepsen.checker.perf$rate_graph_BANG_.invoke(perf.clj:294) ~[jepsen-0.1.6-SNAPSHOT.jar:na]
    at jepsen.checker$rate_graph$reify__3217.check(checker.clj:404) ~[jepsen-0.1.6-SNAPSHOT.jar:na]
    at jepsen.checker$check_safe.invokeStatic(checker.clj:70) [jepsen-0.1.6-SNAPSHOT.jar:na]
    at jepsen.checker$check_safe.invoke(checker.clj:63) [jepsen-0.1.6-SNAPSHOT.jar:na]
    at jepsen.checker$compose$reify__3207$fn__3209.invoke(checker.clj:386) [jepsen-0.1.6-SNAPSHOT.jar:na]
    at clojure.core$pmap$fn__6970$fn__6971.invoke(core.clj:6736) [clojure-1.8.0.jar:na]
    at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) [clojure-1.8.0.jar:na]
    at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]

Not sure exactly whats going on, but it seems like independent/checker and the perf checker don't play nice together. This is the only way I could get the checkers to work, no perf checker:

:checker   (checker/compose {:timeline (timeline/html)
                                                   :indep (independent/checker
                                                            (checker/compose {:timeline (timeline/html)
                                                                              :linear   (checker/linearizable)}))})

Perf checker works fine in other cases where there is no independent checker.

Closing the issue since its more of an FYI now.

Thanks again.