jepsen-io / maelstrom

A workbench for writing toy implementations of distributed systems.
Eclipse Public License 1.0
3.05k stars 189 forks source link

Kafka workflow crashes with NullPointerException #79

Open enigma opened 1 year ago

enigma commented 1 year ago

If maelstrom gets a binary that ignores all the kafka commands, for example like this modified demo/clojure/echo.clj, then maelstrom fails with a NullPointerException rather than some useful errors.

For other workflows, eg broadcast/g-counter, maelstroms fails with some timeout exceptions that are more understandable.

Is this the correct way to fail for the kafka workflow?

$ lein run -- test -w kafka --bin demo/clojure/lenient_echo.clj # see gist
[...]
INFO [2023-09-13 10:24:40,229] jepsen node n1 - maelstrom.net Shutting down Maelstrom network
INFO [2023-09-13 10:24:40,233] jepsen test runner - jepsen.core Analyzing...
INFO [2023-09-13 10:24:40,294] clojure-agent-send-off-pool-5 - jepsen.tests.kafka Wrote /Users/ndr/coding/maelstrom-src/maelstrom/store/kafka/20230913T102325.079+0100/consume-counts.edn
WARN [2023-09-13 10:24:40,295] clojure-agent-send-off-pool-5 - jepsen.checker Error while checking history:
java.lang.NullPointerException: Cannot invoke "java.lang.Number.doubleValue()" because "x" is null
        at clojure.lang.Numbers.divide(Numbers.java:3899)
        at jepsen.util$nanos__GT_secs.invokeStatic(util.clj:386)
        at jepsen.util$nanos__GT_secs.invoke(util.clj:386)
        at clojure.core$update.invokeStatic(core.clj:6231)
        at clojure.core$update.invoke(core.clj:6223)
        at jepsen.tests.kafka$checker$reify__19272.check(kafka.clj:2080)
        at jepsen.checker$check_safe.invokeStatic(checker.clj:86)
        at jepsen.checker$check_safe.invoke(checker.clj:79)
        at jepsen.checker$compose$reify__10709$fn__10711.invoke(checker.clj:102)
        at clojure.core$pmap$fn__8552$fn__8553.invoke(core.clj:7089)
        at clojure.core$binding_conveyor_fn$fn__5823.invoke(core.clj:2047)
        at clojure.lang.AFn.call(AFn.java:18)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1623)
INFO [2023-09-13 10:24:40,428] jepsen test runner - jepsen.core Analysis complete
INFO [2023-09-13 10:24:40,434] jepsen results - jepsen.store Wrote /Users/ndr/coding/maelstrom-src/maelstrom/store/kafka/20230913T102325.079+0100/results.edn
INFO [2023-09-13 10:24:40,456] jepsen test runner - jepsen.core {:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? false,
         :count 67,
         :ok-count 0,
         :fail-count 11,
         :info-count 56,
         :by-f {:assign {:valid? false,
                         :count 11,
                         :ok-count 0,
                         :fail-count 11,
                         :info-count 0},
                :crash {:valid? false,
                        :count 7,
                        :ok-count 0,
                        :fail-count 0,
                        :info-count 7},
                :poll {:valid? false,
                       :count 20,
                       :ok-count 0,
                       :fail-count 0,
                       :info-count 20},
                :send {:valid? false,
                       :count 29,
                       :ok-count 0,
                       :fail-count 0,
                       :info-count 29}}},
 :availability {:valid? true, :ok-fraction 0.0},
 :net {:all {:send-count 70,
             :recv-count 70,
             :msg-count 70,
             :msgs-per-op 1.0447761},
       :clients {:send-count 70, :recv-count 70, :msg-count 70},
       :servers {:send-count 0,
                 :recv-count 0,
                 :msg-count 0,
                 :msgs-per-op 0.0},
       :valid? true},
 :workload {:valid? :unknown,
            :error "java.lang.NullPointerException: Cannot invoke \"java.lang.Number.doubleValue()\" because \"x\" is null\n at clojure.lang.Numbers.divide (Numbers.java:3899)\n    jepsen.util$nanos__GT_secs.invokeStatic (util.clj:386)\n    jepsen.util$nanos__GT_secs.invoke (util.clj:386)\n    clojure.core$update.invokeStatic (core.clj:6231)\n    clojure.core$update.invoke (core.clj:6223)\n    jepsen.tests.kafka$checker$reify__19272.check (kafka.clj:2080)\n    jepsen.checker$check_safe.invokeStatic (checker.clj:86)\n    jepsen.checker$check_safe.invoke (checker.clj:79)\n    jepsen.checker$compose$reify__10709$fn__10711.invoke (checker.clj:102)\n    clojure.core$pmap$fn__8552$fn__8553.invoke (core.clj:7089)\n    clojure.core$binding_conveyor_fn$fn__5823.invoke (core.clj:2047)\n    clojure.lang.AFn.call (AFn.java:18)\n    java.util.concurrent.FutureTask.run (FutureTask.java:317)\n    java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1144)\n    java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:642)\n    java.lang.Thread.run (Thread.java:1623)\n"},
 :valid? false}

Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
aphyr commented 1 year ago

That's probably best understood as a checker bug--I'll bet you that part of the checker is trying to find e.g. the time between the first and last successful ops, and if there aren't any, it blows up, or something like that. Feel free to dig into the Kafka checker in Jepsen if you'd like to fix this one--I imagine it should be a simple check to make sure at least a handful of required ops actually exist.On Sep 13, 2023 05:37, Andrea Ambu @.***> wrote: If maelstrom gets a binary that ignores all the kafka commands, for example like this modified demo/clojure/echo.clj, then maelstrom fails with a NullPointerException rather than some useful errors. For other workflows, eg broadcast/g-counter, maelstroms fails with some timeout exceptions that are more understandable. Is this the correct way to fail for the kafka workflow? $ lein run -- test -w kafka --bin demo/clojure/lenient_echo.clj # see gist [...] INFO [2023-09-13 10:24:40,229] jepsen node n1 - maelstrom.net Shutting down Maelstrom network INFO [2023-09-13 10:24:40,233] jepsen test runner - jepsen.core Analyzing... INFO [2023-09-13 10:24:40,294] clojure-agent-send-off-pool-5 - jepsen.tests.kafka Wrote /Users/ndr/coding/maelstrom-src/maelstrom/store/kafka/20230913T102325.079+0100/consume-counts.edn WARN [2023-09-13 10:24:40,295] clojure-agent-send-off-pool-5 - jepsen.checker Error while checking history: java.lang.NullPointerException: Cannot invoke "java.lang.Number.doubleValue()" because "x" is null at clojure.lang.Numbers.divide(Numbers.java:3899) at jepsen.util$nanosGT_secs.invokeStatic(util.clj:386) at jepsen.util$nanos__GT_secs.invoke(util.clj:386) at clojure.core$update.invokeStatic(core.clj:6231) at clojure.core$update.invoke(core.clj:6223) at jepsen.tests.kafka$checker$reify19272.check(kafka.clj:2080) at jepsen.checker$check_safe.invokeStatic(checker.clj:86) at jepsen.checker$check_safe.invoke(checker.clj:79) at jepsen.checker$compose$reify10709$fn10711.invoke(checker.clj:102) at clojure.core$pmap$fn8552$fn8553.invoke(core.clj:7089) at clojure.core$binding_conveyor_fn$fn5823.invoke(core.clj:2047) at clojure.lang.AFn.call(AFn.java:18) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) at java.base/java.lang.Thread.run(Thread.java:1623) INFO [2023-09-13 10:24:40,428] jepsen test runner - jepsen.core Analysis complete INFO [2023-09-13 10:24:40,434] jepsen results - jepsen.store Wrote /Users/ndr/coding/maelstrom-src/maelstrom/store/kafka/20230913T102325.079+0100/results.edn INFO [2023-09-13 10:24:40,456] jepsen test runner - jepsen.core {:perf {:latency-graph {:valid? true}, :rate-graph {:valid? true}, :valid? true}, :timeline {:valid? true}, :exceptions {:valid? true}, :stats {:valid? false, :count 67, :ok-count 0, :fail-count 11, :info-count 56, :by-f {:assign {:valid? false, :count 11, :ok-count 0, :fail-count 11, :info-count 0}, :crash {:valid? false, :count 7, :ok-count 0, :fail-count 0, :info-count 7}, :poll {:valid? false, :count 20, :ok-count 0, :fail-count 0, :info-count 20}, :send {:valid? false, :count 29, :ok-count 0, :fail-count 0, :info-count 29}}}, :availability {:valid? true, :ok-fraction 0.0}, :net {:all {:send-count 70, :recv-count 70, :msg-count 70, :msgs-per-op 1.0447761}, :clients {:send-count 70, :recv-count 70, :msg-count 70}, :servers {:send-count 0, :recv-count 0, :msg-count 0, :msgs-per-op 0.0}, :valid? true}, :workload {:valid? :unknown, :error "java.lang.NullPointerException: Cannot invoke \"java.lang.Number.doubleValue()\" because \"x\" is null\n at clojure.lang.Numbers.divide (Numbers.java:3899)\n jepsen.util$nanosGT_secs.invokeStatic (util.clj:386)\n jepsen.util$nanos__GT_secs.invoke (util.clj:386)\n clojure.core$update.invokeStatic (core.clj:6231)\n clojure.core$update.invoke (core.clj:6223)\n jepsen.tests.kafka$checker$reify19272.check (kafka.clj:2080)\n jepsen.checker$check_safe.invokeStatic (checker.clj:86)\n jepsen.checker$check_safe.invoke (checker.clj:79)\n jepsen.checker$compose$reify10709$fn10711.invoke (checker.clj:102)\n clojure.core$pmap$fn8552$fn8553.invoke (core.clj:7089)\n clojure.core$binding_conveyor_fn$fn5823.invoke (core.clj:2047)\n clojure.lang.AFn.call (AFn.java:18)\n java.util.concurrent.FutureTask.run (FutureTask.java:317)\n java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1144)\n java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:642)\n java.lang.Thread.run (Thread.java:1623)\n"}, :valid? false}

Analysis invalid! (ノಥ益ಥ)ノ ┻━┻

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you are subscribed to this thread.Message ID: @.***>

enigma commented 1 year ago

Thanks, I've shared a tentative fix at https://github.com/jepsen-io/jepsen/pull/580 - comments welcome!