jepsen-io / jepsen

A framework for distributed systems verification, with fault injection
6.82k stars 717 forks source link

rabbitmq total-queue checker IllegalArgumentException #282

Open dsandubb opened 6 years ago

dsandubb commented 6 years ago

When running the RabbitMQ queue test, the total-queue checker gets an java.lang.IllegalArgumentException.

java.lang.IllegalArgumentException: Don't know how to create ISeq from: clojure.lang.Keyword
        at clojure.lang.RT.seqFrom(RT.java:550) ~[clojure-1.9.0.jar:na]
        at clojure.lang.RT.seq(RT.java:530) ~[clojure-1.9.0.jar:na]
        at clojure.core$seq__5124.invokeStatic(core.clj:137) ~[clojure-1.9.0.jar:na]
        at clojure.core$map$fn__5587.invoke(core.clj:2738) ~[clojure-1.9.0.jar:na]
        at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.9.0.jar:na]
        at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.9.0.jar:na]
        at clojure.lang.RT.seq(RT.java:528) ~[clojure-1.9.0.jar:na]
        at clojure.core$seq__5124.invokeStatic(core.clj:137) ~[clojure-1.9.0.jar:na]
        at clojure.core$apply.invokeStatic(core.clj:652) ~[clojure-1.9.0.jar:na]
        at clojure.core$mapcat.invokeStatic(core.clj:2775) ~[clojure-1.9.0.jar:na]
        at clojure.core$mapcat.doInvoke(core.clj:2775) ~[clojure-1.9.0.jar:na]
        at clojure.lang.RestFn.invoke(RestFn.java:423) ~[clojure-1.9.0.jar:na]
        at jepsen.checker$expand_queue_drain_ops$fn__3997.invoke(checker.clj:231) ~[jepsen-0.1.9.jar:na]
        at clojure.lang.PersistentVector.reduce(PersistentVector.java:341) ~[clojure-1.9.0.jar:na]
        at clojure.core$reduce.invokeStatic(core.clj:6747) ~[clojure-1.9.0.jar:na]
        at clojure.core$reduce.invoke(core.clj:6730) ~[clojure-1.9.0.jar:na]
        at jepsen.checker$expand_queue_drain_ops.invokeStatic(checker.clj:221) ~[jepsen-0.1.9.jar:na]
        at jepsen.checker$expand_queue_drain_ops.invoke(checker.clj:216) ~[jepsen-0.1.9.jar:na]
        at jepsen.checker$total_queue$reify__4005.check(checker.clj:257) ~[jepsen-0.1.9.jar:na]
        at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9.jar:na]
        at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9.jar:na]
        at jepsen.checker$compose$reify__3952$fn__3954.invoke(checker.clj:87) [jepsen-0.1.9.jar:na]
        at clojure.core$pmap$fn__8105$fn__8106.invoke(core.clj:6942) [clojure-1.9.0.jar:na]
        at clojure.core$binding_conveyor_fn$fn__5476.invoke(core.clj:2022) [clojure-1.9.0.jar:na]
        at clojure.lang.AFn.call(AFn.java:18) [clojure-1.9.0.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171]

Using a previous version of the checker (commit eb796cfcc204c592545965968bd28ad1e6b2eff0) works perfectly, however. Could you please look into that? Thanks a lot.

aphyr commented 6 years ago

Hi Denisa, I'm moving right now and don't have a lot of bandwidth for OSS stuff, but this souuunds like it's something to do with the way the new queue checkers expand drain operations? It's been several years since I was in this code so I don't recall exactly how that works :(

dsandubb commented 6 years ago

Oh, you're right. I looked into it for a bit. Looks like the issue comes from this line, as mapcat would like to map that on an array, to my understanding. After running the tests with that line changed to [(:value op)], I get something like:

:total-queue
 {:valid? false,
  :lost
  #{1425 1391 1405 1441 1414 1406 1408 1435 1396 1393 1394 1402 1410
    1452 1431 1400 1403 1384 1386 1423 1389 1407 1412 1420 1421 1387
    1418 2976 1397 1382 1446 1415 1449 1434},
  [...]
  :unexpected #{:exhausted :exhausted :exhausted :exhausted},
  [...]}

(:value op) holds :exhausted, but I'm not exactly sure whether or not that's the point of the code, because to my understanding, a drain operation should, at some point, lead to the queue being exhausted.

aphyr commented 6 years ago

Right... Is there a chance you're trying to upgrade those tests to work with a newer version of Jepsen? The API around queue checking changed a while ago to simplify draining queues, and that would create the kinds of compatibility issues you're finding right now. If I recall, drains used to be implemented by having the client do a bunch of dequeues and directly modifying the test history, adding in new synthetic dequeue operations. It'd then return :exhausted as the value for the dequeue op, which was ignored by the checker. The new checker (I think this is in the docstring maybe?) doesn't make the client do all that work--it can return a list of dequeued elements as the value of the dequeue operation directly.

aphyr commented 6 years ago

For an example of how drains work now, see https://github.com/jepsen-io/jepsen/blob/master/hazelcast/src/jepsen/hazelcast.clj#L230-L234. Contrast that with https://github.com/jepsen-io/jepsen/blob/master/rabbitmq/src/jepsen/rabbitmq.clj#L171-L180, which has to generate a sequence of synthetic dequeue invocations & completions, and call conj-op! to insert them into the history directly. If you're upgrading to a recent Jepsen, I think you can cut all that code out and replace it with a drain like hazelcast's test does.