jepsen-io / jepsen

A framework for distributed systems verification, with fault injection
6.69k stars 710 forks source link

Test crashed when failure with :type :info #485

Closed yito88 closed 3 years ago

yito88 commented 3 years ago

I got an error from independent concurrent-generator when I'm migrating my tests from Jepsen 0.1.18 to 0.2.0.

WARN [2020-08-20 09:05:46,585] main - jepsen.core Test crashed!
clojure.lang.ExceptionInfo: Generator threw  #error {
 :cause Key must be integer
 :via
 [{:type java.lang.IllegalArgumentException
   :message Key must be integer
   :at [clojure.lang.APersistentVector assoc APersistentVector.java 347]}]
 :trace
 [[clojure.lang.APersistentVector assoc APersistentVector.java 347]
  [clojure.lang.APersistentVector assoc APersistentVector.java 18]
  [clojure.lang.RT assoc RT.java 827]
  [clojure.core$assoc__5416 invokeStatic core.clj 191]
  [clojure.core$update invokeStatic core.clj 6202]
  [clojure.core$update invoke core.clj 6188]
  [jepsen.independent.ConcurrentGenerator update independent.clj 210]
  [jepsen.generator.OnThreads update generator.clj 851]
  [jepsen.generator.OnThreads update generator.clj 851]
  [jepsen.generator.Any$updater__3853 invoke generator.clj 923]
  [clojure.core$mapv$fn__8445 invoke core.clj 6912]
  [clojure.lang.PersistentVector reduce PersistentVector.java 343]
  [clojure.core$reduce invokeStatic core.clj 6827]
  [clojure.core$mapv invokeStatic core.clj 6903]
  [clojure.core$mapv invoke core.clj 6903]
  [jepsen.generator.Any update generator.clj 923]
  [jepsen.generator.TimeLimit update generator.clj 1231]
  [jepsen.generator.FriendlyExceptions update generator.clj 675]
  [jepsen.generator.Validate update generator.clj 648]
  [jepsen.generator.interpreter$run_BANG_ invokeStatic interpreter.clj 232]
  [jepsen.generator.interpreter$run_BANG_ invoke interpreter.clj 180]
  [jepsen.core$run_case_BANG_ invokeStatic core.clj 220]
  [jepsen.core$run_case_BANG_ invoke core.clj 212]
  [jepsen.core$run_BANG_$fn__9722$fn__9729 invoke core.clj 345]
  [jepsen.core$run_BANG_$fn__9722 invoke core.clj 342]
  [jepsen.core$run_BANG_ invokeStatic core.clj 328]
  [jepsen.core$run_BANG_ invoke core.clj 254]
  [cassandra.runner$test_cmd$fn__2816 invoke runner.clj 108]
  [jepsen.cli$run_BANG_ invokeStatic cli.clj 316]
  [jepsen.cli$run_BANG_ invoke cli.clj 246]
  [cassandra.runner$_main invokeStatic runner.clj 120]
  [cassandra.runner$_main doInvoke runner.clj 118]
  [clojure.lang.RestFn invoke RestFn.java 703]
  [clojure.lang.Var invoke Var.java 447]
  [user$eval140 invokeStatic form-init4635373681402498005.clj 1]
  [user$eval140 invoke form-init4635373681402498005.clj 1]
  [clojure.lang.Compiler eval Compiler.java 7177]
  [clojure.lang.Compiler eval Compiler.java 7167]
  [clojure.lang.Compiler load Compiler.java 7636]
  [clojure.lang.Compiler loadFile Compiler.java 7574]
  [clojure.main$load_script invokeStatic main.clj 475]
  [clojure.main$init_opt invokeStatic main.clj 477]
  [clojure.main$init_opt invoke main.clj 477]
  [clojure.main$initialize invokeStatic main.clj 508]
  [clojure.main$null_opt invokeStatic main.clj 542]
  [clojure.main$null_opt invoke main.clj 539]
  [clojure.main$main invokeStatic main.clj 664]
  [clojure.main$main doInvoke main.clj 616]
  [clojure.lang.RestFn applyTo RestFn.java 137]
  [clojure.lang.Var applyTo Var.java 705]
  [clojure.main main main.java 40]]}
...
Context:

{:time 6236131371,
 :free-threads #object[io.lacuna.bifurcan.Set 0x59d5a6fd "{1}"],
 :workers {0 0, 1 4, 2 2, :nemesis :nemesis}}
Event:

{:type :info,
 :f :write,
 :value [0 1],
 :time 6236131371,
 :process 1,
 :error :write-timed-out}
...
Caused by: java.lang.IllegalArgumentException: Key must be integer
    at clojure.lang.APersistentVector.assoc(APersistentVector.java:347)
    at clojure.lang.APersistentVector.assoc(APersistentVector.java:18)
    at clojure.lang.RT.assoc(RT.java:827)
    at clojure.core$assoc__5416.invokeStatic(core.clj:191)
    at clojure.core$update.invokeStatic(core.clj:6202)
    at clojure.core$update.invoke(core.clj:6188)
    at jepsen.independent.ConcurrentGenerator.update(independent.clj:210)
    at jepsen.generator.OnThreads.update(generator.clj:851)
    at jepsen.generator.OnThreads.update(generator.clj:851)
    at jepsen.generator.Any$updater__3853.invoke(generator.clj:923)
    at clojure.core$mapv$fn__8445.invoke(core.clj:6912)
    at clojure.lang.PersistentVector.reduce(PersistentVector.java:343)
    at clojure.core$reduce.invokeStatic(core.clj:6827)
    at clojure.core$mapv.invokeStatic(core.clj:6903)
    at clojure.core$mapv.invoke(core.clj:6903)
    at jepsen.generator.Any.update(generator.clj:923)
    at jepsen.generator.TimeLimit.update(generator.clj:1231)
    at jepsen.generator.FriendlyExceptions.update(generator.clj:675)
    ... 33 common frames omitted

The error always happens when an operation fails and returns op with {:type :info}. I think :info is required when a test is not sure whether a write has succeeded or not.

In my investigation, each thread-process pair of ctx's worker is basically a pair of a number like {0 0, 1 1, 2 2} when concurrency 3. But, when {:type :info}, the pair has different number like {0 0, 1 4, 2 2}. I found the ctx's worker is updated when {:type :info} as below.

I guess {:type :info} can be used with 0.2.0 because it is used in other tests like scylla or stolen tests. How should I use {:type :info} for unknown writes?

https://github.com/jepsen-io/jepsen/blame/master/jepsen/src/jepsen/generator/interpreter.clj#L225-L230

aphyr commented 3 years ago

I thiiiink what's happening here is that ConcurrentGenerator's update calls jepsen.generator/thread->group based on the generated event, but it's being given the new context resulting from process 1's crash. If you take a look at

https://github.com/jepsen-io/jepsen/blob/28aa0d4f9043d952ed82f21795a57affd0b26d46/jepsen/src/jepsen/generator/interpreter.clj#L232

There's where we perform the generator update--and we use the new context, rather than the same context that produced the operation. This is... a little tricky, but I think the right call here is to use some, but not all, of the context changes: specifically, we want to preserve the new timestamp and thread availability, but leave the thread->process mapping in place for the update, and only after update cycle to a new thread.

I'm busy with a client right now so I don't have time to write a full test case, but I put together a small patch which doesn't break any of the existing tests. Want to try out 8b9f04cc and let me know if that helps?

aphyr commented 3 years ago

Just wanted to let you know--I've had a chance to test this properly, and I've released it under 0.2.1-SNAPSHOT--you should be able to bump your Jepsen dependency and have it work. :)

yito88 commented 3 years ago

@aphyr Thank you so much! Our test worked well with https://github.com/jepsen-io/jepsen/commit/8b9f04cc46426b5c6a3c556777ec69c883c5b013 without the error 👍