jepsen-io / knossos

Verifies the linearizability of experimentally accessible histories.
398 stars 31 forks source link

Question about checking linearizability of a history with a custom model #26

Closed metanet closed 6 years ago

metanet commented 6 years ago

Hi there,

I am not sure if I should open this issue to this repo or the jepsen repo. I am giving a shot here and can port the issue to the jepsen repo if requested.

I have a very simple jepsen test taking 20 seconds and injecting no faults in the environment. It is a mutex test for now, but I will use it for testing re-entrant mutex in future once I make this version work.

In my test, i have 5 clients. Each client is acquiring and releasing the same distributed lock object repeatedly.

Here is my generator:

     :generator (->> [{:type :invoke, :f :acquire}
                                      ;(gen/sleep 1)
                                      {:type :invoke, :f :release}
                                      (gen/sleep 1)]
                                     cycle
                                     gen/seq
                                     gen/each
                                     (gen/stagger 1/10))

Here is my custom model:

(defrecord ReentrantMutex [owner lockCount]
  Model
  (step [this op]
    (if (nil? (:value op))
      (knossos.model/inconsistent "no owner!")
      (condp = (:f op)
        :acquire (if (or (nil? owner) (= owner (:value op)))
                   (createReentrantMutex this (:value op) (inc lockCount) op)
                   (do
                     ;(info (str "cannot acquire owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot acquire")))
        :release (if (or (nil? owner) (not= owner (:value op)))
                   (do
                     ;(info (str "cannot release owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot release"))
                   (if (= lockCount 1)
                     (createReentrantMutex this nil 0 op)
                     (createReentrantMutex this owner (dec lockCount) op))))))

  Object
  (toString [this] (str "owner: " owner ", lockCount: " lockCount)))

(defn createEmptyReentrantMutex []
  (let [
        initial (ReentrantMutex. nil 0)
        ;_ (info (str "initial model: " (System/identityHashCode initial)))
        ]
    initial)
  )

(defn createReentrantMutex [prev owner lockCount op]
  (
    let [
         newModel (ReentrantMutex. owner lockCount)
         ;_ (info (str "owner: " owner ", count: " lockCount ", op: " op " new hash: " (System/identityHashCode newModel) " prev owner: " (:owner prev) " prev lock count: " (:lockCount prev) " prev hash:" (System/identityHashCode prev)))
         ]
    newModel)
  )

My model is the ReentrantMutex defrecord. Initially, it is ReentrantMutex(nil, 0), means nobody holding the lock. For instance, if n1 acquires the lock, it will be ReentrantMutex("n1", 1).

When I run this test for 20 seconds without introducing any failure in the system, I get a history of operations.

I am sharing my history files and the timeline file here. history.txt history.edn.txt timeline.html.txt

Sorry for the file extensions. Please rename "history.edn.txt" to "history.edn" and "timeline.html.txt" to "timeline.html.

In short, the history of operations is this one: N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ => N4_REL => N1_ACQ => N1_REL => N5_ACQ => N5_REL => N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ => N4_REL => N1_ACQ => N1_REL => N5_ACQ => N5_REL => N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ

So the lock is held by N3, N2, N4, N1, N5, N3, N2, N4, N1, N5, N3, N2, N4 in order. You can verify this by checking the history files and the timeline file.

The thing is, when the analyzer starts running with my model, it never terminates (when info logs above are commented out). I cannot say that I fully understand how it works internally but I see that it is walking through possible histories based on concurrency of operations, but each one failing with one of the knossos.model/inconsistent returns in my model. Moreover, as I run it with logs printed, I see states like ReentrantMutex("n2", 555). Probably, I am misunderstanding something, but I don't expect to see such states because acquire and release operations of the same client are sequential. Since I have a short history, I expect the analyzer to complete the verification step in short time. Therefore, I suspect I am doing something wrong in my model, or using the tool in some wrong way.

I have tried many things, but couldn't manage to make this model work. Could you help me with this issue? Any hint or a pointer would be very helpful.

Thanks in advance.

aphyr commented 6 years ago

Having two definitions for createReentrantMutex looks a little unusual to me--I think normally this would be (defn createReentrantMutex ([] ...) ([arg1 arg2 arg3] ...)), but if it compiles and you're not getting arity exceptions, maybe it works! This miiiight be due to an infinite state space interacting poorly with model memoization code that runs before the analysis begins; we normally test with finite, compact spaces, and this one is countably infinite, so that miiiight be what's going on. Can you get a thread dump of the process using jstack or similar?

metanet commented 6 years ago

Thanks for the response.

I know it is not idiomatic clojure :) I am a clojure newbie. The code above works fine. The very first model objects are created with the createEmptyReentrantMutex function and then the rest goes on ReentrantMutex and createReentrantMutex.

Btw, I also run this test for 10 secs, only with a few acquire and releases, but still no luck.

I will get back to you with a thread dump.

metanet commented 6 years ago

Here is a thread dump for another run of the same test. dump2.txt

aphyr commented 6 years ago

Yep, you're being bit by the memorization phase, which is designed for finite state spaces--but you've got an infinite state space with this model. You're in luck, I actually have some time today for the first time in months, and I know how to fix this, so I'll take a shot at it later on :)On Aug 17, 2018 05:31, Ensar Basri Kahveci notifications@github.com wrote:Here is a thread dump for another run of the same test. dump2.txt

—You are receiving this because you commented.Reply to this email directly, view it on GitHub, or mute the thread.

metanet commented 6 years ago

that would be very helpful. thanks a lot.

looking forward to get the fix.

metanet commented 6 years ago

I have been cleaning up my code to give you a reproducer. You can run the test via https://github.com/metanet/jepsen/tree/custom-model if you like. Just cd into the hazelcast directory and hit lein run test --workload lock --time-limit 20.

aphyr commented 6 years ago

Right, I've released 0.3.3-SNAPSHOT which aborts memoization for large state spaces. #25, you may also find this of use! Would y'all like to give that a shot and see if it helps?

metanet commented 6 years ago

thanks a lot. I will re-run it and ping you back.

metanet commented 6 years ago

I updated the knossos dependency of jepsen to 0.3.4-SNAPSHOT, re-installed jepsen, and run the test, but still encountering the same issue :(

metanet commented 6 years ago

ok, I have some good news. your explanation about the search space helped me to simplify my model a little bit. My goal is to test reentrancy, such that a client acquires the lock 2 times, and releases it. However, my model turned out to be too loose for this test. So I made the following simple change to reduce the search space and now the analysis step terminates successfully.

Thanks for your help.

(defrecord ReentrantMutex [owner lockCount]
  Model
  (step [this op]
    (if (nil? (:value op))
      (knossos.model/inconsistent "no owner!")
      (condp = (:f op)
        :acquire (if (and (< lockCount 3) (or (nil? owner) (= owner (:value op)))) ;;; <<<=== CHANGED LINE
                   (createReentrantMutex this (:value op) (inc lockCount) op)
                   (do
                     ;(info (str "cannot acquire owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot acquire")))
        :release (if (or (nil? owner) (not= owner (:value op)))
                   (do
                     ;(info (str "cannot release owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot release"))
                   (if (= lockCount 1)
                     (createReentrantMutex this nil 0 op)
                     (createReentrantMutex this owner (dec lockCount) op))))))