taoensso / carmine

Redis client + message queue for Clojure
https://www.taoensso.com/carmine
Eclipse Public License 1.0

Pool gives away more connections than :max-total-per-key, which later causes errors #197

Closed: visibletrap closed this issue 7 years ago

visibletrap commented 7 years ago

This happens under highly concurrent load, and it causes various kinds of error messages depending on the point at which the over-allocated connection starts being used.

The issue is likely caused by the pool, but I'd like to know your thoughts on it. Note that I'm on Carmine version 2.16.0.

Below is a set of reproducible code snippets.

I started by patching ConnectionPool to collect behavior data:

;; Patched copy of taoensso.carmine.connections/ConnectionPool that records an
;; event after each borrow/return/invalidate call on the underlying pool:
(def conn-events (atom []))

(defrecord ConnectionPool [^GenericKeyedObjectPool pool]
  IConnectionPool
  (get-conn     [_ spec] (let [r (.borrowObject pool spec)]
                           (swap! conn-events conj :get)   ; recorded after borrow
                           r))
  (release-conn [_ conn] (let [r (.returnObject pool (:spec conn) conn)]
                           (swap! conn-events conj :re1)   ; recorded after return
                           r))
  (release-conn [_ conn exception] (let [r (.invalidateObject pool (:spec conn) conn)]
                                     (swap! conn-events conj :re2) ; recorded after invalidate
                                     r))
  java.io.Closeable
  (close [_] (.close pool)))
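
As a quick sanity check of the instrumentation (a sketch, assuming the patched namespace above is loaded), a single successful wcar round trip should append one :get followed by one :re1:

(require '[taoensso.carmine :as car :refer [wcar]]
         '[taoensso.carmine.connections :refer [conn-events]])

(wcar {} (car/ping)) ; => "PONG"
@conn-events         ; expected: [:get :re1]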

This is the snippet that makes the errors happen:

(ns taoensso.carmine.main
  (:require [taoensso.carmine :as car :refer [wcar]]
            [taoensso.carmine.connections :refer [conn-events]]
            [taoensso.timbre :as timbre])
  (:gen-class))

(def succ (atom 0))

(defn -main [& args]
  (let [[redis-conn-count thread-count round-count] (mapv #(Long/parseLong %) args)]
    ;; Hammer the pool from many futures at once.
    (dotimes [_ thread-count]
      (future
        (dotimes [_ round-count]
          (try
            (wcar {:pool {:max-total-per-key redis-conn-count}} (car/ping))
            (swap! succ inc)
            (catch Exception e
              (swap! conn-events conj :err)
              (timbre/error e))))))

    ;; Lets the already-submitted futures finish so the JVM can exit.
    (shutdown-agents)
    (println "Should:" (* thread-count round-count))

    ;; On JVM exit, dump the collected pool events and the success count.
    (.addShutdownHook (Runtime/getRuntime)
                      (Thread.
                        (fn []
                          (spit "out.edn" @conn-events)
                          (println "Real:" @succ))))))

I made a standalone uberjar from these and ran this command:

$ java -jar standalone-uberjar.jar 256 500 200 

; redis conn count limit: 256
; thread count: 500
; per-thread processing rounds: 200

; The numbers may vary depending on the performance of your machine

After that, I processed out.edn with this snippet:

;; Replay the recorded events, tracking a running count (:conn) of connections
;; that have been borrowed but not yet returned/invalidated.
(def in (read-string (slurp "out.edn")))

(defrecord State [idx conn error evt])

(def ps
  (reduce
    (fn [acc [idx e]]
      (let [{{:keys [conn error] :as s} :state} acc
            next-conn (case e
                        :get (inc conn)
                        :re1 (dec conn)
                        :re2 (dec conn)
                        conn)
            next-err (if (= e :err)
                       (inc error)
                       error)
            next-state (assoc s :idx idx :conn next-conn :error next-err :evt e)]
        (-> acc
            (assoc :state next-state)
            (update :series conj next-state))))
    (let [start-state (->State 0 0 0 :start)]
      {:state start-state
       :series [start-state]})
    (map vector (next (range)) in)))

I found that the :conn value sometimes exceeds 256, which is more than the number I gave the pool as :max-total-per-key.

(take 10 (filter (comp #(> % 256) :conn) (:series ps)))
=>
(#carmine_debug.process.State{:idx 305, :conn 257, :error 0, :evt :get}
 #carmine_debug.process.State{:idx 307, :conn 257, :error 0, :evt :get}
 #carmine_debug.process.State{:idx 313, :conn 257, :error 0, :evt :get}
 #carmine_debug.process.State{:idx 315, :conn 257, :error 0, :evt :get}
 #carmine_debug.process.State{:idx 325, :conn 257, :error 0, :evt :get}
 #carmine_debug.process.State{:idx 326, :conn 258, :error 0, :evt :get}
 #carmine_debug.process.State{:idx 327, :conn 257, :error 0, :evt :re1}
 #carmine_debug.process.State{:idx 331, :conn 257, :error 0, :evt :get}
 #carmine_debug.process.State{:idx 333, :conn 257, :error 0, :evt :get}
 #carmine_debug.process.State{:idx 334, :conn 258, :error 0, :evt :get})

(apply max (map :conn (:series ps)))
=> 317

Do you have any thoughts on this?

Sorry for the long description. If you'd prefer a repo that runs with a single command, please let me know.

visibletrap commented 7 years ago

I've just tried setting both :max-total-per-key and :max-total to the same value, but it's still giving away more connections than the number I specified.
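
For reference, the options were passed through wcar's :pool map, roughly like this (a sketch; the values are the ones from the repro above, and as I understand it Carmine forwards these keys to the underlying Commons KeyedObjectPool config):

;; Sketch: both limits passed via the :pool map (values from the repro above).
(wcar {:pool {:max-total-per-key 256
              :max-total         256}}
  (car/ping))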

ptaoussanis commented 7 years ago

Hi Nuttanart, thanks for the detailed info - will try to reply tonight!

visibletrap commented 7 years ago

Thank you for looking at it, Peter.

I think I might be wrong about the pool leaking connections. On further debugging, I found that when I capture the ConnectionPool behavior in a different (more optimistic) way, the resulting connection count no longer goes over the specified value. In the first version, the :re1/:re2 event is recorded only after the connection has already been handed back to the pool, so another thread can borrow that connection and record its :get first, which inflates the derived count even though the pool itself never exceeds its limit.

;; Same patch, but the release events are now recorded *before* the connection
;; is handed back to (or invalidated in) the pool:
(def conn-events (atom []))

(defrecord ConnectionPool [^GenericKeyedObjectPool pool]
  IConnectionPool
  (get-conn     [_ spec]
    (let [r (.borrowObject pool spec)]
      (swap! conn-events conj :get)
      r))
  (release-conn [_ conn]
    (swap! conn-events conj :re1)
    (.returnObject pool (:spec conn) conn))
  (release-conn [_ conn exception]
    (swap! conn-events conj :re2)
    (.invalidateObject pool (:spec conn) conn))
  java.io.Closeable
  (close [_] (.close pool)))

The -main function can still be used to reproduce the errors.

visibletrap commented 7 years ago

I finally figured out what I had done wrong!

I increased :max-total-per-key to a high number but didn't change :max-idle-per-key. Under heavy load, this causes connections to be created and destroyed too often (thousands per minute), which in turn causes the connection errors.

I don't know what a proper value for :max-idle-per-key is. For now I've set it to the same value as :max-total-per-key and it seems fine; since then I haven't seen the connection errors anymore.
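
For reference, the settings that work for me look roughly like this (a sketch; 256 is just the value from the repro above):

;; Sketch of the working pool settings (values from the repro above).
;; Keeping :max-idle-per-key equal to :max-total-per-key stops the pool from
;; aggressively destroying idle connections and re-creating them under load.
(def conn-opts
  {:pool {:max-total-per-key 256
          :max-idle-per-key  256}})

(wcar conn-opts (car/ping))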