taoensso / carmine

Redis client + message queue for Clojure
https://www.taoensso.com/carmine
Eclipse Public License 1.0
1.15k stars 130 forks source link

Rare error in queue worker #173

Closed uaalto closed 8 years ago

uaalto commented 8 years ago

Doesn't happen consistently, but appears in logs some times. Doesn't seem related to user code in any way.

2016-06-21T14:43:19.699344+00:00 app[worker.1]: 16-06-21 14:43:19 f45c483f-1e2a-484b-b611-1e6cb10a75ef ERROR [taoensso.carmine.message-queue:272] - Worker error! Will backoff & retry.
2016-06-21T14:43:19.699354+00:00 app[worker.1]:                                                  java.lang.Thread.run              Thread.java:  745
2016-06-21T14:43:19.699355+00:00 app[worker.1]:                    java.util.concurrent.ThreadPoolExecutor$Worker.run  ThreadPoolExecutor.java:  617
2016-06-21T14:43:19.699356+00:00 app[worker.1]:                                   java.util.concurrent.FutureTask.run          FutureTask.java:  266
2016-06-21T14:43:19.699355+00:00 app[worker.1]:                     java.util.concurrent.ThreadPoolExecutor.runWorker  ThreadPoolExecutor.java: 1142
2016-06-21T14:43:19.699358+00:00 app[worker.1]:                                   clojure.core/binding-conveyor-fn/fn                 core.clj: 1938
2016-06-21T14:43:19.699357+00:00 app[worker.1]:                                                                   ...                               
2016-06-21T14:43:19.699359+00:00 app[worker.1]:             taoensso.carmine.message-queue.Worker/start-polling-loop!        message_queue.clj:  243
2016-06-21T14:43:19.699360+00:00 app[worker.1]:          taoensso.carmine.message-queue.Worker/start-polling-loop!/fn        message_queue.clj:  245
2016-06-21T14:43:19.699361+00:00 app[worker.1]:       taoensso.carmine.message-queue.Worker/start-polling-loop!/fn/fn        message_queue.clj:  245
2016-06-21T14:43:19.699358+00:00 app[worker.1]:                           taoensso.carmine.message-queue.Worker/fn/fn        message_queue.clj:  276
2016-06-21T14:43:19.699366+00:00 app[worker.1]:                                                    clojure.core/apply                 core.clj:  650
2016-06-21T14:43:19.699362+00:00 app[worker.1]:                               taoensso.carmine.protocol/-with-replies             protocol.clj:  335
2016-06-21T14:43:19.699367+00:00 app[worker.1]:                                                                   ...                               
2016-06-21T14:43:19.699363+00:00 app[worker.1]: taoensso.carmine.message-queue.Worker/start-polling-loop!/fn/fn/fn/fn        message_queue.clj:  246
2016-06-21T14:43:19.699364+00:00 app[worker.1]:                                  taoensso.carmine.message-queue/fn/fn        message_queue.clj:  147
2016-06-21T14:43:19.699364+00:00 app[worker.1]:                                                                   ...                               
2016-06-21T14:43:19.699361+00:00 app[worker.1]:    taoensso.carmine.message-queue.Worker/start-polling-loop!/fn/fn/fn        message_queue.clj:  245
2016-06-21T14:43:19.699368+00:00 app[worker.1]:                                                taoensso.carmine/eval*              carmine.clj:  160
2016-06-21T14:43:19.699365+00:00 app[worker.1]:                                                  taoensso.carmine/lua              carmine.clj:  218
2016-06-21T14:43:19.699368+00:00 app[worker.1]:                                                taoensso.carmine/eval*              carmine.clj:  170
2016-06-21T14:43:19.699370+00:00 app[worker.1]:                               taoensso.carmine.protocol/-with-replies             protocol.clj:  337
2016-06-21T14:43:19.699374+00:00 app[worker.1]:                            taoensso.carmine.protocol/get-parsed-reply             protocol.clj:  206
2016-06-21T14:43:19.699372+00:00 app[worker.1]:                         taoensso.carmine.protocol/execute-requests/fn             protocol.clj:  304
2016-06-21T14:43:19.699371+00:00 app[worker.1]:                                                   clojure.core/reduce                 core.clj: 6544
2016-06-21T14:43:19.699369+00:00 app[worker.1]:                                             taoensso.carmine/eval*/fn              carmine.clj:  170
2016-06-21T14:43:19.699372+00:00 app[worker.1]:                                                                   ...                               
2016-06-21T14:43:19.699370+00:00 app[worker.1]:                            taoensso.carmine.protocol/execute-requests             protocol.clj:  302
2016-06-21T14:43:19.699377+00:00 app[worker.1]: java.io.EOFException: 
2016-06-21T14:43:19.699375+00:00 app[worker.1]:                                      java.io.DataInputStream.readByte     DataInputStream.java:  267
2016-06-21T14:43:19.699378+00:00 app[worker.1]: 
2016-06-21T14:43:19.699374+00:00 app[worker.1]:                          taoensso.carmine.protocol/get-unparsed-reply             protocol.clj:  123
ptaoussanis commented 8 years ago

Hi there,

What version of Carmine?

uaalto commented 8 years ago

Hi!

Latest version: "2.13.0" I haven't seen this error in previous versions.

A new thing that seems to be happening too is an error I'm still investigating so I cannot tell for sure if it's internal or caused by my code:

Jun 21 08:18:36 quiet-island-5559 app/worker.1:  #error { 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:   :cause nil 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:   :via 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:   [{:type java.io.EOFException 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:     :message nil 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:     :at [java.io.DataInputStream readByte DataInputStream.java 267]}] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:   :trace 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:   [[java.io.DataInputStream readByte DataInputStream.java 267] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [taoensso.carmine.protocol$get_unparsed_reply invokeStatic protocol.clj 123] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [taoensso.carmine.protocol$get_unparsed_reply invoke protocol.clj 107] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [taoensso.carmine.protocol$get_parsed_reply invokeStatic protocol.clj 206] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [taoensso.carmine.protocol$get_parsed_reply invoke protocol.clj 196] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [taoensso.carmine.protocol$execute_requests invokeStatic protocol.clj 315] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [taoensso.carmine.protocol$execute_requests invoke protocol.clj 279] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [taoensso.carmine.protocol$_with_replies invokeStatic protocol.clj 337] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [taoensso.carmine.protocol$_with_replies invoke protocol.clj 323] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [pro_server.redis$skip_expiration_task_QMARK_$fn__13046$fn__13047 invoke redis.clj 161] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [pro_server.redis$skip_expiration_task_QMARK_$fn__13046 invoke redis.clj 161] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [pro_server.redis$skip_expiration_task_QMARK_ invokeStatic redis.clj 161] 
Jun 21 08:18:36 quiet-island-5559 app/worker.1:    [pro_server.redis$skip_expiration_task_QMARK_ invoke redis.clj 160] 
Jun 21 08:18:37 quiet-island-5559 app/worker.1:    [pro_server.redis$user_expirer_aux invokeStatic redis.clj 877] 
Jun 21 08:18:37 quiet-island-5559 app/worker.1:    [pro_server.redis$user_expirer_aux invoke redis.clj 873] 
Jun 21 08:18:37 quiet-island-5559 app/worker.1:    [pro_server.redis$make_user_expire_worker$fn__13816 invoke redis.clj 918] 
Jun 21 08:18:37 quiet-island-5559 app/worker.1:    [clojure.core$binding_conveyor_fn$fn__4676 invoke core.clj 1938] 
Jun 21 08:18:37 quiet-island-5559 app/worker.1:    [clojure.lang.AFn call AFn.java 18] 
Jun 21 08:18:37 quiet-island-5559 app/worker.1:    [java.util.concurrent.FutureTask run FutureTask.java 266] 
Jun 21 08:18:37 quiet-island-5559 app/worker.1:    [java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1142] 
Jun 21 08:18:37 quiet-island-5559 app/worker.1:    [java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 617] 
Jun 21 08:18:37 quiet-island-5559 app/worker.1:    [java.lang.Thread run Thread.java 745]]} 

Even though is triggered at a library user function, it shouldn't happen and doesn't actually happen most of the time.

This is the function:

(defn skip-expiration-task? []
  (= 1 (r/wcar redis-conn
               (r/hexists "settings" "skip-expiration-task"))))
ptaoussanis commented 8 years ago

Latest version: "2.13.0". I haven't seen this error in previous versions.

Thanks, that's helpful. Will get back to you.

ptaoussanis commented 8 years ago

Okay, may I ask you to try with [com.taoensso/carmine "2.13.1-uaalto"] (just pushed to Clojars)?

uaalto commented 8 years ago

Thank you! Right away. I'll leave this running for a while, since the error is not immediate. I'll get back to you.

ptaoussanis commented 8 years ago

No problem, take your time. Appreciate the report.

ptaoussanis commented 8 years ago

Pretty sure that this should be resolved, and about to head to sleep - so have cut [com.taoensso/carmine "2.13.1"] for general release at https://github.com/ptaoussanis/carmine/releases/tag/v2.13.1

Please do let me know though if you still run into trouble and I can take another look tomorrow.

Best of luck, cheers!

uaalto commented 8 years ago

It's been running without issues for the last 30 mins. Just wanted to wait for longer, but looks great! Thank you very much for your help and quick response @ptaoussanis !

uaalto commented 8 years ago

This one seems related and has happened after the update:

2016-06-21T17:06:40.954411+00:00 app[worker.1]:  :via
2016-06-21T17:06:40.954412+00:00 app[worker.1]:  [{:type java.io.EOFException
2016-06-21T17:06:40.954399+00:00 app[worker.1]: #error {
2016-06-21T17:06:40.954410+00:00 app[worker.1]:  :cause nil
2016-06-21T17:06:40.954419+00:00 app[worker.1]:   [taoensso.carmine.protocol$_with_replies invokeStatic protocol.clj 339]
2016-06-21T17:06:40.954420+00:00 app[worker.1]:   [taoensso.carmine.protocol$_with_replies invoke protocol.clj 325]
2016-06-21T17:06:40.954421+00:00 app[worker.1]:   [server.redis$skip_expiration_task_QMARK_$fn__12866$fn__12867 invoke redis.clj 161]
2016-06-21T17:06:40.954422+00:00 app[worker.1]:   [server.redis$skip_expiration_task_QMARK_$fn__12866 invoke redis.clj 161]
2016-06-21T17:06:40.954422+00:00 app[worker.1]:   [server.redis$skip_expiration_task_QMARK_ invokeStatic redis.clj 161]
2016-06-21T17:06:40.954423+00:00 app[worker.1]:   [server.redis$skip_expiration_task_QMARK_ invoke redis.clj 160]
2016-06-21T17:06:40.954423+00:00 app[worker.1]:   [server.redis$user_expirer_aux$fn__13578 invoke redis.clj 878]
2016-06-21T17:06:40.954424+00:00 app[worker.1]:   [server.redis$user_expirer_aux invokeStatic redis.clj 877]
2016-06-21T17:06:40.954425+00:00 app[worker.1]:   [server.redis$user_expirer_aux invoke redis.clj 873]
2016-06-21T17:06:40.954416+00:00 app[worker.1]:   [taoensso.carmine.protocol$get_unparsed_reply invoke protocol.clj 107]
2016-06-21T17:06:40.954417+00:00 app[worker.1]:   [taoensso.carmine.protocol$get_parsed_reply invokeStatic protocol.clj 208]
2016-06-21T17:06:40.954415+00:00 app[worker.1]:  [[java.io.DataInputStream readByte DataInputStream.java 267]
2016-06-21T17:06:40.954415+00:00 app[worker.1]:   [taoensso.carmine.protocol$get_unparsed_reply invokeStatic protocol.clj 123]
2016-06-21T17:06:40.954412+00:00 app[worker.1]:    :message nil
2016-06-21T17:06:40.954413+00:00 app[worker.1]:    :at [java.io.DataInputStream readByte DataInputStream.java 267]}]
2016-06-21T17:06:40.954414+00:00 app[worker.1]:  :trace
2016-06-21T17:06:40.954418+00:00 app[worker.1]:   [taoensso.carmine.protocol$get_parsed_reply invoke protocol.clj 198]
2016-06-21T17:06:40.954418+00:00 app[worker.1]:   [taoensso.carmine.protocol$execute_requests invokeStatic protocol.clj 317]
2016-06-21T17:06:40.954419+00:00 app[worker.1]:   [taoensso.carmine.protocol$execute_requests invoke protocol.clj 281]
2016-06-21T17:06:40.954425+00:00 app[worker.1]:   [server.redis$make_user_expire_worker$fn__13638 invoke redis.clj 921]
2016-06-21T17:06:40.954426+00:00 app[worker.1]:   [clojure.core$binding_conveyor_fn$fn__4676 invoke core.clj 1938]
2016-06-21T17:06:40.954427+00:00 app[worker.1]:   [clojure.lang.AFn call AFn.java 18]
2016-06-21T17:06:40.954427+00:00 app[worker.1]:   [java.util.concurrent.FutureTask run FutureTask.java 266]
2016-06-21T17:06:40.954428+00:00 app[worker.1]:   [java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1142]
2016-06-21T17:06:40.954429+00:00 app[worker.1]:   [java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 617]
2016-06-21T17:06:40.954429+00:00 app[worker.1]:   [java.lang.Thread run Thread.java 745]]}
ptaoussanis commented 8 years ago

A few questions:

  1. You're absolutely certain you're using [com.taoensso/carmine "2.13.1"] or [com.taoensso/carmine "2.13.1-uaalto"]? Can't spot any other obvious causes of this error after the hotfix. UPDATE: Suggest running lein clean to help make sure.
  2. If you drop back to [com.taoensso/carmine "2.12.2"] and make no other changes, this error definitely goes away?
  3. No other changes (e.g. to your network, Redis config, etc.) that may be having an effect / causing your Redis connections to drop unexpectedly?
  4. Can I see your redis-conn config as given to wcar, etc.?

Will check back in the morning with fresh eyes.

uaalto commented 8 years ago

This one is with 2.13.1. I'll double-check everything else. Thank you!

Since I'm deploying in Heroku, everytime it's created in a new environment and the dependencies are loaded from scratch.

2016-06-21T19:28:55.727237+00:00 app[web.1]: Retrieving com/taoensso/carmine/2.13.1/carmine-2.13. ...
uaalto commented 8 years ago

It seems that with 2.12.2, this happens too. It was just harder to reproduce (had to observe the process longer to see it happen).

Jun 21 13:52:08 app/worker.1:  #error { 
Jun 21 13:52:08 app/worker.1:   :cause nil 
Jun 21 13:52:08 app/worker.1:   :via 
Jun 21 13:52:08 app/worker.1:   [{:type java.io.EOFException 
Jun 21 13:52:08 app/worker.1:     :message nil 
Jun 21 13:52:08 app/worker.1:     :at [java.io.DataInputStream readByte DataInputStream.java 267]}] 
Jun 21 13:52:08 app/worker.1:   :trace 
Jun 21 13:52:08 app/worker.1:   [[java.io.DataInputStream readByte DataInputStream.java 267] 
Jun 21 13:52:08 app/worker.1:    [taoensso.carmine.protocol$get_unparsed_reply invokeStatic protocol.clj 139] 
Jun 21 13:52:08 app/worker.1:    [taoensso.carmine.protocol$get_unparsed_reply invoke protocol.clj 123] 
Jun 21 13:52:08 app/worker.1:    [taoensso.carmine.protocol$get_parsed_reply invokeStatic protocol.clj 214] 
Jun 21 13:52:08 app/worker.1:    [taoensso.carmine.protocol$get_parsed_reply invoke protocol.clj 204] 
Jun 21 13:52:08 app/worker.1:    [taoensso.carmine.protocol$execute_requests invokeStatic protocol.clj 321] 
Jun 21 13:52:08 app/worker.1:    [taoensso.carmine.protocol$execute_requests invoke protocol.clj 287] 
Jun 21 13:52:08 app/worker.1:    [taoensso.carmine.protocol$with_replies_STAR_ invokeStatic protocol.clj 343] 
Jun 21 13:52:08 app/worker.1:    [taoensso.carmine.protocol$with_replies_STAR_ invoke protocol.clj 329] 
Jun 21 13:52:08 app/worker.1:    [pro_server.redis$skip_expiration_task_QMARK_$fn__12902$fn__12903 invoke redis.clj 161] 
Jun 21 13:52:08 app/worker.1:    [pro_server.redis$skip_expiration_task_QMARK_$fn__12902 invoke redis.clj 161] 
Jun 21 13:52:08 app/worker.1:    [pro_server.redis$skip_expiration_task_QMARK_ invokeStatic redis.clj 161] 
Jun 21 13:52:08 app/worker.1:    [pro_server.redis$skip_expiration_task_QMARK_ invoke redis.clj 160] 
Jun 21 13:52:08 app/worker.1:    [pro_server.redis$user_expirer_aux$fn__13614 invoke redis.clj 878] 
Jun 21 13:52:08 app/worker.1:    [pro_server.redis$user_expirer_aux invokeStatic redis.clj 877] 
Jun 21 13:52:08 app/worker.1:    [pro_server.redis$user_expirer_aux invoke redis.clj 873] 
Jun 21 13:52:08 app/worker.1:    [pro_server.redis$make_user_expire_worker$fn__13674 invoke redis.clj 921] 
Jun 21 13:52:08 app/worker.1:    [clojure.core$binding_conveyor_fn$fn__4676 invoke core.clj 1938] 
Jun 21 13:52:08 app/worker.1:    [clojure.lang.AFn call AFn.java 18] 
Jun 21 13:52:08 app/worker.1:    [java.util.concurrent.FutureTask run FutureTask.java 266] 
Jun 21 13:52:08 app/worker.1:    [java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1142] 
Jun 21 13:52:08 app/worker.1:    [java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 617] 
Jun 21 13:52:08 app/worker.1:    [java.lang.Thread run Thread.java 745]]} 

It's hard to reproduce, it doesn't happen often and I can't link it to any complex logic in my code, but the function I wrote above seems to be the cause. Usually there is no key there, so that might be a clue.

ptaoussanis commented 8 years ago

It seems that with 2.12.2, this happens too.

Okay, thanks for the confirmation.

(defn skip-expiration-task? []
  (= 1 (r/wcar redis-conn
               (r/hexists "settings" "skip-expiration-task"))))

This function shouldn't cause any problems.

And have looked over the implementation again as of v2.13.1 and can't spot any other obvious potential causes of the error you're seeing. The change I made in v2.13.1 could certainly explain the problem; but if it's still happening (and even used to happen in v2.12.1), then I'm out of ideas.

Some debug suggestions:

  1. Try [com.taoensso/carmine "2.13.2-uaalto"], a long shot - don't think this will help but might be worth trying.
  2. Try catch this EOFException; when it happens - check on your Redis server's condition. Has the server possibly gone down?
  3. Are you using a connection pool? With what settings? Could something be corrupting your TCP connections? Can I see your redis-conn config as given to wcar, etc.? Please make sure you don't have any connection timeouts configured.

Otherwise you may need to try and isolate a simpler reproducible example.

Does this only happen in queue workers? Does it tend to happen after a ~predictable amount of time or work?

Any patterns you might be able to identify might be helpful.

uaalto commented 8 years ago
  1. Testing now.
  2. I'm catching the error and retrying in 5 seconds currently, which doesn't fail. The error happens only once every ~15 mins give or take. Is not predictable.
  3. These are the settings: {:pool {} :spec {:url "redis://127.0.0.1:6379"}}. I'm using Stunnel, so yes, this is a factor to consider. We've discovered is pretty inneficient.

I think I've seen this outside the worker. But it's happening in a background task which has nothing to do with Carmine's reliable queues, so I would say the answer is no, it happens outside queue workers too. Approx every 15-20mins. but is not predictable. I'd say that the previous hotfixed increased the interval of the event. Can't really say.

The function that fails just happens to be the first one that makes a Redis roundtrip in the background process, so it shouldn't be related to that one.

As I write this, the error didn't happen, I'll keep you updated of the current run.

EDIT: as of this writing it just happened after running for 8mins.

ptaoussanis commented 8 years ago

I'm using Stunnel

How easy would it be to test w/o Stunnel? Otherwise: you're certain that the Stunnel connection isn't breaking? Have you checked your Stunnel logs? Redis logs?

Other ideas:

If this error was indeed showing up with [com.taoensso/carmine "2.12.2"], then it wasn't the the upgrade to v2.13.0 that started causing issues. But if there was a time you weren't seeing this problem, then presumably something else changed at some point and triggered the problem?

I'm out of ideas on Carmine's end, sorry - would suggest starting to eliminate potential environmental or application causes.

uaalto commented 8 years ago

I've been observing the server's behaviour during the past 24h. and now the error seems to have disappeared. I'll close for now, and reopen if it happens again, then I'll try to reproduce without Stunnel.

Thank you very much for your quick responses and looking into it in detail!