Onyx never recovers from a transactor going down

onyx-platform / onyx-datomic

Onyx plugin for Datomic

http://www.onyxplatform.org/

Eclipse Public License 1.0

37 stars 14 forks source link

Onyx never recovers from a transactor going down #23

Open greywolve opened 8 years ago

greywolve commented 8 years ago

We often have our transactor failover in production, and whenever this happens, we end up having to restart Onyx, which gets stuck. I suspect this is because the current datomic connection is never given a chance to be renewed. It's better to always call (d/connect db-uri) to get a fresh connection each time. There is no performance penalty here, because datomic caches the current connection, but should the connection drop, this is the only way to re-establish a new one. I'm happy to give you a PR for this.

MichaelDrogalis commented 8 years ago

We may be better of placing an example lifecycle exception handler that restarts the job on a failed Datomic write. While it's tempting to try and re-up the connection every time, I'm a little skittish about having Onyx re-established failed connections because things like backpressure are in play here. @lbradstreet Thoughts?

greywolve commented 8 years ago

@MichaelDrogalis could you give me an example of restarting a job from a lifecycle function?

lbradstreet commented 8 years ago

@greywolve @MichaelDrogalis I'm on the fence. What @greywolve is suggesting is probably reasonable seeing it caches the connection anyway.

That said, my best guess is that your problem is still due to "bad things happening" under certain circumstances, and the transact never being notified whether a write has failed or not, and thus never deref'ing since there is no timeout on the transact. Could we try adding a parameter for a write timeout, at which point it'll reboot the peer, and see if that fixes your problems in production?

MichaelDrogalis commented 8 years ago

@lbradstreet @greywolve I don't disagree that re-upping the connection each time is a totally reasonable thing to do -- I just believe that it doesn't truly solve the problem.

@greywolve See this section, plus the cheat sheet link, and the example below: http://www.onyxplatform.org/docs/user-guide/0.9.x/#_handle_exception

Also see this test: https://github.com/onyx-platform/onyx/blob/0.9.x/test/onyx/peer/lifecycle_exception_test.clj

Let me know if it's still not clear after reading those, we can cook up a new example.

lbradstreet commented 8 years ago

@MichaelDrogalis Yup, the main point I was trying to make is that not having a timeout on the future that the transact returns is probably the main issue.

@greywolve I had actually created this branch a while back to implement this solution https://github.com/onyx-platform/onyx-datomic/pull/24. Could you try out this solution with your code? I have hard coded the write timeout to 15s, after which it will reboot the peer. If you are able to confirm that it fixes the problem I will pull the timeout into the task-map and document it and cut a release. You can try a snapshot with version 0.9.12.1-20161107.135121-2, or pull the branch and do it yourself.

greywolve commented 7 years ago

@lbradstreet sorry I have been on leave. I actually tried to kill the transactor and bring it back up again, and Onyx carried on fine once I brought it back up (locally, with transactor dev storage). So it seems the default timeout is kicking in. The only Datomic API call which doesn't do this afaik is d/sync, which you don't need to use anywhere.

That being said, I will try out your code when I can get a staging env up that I can perform a transactor failover on, which is a bit different to just killing and upping the transactor. Once I can reproduce the bug like that, then we can try out this code. Thanks though, I'll hopefully get time to test it out soon. :) (You might still be right about the timeout, maybe it doesn't time out during the special case of a transactor failover)

lbradstreet commented 7 years ago

@greywolve thanks for the continuing investigation. Hopefully we can get a reproducible test case, because it'll be hard to get a sure-fire fix without it. I wonder if we could memory starve the transactor a bit to try to trigger such cases.

MichaelDrogalis commented 7 years ago

Thanks for the update @greywolve. :)