onyx-platform / onyx

Distributed, masterless, high performance, fault tolerant data processing
http://www.onyxplatform.org
Eclipse Public License 1.0
2.05k stars 205 forks source link

handle-exception lifecycle doesn't appear to apply to write-batch #491

Closed lbradstreet closed 8 years ago

lbradstreet commented 8 years ago

Found by jepsen. I switched out :onyx/restart-pred-fn for a restart lifecycle and the job still seems to be being killed under certain scenarios. The main one I noticed is in write-batch. I'll look into this.

lbradstreet commented 8 years ago

Yup, write-batch isn't handled like read-batch or the others. https://github.com/onyx-platform/onyx/blob/develop/src/onyx/peer/task_lifecycle.clj#L284

lbradstreet commented 8 years ago

Hmm, I guess these should already be covered by after-batch which was placed in the task-lifecycle.

lbradstreet commented 8 years ago

Ah, it isn't covered by that because it doesn't invoke those task lifecycle calls via restartable-invocation. So that's the overall issue.

lbradstreet commented 8 years ago

@MichaelDrogalis this would be a good one to take if you have some time.

Basically, I'm not seeing handle-exception be able to handle an exception thrown in write-batch.

lbradstreet commented 8 years ago

PR #505 successfully passes Jepsen tests without :onyx/restart-pred-fn.

I think there might be additional task-lifecycle stages where we should allow restarts though. For example, build-new-segments and flow-retry-segments. I think build-new-segments isn't particularly risky, however I think that when a user really wants a job to be up no matter what, then we should cover all stages of the lifecycle.

MichaelDrogalis commented 8 years ago

I remember that we discussed this a while back and decided not to restart on exceptions thrown by internal Onyx code - e.g. nothing a user could have written. If we have a bug in Onyx, it should crash hard, because it's unlikely that there's hope of recovery for doing the right thing. I think assign-windows falls under this category.

lbradstreet commented 8 years ago

I think the distinction should be whether the function is pure or not. If we're doing stateful things, they might be transitory / fixed if the peer is restarted.

I think assign-windows, and possibly flow-retry-segments falls in this category (especially because it calls emit-latency which uses monitoring).

MichaelDrogalis commented 8 years ago

Monitoring calls, even when they fail, should never crash the job. We need to be tolerant of a failure here and there to emit metrics.

I still think that even if it's a transient, stateful issue, and it's Onyx's fault, restarting gives the user a false sense of security when something wrong is lurking.

MichaelDrogalis commented 8 years ago

Merged.