Create failure modes doc

lbradstreet commented 8 years ago

See https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html for examples

To start it off:

- Aeron messages that are too big
- Not enough peers to run jobs - some jobs may be unscheduled
- Define fault tolerance and guarantees for triggers.
- Exactly once aggregations (not side effects)
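As a rough illustration of the "not enough peers" case, here is a minimal sketch in Python. It assumes a simple greedy allocation in which each job declares a minimum peer count and is scheduled only if that many peers are still free. The function `schedule` and the job tuples are hypothetical and are not Onyx's actual scheduler.

```python
def schedule(jobs, n_peers):
    """Greedy sketch (hypothetical, not Onyx's scheduler).

    jobs: list of (job_name, min_peers) in submission order.
    A job is scheduled only if its full minimum peer count is still free.
    """
    scheduled, unscheduled = [], []
    free = n_peers
    for name, min_peers in jobs:
        if min_peers <= free:
            free -= min_peers
            scheduled.append(name)
        else:
            # Not enough peers left: the job stays unscheduled.
            unscheduled.append(name)
    return scheduled, unscheduled


jobs = [("job-a", 3), ("job-b", 4), ("job-c", 2)]
scheduled, unscheduled = schedule(jobs, n_peers=6)
print(scheduled, unscheduled)
```

With six peers, `job-b` cannot get its four peers once `job-a` has taken three, so it is left unscheduled while the smaller `job-c` still fits.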

Also discuss the scenarios that we've tested with Jepsen.

lbradstreet commented 8 years ago

Theoretical issues: https://github.com/onyx-platform/onyx/issues/498

Real issues: Filter id expiry.
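One way to read the filter id expiry issue, sketched in Python: if exactly-once aggregation relies on a filter of already-seen message ids, and old ids are expired to bound the filter's size, then a replay that arrives after its id has expired is applied a second time. The class `ExpiringIdFilter` and its capacity-based expiry are assumptions made for illustration, not Onyx's implementation.

```python
from collections import OrderedDict


class ExpiringIdFilter:
    """Hypothetical dedup filter that keeps only the most recent ids."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.seen = OrderedDict()

    def first_time(self, msg_id):
        """Return True if msg_id is not currently in the filter."""
        if msg_id in self.seen:
            return False
        self.seen[msg_id] = True
        if len(self.seen) > self.capacity:
            # Expire the oldest id to bound memory use.
            self.seen.popitem(last=False)
        return True


def count_unique(ids, capacity):
    """Aggregate (count) messages, skipping ids the filter still remembers."""
    id_filter = ExpiringIdFilter(capacity)
    return sum(1 for msg_id in ids if id_filter.first_time(msg_id))


stream = ["a", "b", "c", "a"]  # "a" is replayed at the end
print(count_unique(stream, capacity=10))  # large filter: replay is dropped
print(count_unique(stream, capacity=2))   # small filter: "a" expired, counted twice
```

The trade-off in this sketch is between filter size and the window in which replays are still recognised as duplicates.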

Configuration issues:

- Pending timeout being hit over and over again because processing is too slow - could be helped by https://github.com/onyx-platform/onyx/issues/447
- Job being killed because an exception was unhandled in a lifecycle.

lbradstreet commented 8 years ago

- Heartbeats
- Message retries but non booted clients yet
- Exception handling - user, cluster partitioning will boot off nodes partitioned from ZooKeeper quorum
- Aeron messages/starvation problem from Aeron. The peer should kill both itself and the media driver.
- Aeron messages that are too large. Impossible to recover from.
- Pending timeout. Messages that take longer than the pending timeout will never finish.
- BookKeeper writes that are too large: unrecoverable.
- BookKeeper no quorum available - cluster will keep rebooting windowed tasks.
- Not enough peers to run jobs - some jobs may be unscheduled
- Exactly once triggers - impossible, but we attempt to be approximately once.
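The pending timeout item can be sketched as follows, in Python. The model assumes that a message whose processing exceeds the pending timeout is replayed from scratch and costs the same on every attempt. The function name and the `max_attempts` cap are hypothetical and exist only so the example terminates.

```python
def attempt_that_completes(processing_ms, pending_timeout_ms, max_attempts=5):
    """Return the attempt number on which the message completes, or None.

    Simplified model: each replay takes the same processing time, so a
    message slower than the timeout is retried without ever finishing.
    """
    for attempt in range(1, max_attempts + 1):
        if processing_ms <= pending_timeout_ms:
            return attempt
        # The timeout fires before processing ends; the message is replayed.
    return None


print(attempt_that_completes(processing_ms=500, pending_timeout_ms=1000))
print(attempt_that_completes(processing_ms=1500, pending_timeout_ms=1000))
```

In this model retries do not help on their own: only raising the timeout or making the work cheaper lets the slow message finish.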

onyx-platform/onyx #514