Open lbradstreet opened 8 years ago
Theoretical issues: https://github.com/onyx-platform/onyx/issues/498
Real issues: Filter id expiry.
Configuration issues:
- Pending timeout being hit over and over again since it's too slow - could be helped by https://github.com/onyx-platform/onyx/issues/447
- Job being killed because unhandled in lifecycle.
- Heartbeats
- Message retries but non booted clients yet
- Exception handling - user, cluster partitioning will boot off nodes partitioned from the ZooKeeper quorum
- Aeron messages/starvation problem from Aeron. The peer should shut down both itself and the media driver.
- Aeron too large messages. Impossible to recover from.
- Pending timeout. Messages that take longer than the pending timeout will never finish.
- BookKeeper too large writes: unrecoverable.
- BookKeeper no quorum available - cluster will keep rebooting windowed tasks.
- Not enough peers to run jobs - some jobs may be unscheduled
- Exactly once triggers - impossible, but we attempt to be approximately once.
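To make the pending timeout item concrete, here is a toy model (not Onyx code; the function name and parameters are hypothetical). It assumes a message is replayed from scratch whenever it is not acked within the pending timeout, so a message whose processing always takes longer than the timeout is retried on every attempt and never completes.

```python
def attempts_until_acked(durations_ms, pending_timeout_ms):
    """Toy model of replay-on-timeout (an assumption, not Onyx's implementation).

    durations_ms holds the processing time of each successive attempt.
    Returns the 1-based attempt on which the message is acked, or None if
    every attempt exceeds the pending timeout.
    """
    for attempt, duration in enumerate(durations_ms, start=1):
        if duration <= pending_timeout_ms:
            return attempt  # acked before the timeout fired
        # Timeout fired first: the work is discarded and the message is replayed.
    return None


# A message that fits inside the timeout is acked on the first attempt.
fast = attempts_until_acked([500], pending_timeout_ms=1000)
# A message that always needs 1500 ms never finishes with a 1000 ms timeout.
stuck = attempts_until_acked([1500, 1500, 1500], pending_timeout_ms=1000)
```

Under this model, retrying does not help unless some attempt gets faster or the timeout is raised, which is probably worth stating explicitly in the docs.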
See https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html for examples
To start it off:
- Aeron too big messages
- Not enough peers to run jobs - some jobs may be unscheduled
- Define fault tolerance and guarantees for triggers.
- Exactly Once Aggregations (not side effects)
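For the "not enough peers" item, a small sketch of how jobs can end up unscheduled. This is an illustration only: it assumes each job needs a fixed minimum number of peers and that peers are handed out in submission order. It does not model Onyx's actual schedulers, and `schedule` is a hypothetical name.

```python
def schedule(jobs, available_peers):
    """Allocate peers to jobs in submission order (toy model, not Onyx's scheduler).

    jobs is a list of (job_name, min_peers) pairs.
    Returns (scheduled, unscheduled) lists of job names.
    """
    scheduled, unscheduled = [], []
    for name, min_peers in jobs:
        if min_peers <= available_peers:
            available_peers -= min_peers  # reserve peers for this job
            scheduled.append(name)
        else:
            unscheduled.append(name)  # job waits until more peers join
    return scheduled, unscheduled


# Six peers, three jobs: "b" cannot get its four peers and stays unscheduled.
scheduled, unscheduled = schedule([("a", 3), ("b", 4), ("c", 2)], available_peers=6)
```

The point for the documentation is that an unscheduled job is not an error state in this model. It simply waits, which users may mistake for a hang.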
Also discuss the scenarios that we've tested with Jepsen.