hydro-project / fluent

A data-driven compute platform
Apache License 2.0

Fixes node failure protocol #43

Closed: vsreekanti closed this 6 years ago

vsreekanti commented 6 years ago

Resolves #2.

Adds a periodic check for two conditions:

  1. A node that the cluster still believes is a member has crashed. If so, we broadcast to all other nodes that the crashed node has left the cluster. We do not force the other nodes to gossip their data, because a replacement node will eventually come online and become responsible for some subset of the data.
  2. A newly spun-up node has no pod running on it. Kubernetes will spin up a new node to replace the crashed one; when we see such a node without a pod, we schedule a new pod on that machine. This pod then enters the standard node join protocol.
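
As a rough illustration of the two checks above, here is a minimal Python sketch of one pass of the periodic check. It is not the code in this PR: the names `check_cluster`, `Actions`, `known_nodes`, `reachable_nodes`, `machines`, and `machines_with_pod` are hypothetical, and the sketch assumes the caller has already gathered membership and pod placement as plain sets rather than querying Kubernetes directly.

```python
from dataclasses import dataclass, field


@dataclass
class Actions:
    """What one pass of the (hypothetical) periodic check decides to do."""
    departures: list = field(default_factory=list)        # nodes to announce as departed
    pods_to_schedule: list = field(default_factory=list)  # machines that need a pod


def check_cluster(known_nodes, reachable_nodes, machines, machines_with_pod):
    """Evaluate both conditions once and return the resulting actions.

    All four arguments are sets of identifiers supplied by the caller;
    how they are collected is outside the scope of this sketch.
    """
    actions = Actions()

    # Condition 1: the cluster thinks the node exists, but it has crashed.
    # Only a departure notice is recorded; no gossip of data is forced.
    for node in sorted(known_nodes - reachable_nodes):
        actions.departures.append(node)

    # Condition 2: a machine is up but has no pod running on it.
    # A pod scheduled there would go through the normal join protocol.
    for machine in sorted(machines - machines_with_pod):
        actions.pods_to_schedule.append(machine)

    return actions


result = check_cluster(
    known_nodes={"a", "b", "c"},
    reachable_nodes={"a", "c"},
    machines={"m1", "m2", "m3"},
    machines_with_pod={"m1", "m3"},
)
print(result.departures)        # ['b']
print(result.pods_to_schedule)  # ['m2']
```

Keeping the decision step separate from the side effects (broadcasting departures, scheduling pods) is a choice made for this sketch only; it lets the two conditions be exercised without a live cluster.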

More details here.

codecov[bot] commented 6 years ago

Codecov Report

Merging #43 into master will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #43   +/-   ##
=======================================
  Coverage   62.44%   62.44%           
=======================================
  Files          50       50           
  Lines        1462     1462           
=======================================
  Hits          913      913           
  Misses        549      549

Last update 4a3da47...d293971.