upperwal / EntangledMPI

Fault Tolerance framework for High Performance Computing [Supports ULFM, replication and checkpointing]
MIT License

Process Manager #17

Closed upperwal closed 6 years ago

upperwal commented 6 years ago

Process Manager will be responsible for:

  1. Predicting node failure probability (future work).
  2. Creating the replication map for the current job.
  3. Monitoring for failed nodes and removing them from the replication map.
  4. Updating the replication map at a set time interval.
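
To make responsibilities 2 to 4 concrete, here is a minimal sketch in Python. It is not EntangledMPI's actual implementation: the function names, the round-robin assignment policy, and the use of node name strings are all assumptions made for illustration. It assumes a replication map is a mapping from a logical job rank to the list of nodes currently hosting a copy of that rank.

```python
from typing import Dict, List, Set

# Hypothetical representation: logical job rank -> nodes holding a replica.
ReplicationMap = Dict[int, List[str]]


def build_replication_map(nodes: List[str], num_jobs: int) -> ReplicationMap:
    """Assign nodes to logical ranks round-robin (an assumed policy).

    Nodes beyond the first `num_jobs` become extra replicas.
    """
    if num_jobs <= 0:
        raise ValueError("num_jobs must be positive")
    if len(nodes) < num_jobs:
        raise ValueError("need at least one node per logical rank")
    rep_map: ReplicationMap = {job: [] for job in range(num_jobs)}
    for i, node in enumerate(nodes):
        rep_map[i % num_jobs].append(node)
    return rep_map


def remove_dead_nodes(rep_map: ReplicationMap, dead: Set[str]) -> ReplicationMap:
    """Return a new map with every failed node dropped from every rank."""
    return {
        job: [node for node in replicas if node not in dead]
        for job, replicas in rep_map.items()
    }


def unprotected_jobs(rep_map: ReplicationMap) -> List[int]:
    """Logical ranks with no live replica left (these would need recovery)."""
    return [job for job, replicas in rep_map.items() if not replicas]


# Example: five nodes, two logical ranks, then nodes n1 and n3 fail.
initial = build_replication_map(["n0", "n1", "n2", "n3", "n4"], num_jobs=2)
updated = remove_dead_nodes(initial, dead={"n1", "n3"})
lost = unprotected_jobs(updated)
```

A periodic update (item 4) could call `remove_dead_nodes` on a timer with whatever failure detector is available. A rank that shows up in `unprotected_jobs` has lost all of its replicas and would have to fall back to another mechanism such as a checkpoint restart.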

upperwal commented 6 years ago

23 Review Required

upperwal commented 6 years ago

Not that good but good to go.