microsoft / DLWorkspace

Deep Learning Workspace

Expose repair messages #1210

Closed Anbang-Hu closed 4 years ago

Anbang-Hu commented 4 years ago

TODO

  1. Clean up old RepairManager/RepairManagerAgent code
  2. @Gerhut Please help expose repair messages for nodes and jobs on the dashboard:

For nodes, the response from either /GetVC or /GetVCV2 includes two fields like:

REPAIR_MESSAGE: "Pending repair by Administrator",
REPAIR_STATE: "OUT_OF_POOL_UNTRACKED",
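
As a rough illustration of how the dashboard might consume these two node fields, here is a minimal TypeScript sketch. Only the field names `REPAIR_MESSAGE` and `REPAIR_STATE` come from this issue; the `NodeRepairInfo` type, the `describeNodeRepair` helper, and the assumption that either field may be missing are hypothetical.

```typescript
// Hypothetical shape: only the two field names are taken from the issue.
// Both are treated as optional on the assumption that healthy nodes may omit them.
interface NodeRepairInfo {
  REPAIR_STATE?: string;
  REPAIR_MESSAGE?: string;
}

// Hypothetical helper: builds a one-line label for a node row,
// or returns null when there is nothing to show.
function describeNodeRepair(node: NodeRepairInfo): string | null {
  const state = (node.REPAIR_STATE || "").trim();
  const message = (node.REPAIR_MESSAGE || "").trim();
  if (state === "" && message === "") {
    return null;
  }
  if (state !== "" && message !== "") {
    return `${state}: ${message}`;
  }
  // Only one of the two fields is present; show whichever it is.
  return state !== "" ? state : message;
}
```

Returning `null` for nodes without repair data lets the caller skip rendering entirely rather than show an empty label.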

For a job, /GetJobDetailV2 returns something like:

repairMessage: {
  message: [
    "FATAL",
    "The job is running on unhealthy node(s): some_node1(reason1), some_node2 (reason2). Please check if it is still running as expected. Node repair is waiting for the job to finish. Please kill/finish the job as soon as possible to expedite node(s) repair.",
    ""
  ],
  timestamp: 1593127892.838783
}
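
A possible way to unpack this structure on the dashboard side is sketched below in TypeScript. It assumes, without confirmation from the API, that the first element of `message` is a severity level, the second is the human-readable text, and `timestamp` is Unix time in seconds; the meaning of the third (empty) element is unknown and it is ignored here. The `RepairMessage` and `RepairNotice` types and the `parseRepairMessage` helper are hypothetical names.

```typescript
// Shape copied from the example above.
interface RepairMessage {
  message: string[];
  timestamp: number; // assumed: Unix time in seconds
}

// Hypothetical view model for the dashboard.
interface RepairNotice {
  severity: string;
  text: string;
  time: Date;
}

// Hypothetical helper: returns null when the payload is missing or has no text.
function parseRepairMessage(raw: RepairMessage | null | undefined): RepairNotice | null {
  if (!raw || !Array.isArray(raw.message) || raw.message.length < 2) {
    return null;
  }
  // Assumed layout: [severity, text, <unknown third element>].
  const severity = raw.message[0];
  const text = raw.message[1];
  if (!text) {
    return null;
  }
  // Date expects milliseconds, so convert from the assumed seconds.
  return { severity, text, time: new Date(raw.timestamp * 1000) };
}
```

If the array layout changes, only this helper would need to be updated, which keeps the rendering code independent of the wire format.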

I'm open to changing the message format if you would prefer something different.

coveralls commented 4 years ago

Pull Request Test Coverage Report for Build 3548


Totals
Change from base Build 3543: 0.0%
Covered Lines: 818
Relevant Lines: 865

💛 - Coveralls

Gerhut commented 4 years ago

Do we have a full list of repair states? I found one at https://user-images.githubusercontent.com/5781796/84816053-c7e1b780-afc8-11ea-986f-f09d01b4d78d.png, but OUT_OF_POOL_UNTRACKED is not in that graph.

Anbang-Hu commented 4 years ago

> Do we have a full list of repair states? I found one at https://user-images.githubusercontent.com/5781796/84816053-c7e1b780-afc8-11ea-986f-f09d01b4d78d.png, but OUT_OF_POOL_UNTRACKED is not in that graph.

@Gerhut please refer to https://github.com/microsoft/DLWorkspace/pull/1202 for the full graph.