microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License
2.63k stars 548 forks

Enrich job debugging info #4649

Open yqwang-ms opened 4 years ago

yqwang-ms commented 4 years ago

More job debugging info:

  1. Current Task Status: per-task exit info, retry count, node name, retry history, etc. Resolves: https://github.com/microsoft/pai/issues/4348, https://github.com/microsoft/pai/issues/4323
  2. Current K8S Event. Resolves: https://github.com/microsoft/pai/issues/3572
  3. History Task Status
  4. History K8S Event
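
For illustration only, the per-task information in item 1 (exit info, retry count, node name, retry history) could be grouped into a record like the sketch below. Every class and field name here is hypothetical and is not taken from the OpenPAI REST API or the framework controller. The sketch only shows how "current" and "history" status might be derived from one list of attempts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskAttempt:
    """One attempt of a task (hypothetical shape, not an OpenPAI type)."""
    attempt_index: int
    node_name: Optional[str]
    exit_code: Optional[int]
    exit_reason: Optional[str] = None


@dataclass
class TaskDebugInfo:
    """Per-task debugging record (hypothetical shape, not an OpenPAI type)."""
    task_role: str
    task_index: int
    attempts: List[TaskAttempt] = field(default_factory=list)

    @property
    def retry_count(self) -> int:
        # Retries are the attempts made after the first one.
        return max(len(self.attempts) - 1, 0)

    @property
    def current(self) -> Optional[TaskAttempt]:
        # The latest attempt represents the current task status.
        return self.attempts[-1] if self.attempts else None

    @property
    def history(self) -> List[TaskAttempt]:
        # All earlier attempts form the task's status history.
        return self.attempts[:-1]


info = TaskDebugInfo(
    task_role="worker",
    task_index=0,
    attempts=[
        TaskAttempt(0, "node-a", 137, "killed"),
        TaskAttempt(1, "node-b", None),
    ],
)
```

In this sketch, `info.current` is the attempt on `node-b`, and `info.history` holds the failed attempt on `node-a`. The same split would let a UI show the current status first and the history on demand.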

debuggy commented 4 years ago

#4667 #4670

fanyangCS commented 4 years ago

Related to https://github.com/microsoft/pai/issues/3572 and https://github.com/microsoft/pai/issues/4141