microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License
2.63k stars, 548 forks

Design: PAI Platform Error Spec, Collection and Handling #2511

Open yqwang-ms opened 5 years ago

yqwang-ms commented 5 years ago

Goals

  1. Primarily for platform maintainers, but also usable by job owners, to debug platform errors in more depth.
  2. Collect, unify, and enrich errors raised inside the PAI platform itself, such as those from PAI components (services, tools, etc.), DockerD, and the OS.
  3. Collect events, logs, and metrics for PAI platform errors so that platform maintainers can analyze and diagnose them.
  4. Give maintainers an understandable error message instead of a raw exception.
  5. Give maintainers fallback suggestions or solutions.
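
To make the goals above concrete, here is a minimal sketch of how a raw exception string could be unified into a spec entry carrying a readable message and fallback suggestions. Everything in it is an assumption for illustration: the `PlatformErrorSpec` type, the `enrich` function, the error codes, and the match patterns are hypothetical and are not taken from the PAI codebase.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PlatformErrorSpec:
    """Hypothetical spec entry; field names are illustrative only."""
    code: str                     # stable identifier for the error
    source: str                   # where it originates (PAI component, DockerD, OS)
    pattern: str                  # substring expected in the raw exception text
    message: str                  # human-readable explanation for maintainers
    suggestions: Tuple[str, ...]  # fallback suggestions or solutions


# Illustrative entries; real patterns and codes would come from the actual spec.
SPECS = (
    PlatformErrorSpec(
        code="DOCKERD_UNREACHABLE",
        source="DockerD",
        pattern="Cannot connect to the Docker daemon",
        message="The Docker daemon on this node is not responding.",
        suggestions=("Check that the Docker service is running on the node.",),
    ),
    PlatformErrorSpec(
        code="OS_DISK_FULL",
        source="OS",
        pattern="No space left on device",
        message="The node has run out of disk space.",
        suggestions=("Free disk space on the node.", "Consider draining the node."),
    ),
)

# Fallback used when no spec entry matches the raw text.
UNKNOWN = PlatformErrorSpec(
    code="UNKNOWN",
    source="Unknown",
    pattern="",
    message="Unrecognized platform error.",
    suggestions=("Inspect the raw exception attached to this record.",),
)


def enrich(raw_exception: str) -> Dict[str, object]:
    """Map a raw exception string to a unified, enriched error record."""
    matched = UNKNOWN
    for spec in SPECS:
        if spec.pattern in raw_exception:
            matched = spec
            break
    return {
        "code": matched.code,
        "source": matched.source,
        "message": matched.message,
        "suggestions": list(matched.suggestions),
        "raw": raw_exception,  # keep the original text for deeper debugging
    }
```

Keeping the raw text alongside the enriched fields lets a maintainer start from the readable message and still drill into the original exception when needed.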

Future Goals

  1. Error Reaction -> Auto Maintenance (such as Node Decommission)

fanyangCS commented 4 years ago

done.

yqwang-ms commented 4 years ago

This is the error spec for the platform services themselves, not for jobs.