microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License
2.64k stars 548 forks

How to automatically monitor the status of tasks #3602

Closed IamSunGuangzhi closed 5 years ago

IamSunGuangzhi commented 5 years ago

Short summary about the issue/question: I want to automatically monitor the status of tasks. When a task hits an error, I want to be notified right away, without having to add an alert manually each time I submit a job. For example, PAI could support alerts via DingDing.

OpenPAI Environment:

Anything else we need to know: NULL

IamSunGuangzhi commented 5 years ago

How can task error alerts be supported? Through Grafana? Alertmanager?

scarlett2018 commented 5 years ago

@IamSunGuangzhi - thanks for raising the feature request. Is this request for a PAI end user's daily training job or for a PAI admin?

IamSunGuangzhi commented 5 years ago

Thanks for your reply, @scarlett2018. This request is for a PAI end user's daily training job. Since PAI is a training platform, having it automatically monitor the status of tasks would make task debugging easier.

scarlett2018 commented 5 years ago

OpenPAI does not plan to support DingDing-style IM integration. But we could consider providing status-change subscriptions by email, a status feed, etc. Adding this to the backlog for feature design and discussion first.

yqwang-ms commented 5 years ago

Reasonable feature request, but it is not in our plans yet. We may leverage https://www.elastic.co/what-is/elasticsearch-alerting, Alertmanager, or something like https://github.com/bitnami-labs/kubewatch.

For now, you can achieve this yourself, for example by polling the RestServer and sending an alert when certain conditions are met.
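The polling approach suggested here could be sketched roughly as follows. This is only an illustration, not an OpenPAI-provided tool: the function names (`fetch_job_state`, `watch_job`), the state names, the bearer-token header, and the assumed response shape `{"jobStatus": {"state": ...}}` are all assumptions that should be checked against the REST API documentation of the OpenPAI version you run. The alert sender is passed in as a callback, so it could post to email, a webhook, or an IM bot.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Assumption: state names as reported by the RestServer; adjust to your cluster.
ALERT_STATES = {"FAILED"}
TERMINAL_STATES = {"FAILED", "SUCCEEDED", "STOPPED"}


def fetch_job_state(job_url: str, token: str) -> str:
    """Fetch a job's current state from the RestServer.

    Assumption: `job_url` is the job-detail endpoint of your cluster, it
    accepts a bearer token, and it returns JSON shaped like
    {"jobStatus": {"state": "..."}}.
    """
    request = urllib.request.Request(
        job_url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = json.load(response)
    return body["jobStatus"]["state"]


def watch_job(
    get_state: Callable[[], str],
    send_alert: Callable[[str], None],
    interval_seconds: float = 30.0,
    max_polls: Optional[int] = None,
) -> Optional[str]:
    """Poll until the job reaches a terminal state, alerting on failure.

    Returns the terminal state, or None if `max_polls` is reached first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        state = get_state()
        if state in ALERT_STATES:
            send_alert(f"job entered state {state}")
        if state in TERMINAL_STATES:
            return state
        polls += 1
        time.sleep(interval_seconds)
    return None
```

A caller would wrap `fetch_job_state` with the job URL and token of their own cluster, for example `watch_job(lambda: fetch_job_state(url, token), notify)`, where `notify` is whatever function delivers the message. Keeping the fetch and alert steps as callbacks makes the loop easy to test without a live cluster.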

IamSunGuangzhi commented 5 years ago

OK, thanks @scarlett2018 @yqwang-ms . I will try.

scarlett2018 commented 5 years ago

Thanks, closing the issue as the answer has been accepted. Please feel free to reopen it if you run into any problems while applying the suggestions.