Closed IamSunGuangzhi closed 5 years ago
How to support task error alerts? Grafana? Alertmanager?
@IamSunGuangzhi - thanks for raising the feature request. Is this request for a PAI end user's daily training job or for a PAI admin?
Thanks for your reply, @scarlett2018 . This request is for a PAI end user's daily training job. Since PAI is a training platform, it would help if PAI could automatically monitor the status of tasks, which would make task debugging easier.
OpenPAI does not plan to support integration with DingDing-like IM tools. But we could think about providing status change subscriptions by email, a status feed, etc. Adding this to the backlog for feature design and discussion first.
Reasonable feature request, but it is not in our plans yet. We may leverage https://www.elastic.co/what-is/elasticsearch-alerting, alertmanager, or something like https://github.com/bitnami-labs/kubewatch.
For now, you can achieve this yourself, for example by polling the RestServer and sending an alert when certain conditions are met.
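A minimal sketch of that polling approach, in case it helps. It is not an official OpenPAI client: the `/api/v2/jobs` path, the bearer-token header, the `name` and `state` fields in the response, and the `FAILED`/`STOPPED` state names are all assumptions that should be checked against the REST API of your deployed version. `fetch_job_states`, `find_new_alerts`, and `poll` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: these are the state strings that should trigger an alert.
ALERT_STATES = {"FAILED", "STOPPED"}


def fetch_job_states(base_url: str, token: str) -> Dict[str, str]:
    """Return {job_name: state} from the RestServer.

    Assumption: GET {base_url}/api/v2/jobs returns a JSON list of objects
    that each carry "name" and "state" fields.
    """
    req = urllib.request.Request(
        f"{base_url}/api/v2/jobs",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        jobs = json.load(resp)
    return {job["name"]: job["state"] for job in jobs}


def find_new_alerts(
    previous: Dict[str, str], current: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Jobs that are in an alert state now and were not in that state before."""
    return [
        (name, state)
        for name, state in sorted(current.items())
        if state in ALERT_STATES and previous.get(name) != state
    ]


def poll(
    fetch: Callable[[], Dict[str, str]],
    notify: Callable[[str, str], None],
    interval_s: float = 60.0,
    max_rounds: Optional[int] = None,
) -> None:
    """Poll repeatedly and call notify(job_name, state) once per new alert."""
    previous: Dict[str, str] = {}
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        current = fetch()
        for name, state in find_new_alerts(previous, current):
            notify(name, state)
        previous = current
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_s)
```

Usage would look roughly like `poll(lambda: fetch_job_states("http://<pai-master>/rest-server", token), send_alert)`, where `send_alert` is whatever you use to deliver the message (email, an IM webhook, and so on). Note that the first round also reports jobs that had already failed before polling started.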
OK, thanks @scarlett2018 @yqwang-ms . I will try.
Thanks, closing the issue as the answer has been accepted. Please feel free to reopen if you run into any problems while applying the suggestions.
Short summary about the issue/question: I want to automatically monitor the status of tasks. When a task hits an error, I want to be notified right away, without having to add the alert manually each time I submit a job. For example, PAI could support alerts via DingDing.
OpenPAI Environment:
Anything else we need to know: NULL