microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

Need job usage report #2127

Open scarlett2018 opened 5 years ago

scarlett2018 commented 5 years ago

Metrics needed: queuing time, job status summary, job completion without system errors, completion rate of long-running jobs, etc. Also a user usage summary and a VC usage summary.

Breakdowns: by time, by status, by user.

USTC Xinwei has shared sample scripts offline.

Retries should also be considered. For failed jobs, the failure reasons should also be reported, for ops improvement and DRI.
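A minimal sketch of how such a report could be aggregated once job records are available. The record fields (`user`, `status`, `submitted`, `started`, `retries`, `failure_reason`) and the sample data are hypothetical, chosen only for illustration; they are not the launcher's actual schema.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical job records; field names and values are illustrative only.
JOBS = [
    {"name": "job-a", "user": "alice", "status": "SUCCEEDED",
     "submitted": "2019-02-01T10:00:00", "started": "2019-02-01T10:05:00",
     "retries": 0, "failure_reason": None},
    {"name": "job-b", "user": "alice", "status": "FAILED",
     "submitted": "2019-02-01T11:00:00", "started": "2019-02-01T11:01:00",
     "retries": 2, "failure_reason": "system"},
    {"name": "job-c", "user": "bob", "status": "FAILED",
     "submitted": "2019-02-01T12:00:00", "started": "2019-02-01T12:10:00",
     "retries": 0, "failure_reason": "user"},
    {"name": "job-d", "user": "bob", "status": "SUCCEEDED",
     "submitted": "2019-02-01T13:00:00", "started": "2019-02-01T13:04:00",
     "retries": 1, "failure_reason": None},
]


def queuing_seconds(job):
    """Seconds between submission and start of a job."""
    submitted = datetime.fromisoformat(job["submitted"])
    started = datetime.fromisoformat(job["started"])
    return (started - submitted).total_seconds()


def usage_report(jobs):
    """Summarize jobs by status and by user, with queuing time, retries and failure reasons."""
    by_status = Counter(job["status"] for job in jobs)

    by_user = defaultdict(Counter)
    for job in jobs:
        by_user[job["user"]][job["status"]] += 1

    # Count failure reasons for failed jobs only.
    failure_reasons = Counter(
        job["failure_reason"] for job in jobs if job["status"] == "FAILED"
    )

    mean_queue = (
        sum(queuing_seconds(job) for job in jobs) / len(jobs) if jobs else 0.0
    )

    return {
        "by_status": dict(by_status),
        "by_user": {user: dict(counts) for user, counts in by_user.items()},
        "failure_reasons": dict(failure_reasons),
        "mean_queuing_seconds": mean_queue,
        "total_retries": sum(job["retries"] for job in jobs),
    }


report = usage_report(JOBS)
print(report)
```

A "by time" breakdown could be added the same way by grouping on the date part of `submitted`.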

xudifsd commented 5 years ago

This info can be retrieved from the launcher at :9086/v1/Frameworks. Combined with the info from #2073, we can calculate the resources that were wasted (jobs that finally failed or were killed).
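The wasted-resource calculation could look roughly like the sketch below. It assumes each job record already carries a final state, a GPU count and a runtime in hours; those field names and the sample values are hypothetical and do not reflect the launcher's response format or the output of #2073.

```python
# Hypothetical per-job records; field names and values are illustrative only.
JOB_USAGE = [
    {"name": "job-a", "user": "alice", "final_state": "SUCCEEDED", "gpus": 4, "runtime_hours": 2.0},
    {"name": "job-b", "user": "alice", "final_state": "FAILED", "gpus": 2, "runtime_hours": 3.0},
    {"name": "job-c", "user": "bob", "final_state": "KILLED", "gpus": 1, "runtime_hours": 0.5},
]

# Final states whose consumed resources are counted as wasted.
WASTED_STATES = {"FAILED", "KILLED"}


def gpu_hours(job):
    """GPU-hours consumed by one job."""
    return job["gpus"] * job["runtime_hours"]


def wasted_gpu_hours(jobs):
    """Total GPU-hours consumed by jobs that finally failed or were killed."""
    return sum(gpu_hours(job) for job in jobs if job["final_state"] in WASTED_STATES)


def wasted_fraction(jobs):
    """Share of all consumed GPU-hours that went to failed or killed jobs."""
    total = sum(gpu_hours(job) for job in jobs)
    return wasted_gpu_hours(jobs) / total if total else 0.0


wasted = wasted_gpu_hours(JOB_USAGE)
fraction = wasted_fraction(JOB_USAGE)
print(f"wasted GPU-hours: {wasted}, fraction of total: {fraction:.2f}")
```

Whether killed jobs should count as wasted in every case (for example, jobs stopped by the user on purpose) is a policy choice the report would need to settle.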

scarlett2018 commented 5 years ago

> This info can be retrieved from the launcher at :9086/v1/Frameworks. Combined with the info from #2073, we can calculate the resources that were wasted (jobs that finally failed or were killed).

Would you like to merge #2073 with this item? What's the estimate for having both #2127 and #2073 in place? Let's combine them if it makes tracking easier.

xudifsd commented 5 years ago

No, I think #2073 is relatively easy to implement, but this one requires more effort. Let's track them in separate issues.

scarlett2018 commented 4 years ago

Some experiments on a job dashboard were done in PowerBI. It's time to revisit whether they work well for v1.x, and whether there are any new needs for better understanding overall job utilization.