microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

OpenPAI Backlog #4512

Open scarlett2018 opened 4 years ago

scarlett2018 commented 4 years ago

This issue is a long-term backlog for planning and discussion. Please feel free to add whatever is top of mind to this issue directly.

Last Updated on 02/01/2021

Top Focus Scenario


Multi-Cloud/Multi-Cluster

Job management

AutoScaler

Engineering Excellence

More examples for the current examples

Add user team support

Installation experience

fanyangCS commented 4 years ago

https://github.com/microsoft/pai/issues/3872

scarlett2018 commented 3 years ago

Backup of historical backlog info, for reference only.

Brainstorming on 2020/09/10

~[Planned in #4898] Cell as sku in hived scheduler.~

~[Planned in #4898] Support dynamic sku types in different vc.~

@hzy46:

~[Planned in #4898] pod/event watcher to support #4649~

~[Planned in #4898] multi cluster management detailed design & review~

@suiguoxin:

#4789

~[Planned in #4898] Alert-manager: Kill low-gpu-utilization jobs, tag abnormal jobs, cordon node with k8s API on GPU ECC error~

~[Planned in #4898] Job tags: DB table, rest-server API, web-portal refactor~

@yiyione:

~[Planned in #4898] Group management page in webportal~

~[Planned in #4898] VC request management for user and admin~

@debuggy:

Per Task Retry History

Brainstorming on 5/14 (last updated on 5/28)

Multi-Cloud/Multi-Cluster @hzy46

AutoScaler @ydye

Support for large scale cluster

GPU Utilization

More examples for the current examples

Add user team support

Surface more backend errors to users on the Job Details page (#4649) @qfyin, @yqwang-ms. Potential failure case: storage mount failed.

Installation experience

These are incomplete items from v0.17


Engineering Excellence


Low Priority Postponed items