intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Develop algorithms for auto-tuning both GPU memory usage and training performance. #470

Open: workingloong opened this issue 1 year ago

workingloong commented 1 year ago

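Make FSDP auto-tune. Today FSDP exposes many knobs that users must tune by hand for both scaling and performance (for example, the sharding strategy, CPU offloading, backward prefetching, and the auto-wrap policy).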
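One possible shape for such an auto-tuner is a constrained search. Enumerate combinations of knobs, prune any combination whose estimated per-GPU memory exceeds the budget, and keep the fastest of the survivors. The sketch below is illustrative only and is not DLRover's design. `FSDPConfig`, `model_state_bytes` and `auto_tune` are hypothetical names. The knob values mirror FSDP's `sharding_strategy` and `backward_prefetch` options but are plain strings, so the sketch runs without PyTorch. The memory model is a toy assumption that uses ZeRO-style accounting of 16 bytes per parameter (fp16 params and grads plus fp32 Adam states) and ignores activations. In a real tuner, `measure_throughput` would run a few profiling steps.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional

# Hypothetical knob space; values are named after FSDP options but kept as
# plain strings so this sketch runs without PyTorch installed.
SHARDING_STRATEGIES = ("NO_SHARD", "SHARD_GRAD_OP", "FULL_SHARD")
BACKWARD_PREFETCH = (None, "BACKWARD_POST", "BACKWARD_PRE")


@dataclass(frozen=True)
class FSDPConfig:
    sharding_strategy: str
    backward_prefetch: Optional[str]


def model_state_bytes(num_params: float, world_size: int, strategy: str) -> float:
    """Toy per-GPU bytes for params, grads and Adam states (2 + 2 + 12 per param).

    Activations and communication buffers are ignored (an assumption).
    """
    if strategy == "NO_SHARD":  # everything replicated on every GPU
        return 16 * num_params
    if strategy == "SHARD_GRAD_OP":  # params replicated; grads and optimizer states sharded
        return 2 * num_params + 14 * num_params / world_size
    if strategy == "FULL_SHARD":  # params, grads and optimizer states all sharded
        return 16 * num_params / world_size
    raise ValueError(f"unknown strategy {strategy!r}")


def auto_tune(
    candidates: Iterable[FSDPConfig],
    estimate_memory: Callable[[FSDPConfig], float],
    measure_throughput: Callable[[FSDPConfig], float],
    memory_budget: float,
) -> Optional[FSDPConfig]:
    """Return the fastest candidate whose estimated memory fits, or None."""
    best, best_tput = None, float("-inf")
    for cfg in candidates:
        if estimate_memory(cfg) > memory_budget:
            continue  # prune before paying for a profiling run
        tput = measure_throughput(cfg)
        if tput > best_tput:
            best, best_tput = cfg, tput
    return best


candidates = [FSDPConfig(s, p) for s, p in product(SHARDING_STRATEGIES, BACKWARD_PREFETCH)]
num_params, world_size = 7e9, 8
budget = 80 * 1024**3  # hypothetical 80 GiB per GPU

# Made-up scores standing in for measured throughput; not real benchmarks.
toy_speed = {"NO_SHARD": 1.0, "SHARD_GRAD_OP": 0.9, "FULL_SHARD": 0.8}
toy_prefetch_bonus = {None: 0.0, "BACKWARD_POST": 0.05, "BACKWARD_PRE": 0.1}

best = auto_tune(
    candidates,
    estimate_memory=lambda c: model_state_bytes(num_params, world_size, c.sharding_strategy),
    measure_throughput=lambda c: toy_speed[c.sharding_strategy] + toy_prefetch_bonus[c.backward_prefetch],
    memory_budget=budget,
)
print(best)
```

With these toy numbers, `NO_SHARD` needs about 112 GB per GPU and is pruned, so the search settles on `SHARD_GRAD_OP` with `BACKWARD_PRE`. Pruning by a cheap memory estimate before measuring keeps the number of expensive profiling runs small. That matters because the knob space grows multiplicatively as more options (CPU offload, wrap policy) are added.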

github-actions[bot] commented 11 hours ago

This issue has been automatically marked as stale because it has not had recent activity.