volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0

added veDeviceMesh #32

Closed MackZackA closed 4 months ago

MackZackA commented 4 months ago

This PR introduces veDeviceMesh, a device mesh API that unifies the handling of submeshes and process groups when training with DDP, TP/SP, the distributed optimizer, and checkpointing. It also adds the fixes and patches related to the veDeviceMesh API that have accumulated since the last PR.
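For readers unfamiliar with the idea, here is a minimal pure-Python sketch of how a 2-D device mesh can map global ranks to per-dimension submesh groups (for example, one dimension for data parallelism and one for tensor parallelism). It is not the veDeviceMesh API: `build_mesh`, `submesh_groups`, and the 2x4 DP/TP layout are hypothetical names and values chosen only for illustration, and the row-major rank layout is an assumption.

```python
from itertools import product


def build_mesh(world_size, mesh_shape):
    """Hypothetical helper: lay ranks 0..world_size-1 out row-major over mesh_shape.

    Returns a dict mapping mesh coordinates (tuples) to global ranks.
    """
    total = 1
    for dim_size in mesh_shape:
        total *= dim_size
    if total != world_size:
        raise ValueError("mesh_shape does not cover world_size")
    coords = product(*(range(dim_size) for dim_size in mesh_shape))
    return {coord: rank for rank, coord in enumerate(coords)}


def submesh_groups(mesh, dim):
    """Hypothetical helper: group ranks that differ only along mesh dimension `dim`.

    Each returned list is the set of ranks that would share one process group
    for collectives along that dimension.
    """
    groups = {}
    for coord, rank in mesh.items():
        # Drop the coordinate along `dim`; ranks with the same remainder belong together.
        key = coord[:dim] + coord[dim + 1:]
        groups.setdefault(key, []).append(rank)
    return [sorted(ranks) for _, ranks in sorted(groups.items())]


# Assumed layout: 8 ranks arranged as (DP=2, TP=4).
mesh = build_mesh(8, (2, 4))
tp_groups = submesh_groups(mesh, dim=1)  # ranks that shard one model replica
dp_groups = submesh_groups(mesh, dim=0)  # ranks that hold the same shard across replicas
```

Under this layout, each TP group holds four consecutive ranks and each DP group pairs ranks four apart. A mesh API can create one process group per such list, so DDP, TP/SP, the optimizer, and checkpointing all consult the same mapping.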