volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0

[DTensor&DModule&DDP&Examples] feature updates and new examples #35

Closed: lichen225 closed this 3 months ago

lichen225 commented 3 months ago

In this PR, we add two examples and update some features in DTensor, DModule, and DDP.

Examples

  1. 4D fine-tuning of the llama2_3b model.
  2. 4D pretraining of a Mixtral MoE-based model.

DTensor

  1. Update op strategies for DTensors with Partial and InterleavedShard placements.
  2. Add all-to-all communications.
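
For readers unfamiliar with the collective, the sketch below simulates what an all-to-all exchange computes, in plain Python with no real communication. It is not the DTensor implementation in this PR; the function name `simulate_all_to_all` and the list-of-lists buffer layout are assumptions made only for illustration.

```python
from typing import List, TypeVar

T = TypeVar("T")


def simulate_all_to_all(send_buffers: List[List[T]]) -> List[List[T]]:
    """Simulate an all-to-all exchange among `world_size` ranks.

    send_buffers[src][dst] is the chunk that rank `src` sends to rank `dst`.
    The result satisfies recv_buffers[dst][src] == send_buffers[src][dst],
    i.e. the exchange is a transpose of the (rank, chunk) grid.
    """
    world_size = len(send_buffers)
    for rank, chunks in enumerate(send_buffers):
        if len(chunks) != world_size:
            raise ValueError(
                f"rank {rank} has {len(chunks)} chunks, expected {world_size}"
            )
    return [
        [send_buffers[src][dst] for src in range(world_size)]
        for dst in range(world_size)
    ]


# Hypothetical example: three ranks, each holding one chunk per destination.
send = [
    ["a0", "a1", "a2"],  # chunks held by rank 0
    ["b0", "b1", "b2"],  # chunks held by rank 1
    ["c0", "c1", "c2"],  # chunks held by rank 2
]
recv = simulate_all_to_all(send)
```

After the exchange, rank 0 holds the first chunk from every rank, rank 1 the second, and so on. Applying the same exchange twice returns the original layout.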

DModule

  1. Support factory methods for nested submodules.

DDP

  1. Unblock gradient allreduce for sparse modules in DDP.
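
As background on the operation named above, here is a minimal plain-Python sketch of what a gradient allreduce with averaging computes across data-parallel ranks. It does not reflect how veScale's DDP handles sparse modules; `simulate_allreduce_mean` and the flat-list gradient representation are hypothetical and used only for illustration.

```python
from typing import List


def simulate_allreduce_mean(per_rank_grads: List[List[float]]) -> List[List[float]]:
    """Simulate an averaging allreduce over per-rank gradient vectors.

    per_rank_grads[r] is the flattened local gradient on rank `r`.
    Every rank ends up with the same element-wise mean.
    """
    world_size = len(per_rank_grads)
    if world_size == 0:
        raise ValueError("need at least one rank")
    length = len(per_rank_grads[0])
    if any(len(grad) != length for grad in per_rank_grads):
        raise ValueError("all ranks must hold gradients of the same length")
    # Reduce: element-wise sum across ranks, then divide by the rank count.
    averaged = [
        sum(grad[i] for grad in per_rank_grads) / world_size
        for i in range(length)
    ]
    # Broadcast: each rank receives its own copy of the reduced result.
    return [list(averaged) for _ in range(world_size)]


# Hypothetical example: two ranks with different local gradients.
grads = [[1.0, 2.0], [3.0, 4.0]]
synced = simulate_allreduce_mean(grads)
```

Because every rank ends with identical averaged gradients, the optimizer step that follows keeps the model replicas in sync.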