volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0
553 stars, 26 forks

[DModule] Open Source #10

leonardo0lyj closed this 5 months ago

leonardo0lyj commented 5 months ago

In this PR, we open source our DModule, Yo~

What is veScale DModule?

veScale DModule (Distributed Module) provides a single-device abstraction for a multi-device nn.Module and lets users write distributed training/inference code as if it ran on a single device (i.e., SPMD).

DModule unifies module-level Tensor Parallelism and Sequence Parallelism by transparently handling the distributed logic under the hood.
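To give a feel for the kind of distributed logic that tensor parallelism involves, here is a minimal single-process sketch. It is not the veScale DModule API and not how DModule is implemented. It assumes a Megatron-style scheme (first weight split by columns, second weight split by rows, partial outputs summed as a stand-in for an all-reduce) and uses plain Python lists instead of tensors so it runs without any dependencies. All function names (`mlp_single_device`, `mlp_tensor_parallel`, and the helpers) are hypothetical and exist only for this illustration.

```python
# Illustrative only: simulated "ranks" are list slices inside one process.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Elementwise ReLU."""
    return [[max(0, v) for v in row] for row in m]

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]

def mlp_single_device(x, w1, w2):
    """Reference two-layer MLP as it would be written for one device."""
    return matmul(relu(matmul(x, w1)), w2)

def mlp_tensor_parallel(x, w1, w2, world_size):
    """Same MLP with weights sharded across `world_size` simulated ranks."""
    w1_shards = split_cols(w1, world_size)  # column-parallel first layer
    w2_shards = split_rows(w2, world_size)  # row-parallel second layer
    partials = []
    for rank in range(world_size):
        hidden = relu(matmul(x, w1_shards[rank]))  # local slice of the hidden layer
        partials.append(matmul(hidden, w2_shards[rank]))  # partial output
    out = partials[0]
    for p in partials[1:]:
        out = add(out, p)  # stand-in for an all-reduce (sum) across ranks
    return out

x = [[1, -2], [3, 4]]                      # 2 x 2 input
w1 = [[1, 0, -1, 2], [2, 1, 0, -1]]        # 2 x 4 first-layer weight
w2 = [[1, 2], [0, 1], [-1, 0], [3, 1]]     # 4 x 2 second-layer weight

reference = mlp_single_device(x, w1, w2)
sharded = mlp_tensor_parallel(x, w1, w2, world_size=2)
assert reference == sharded
```

The single-device function is what the user wants to write; the sharding, per-rank compute, and reduction in the parallel version are the bookkeeping that a single-device abstraction is meant to keep out of user code.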

How does veScale DModule differ from PyTorch parallelize_module?

Credit to veScale DModule Team

This endeavor would not have been possible without the contributions of our DModule team, which includes but is not limited to: @SerailHydra @Vremold @JsBlueCat @jc-bytedance @MackZackA @leonardo0lyj.

Thanks also for the great guidance and leadership of: @liwenchangbdbz @pengyanghua @eric-haibin-lin @Meteorix.

Credit to PyTorch DTensor Team

Once again, we would like to sincerely acknowledge the assistance of and collaboration with the PyTorch DTensor team, which includes but is not limited to: @wanchaol @XilunWu @wz337 @tianyu-l @fduwjj @awgu @yifuwang @wconstab @ezyang @mrshenli.