volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0
553 stars, 26 forks

[DTensor & DModule & DOptim] feature updates #20

Closed: jc-bytedance closed this 4 months ago

jc-bytedance commented 4 months ago

In this PR, we update several features in our DTensor, DModule, and DOptim implementations.

DTensor Updates:

  1. Support more DTensor ops.
  2. Sharding strategy updates.

DModule Updates:

  1. Decouple uneven support from the run check.
  2. Reduce some CPU overhead.

DOptim Updates:

  1. Friendlier API.
  2. Unit test updates.
  3. Reorder some communication for better results.

Other Updates/fixes:

  1. Some minor updates to our nanoGPT model and test results.