volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0
587 stars, 28 forks

[QUESTION] How is veScale ZeRO-2 implemented? #54

Open starstream opened 2 weeks ago

starstream commented 2 weeks ago

How is ZeRO-2 implemented in veScale? Is Megatron's distributed optimizer ZeRO-2?

starstream commented 2 weeks ago

Reduce-scattering the grad buffer seems to work the same way when using the distributed optimizer. Where are the grads discarded?
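
To make the question concrete, here is a minimal single-process sketch of the behavior usually meant by "ZeRO-2": gradients are reduce-scattered so each rank holds only the summed shard it owns, and the full-size gradient buffer is then released. This is not veScale or Megatron code. The names `reduce_scatter` and `zero2_grad_step` are hypothetical, the "ranks" are plain Python lists, and no real collective is called. Whether and where veScale or Megatron release the full buffer is exactly what is being asked and is not answered by this sketch.

```python
# Single-process simulation of ZeRO-2 style gradient sharding.
# Assumption: illustrative only; helper names are hypothetical and
# no torch.distributed collectives are involved.

def reduce_scatter(grad_buffers):
    """Sum the per-rank buffers elementwise and return one shard per rank."""
    world_size = len(grad_buffers)
    numel = len(grad_buffers[0])
    assert numel % world_size == 0, "buffer must be padded to a multiple of world size"
    shard_len = numel // world_size
    # Elementwise sum across ranks (the "reduce" part).
    summed = [sum(vals) for vals in zip(*grad_buffers)]
    # Rank r keeps only slice r of the summed buffer (the "scatter" part).
    return [summed[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def zero2_grad_step(grad_buffers):
    """Return each rank's gradient shard and discard the full buffers."""
    shards = reduce_scatter(grad_buffers)
    # The "discard" step: the full-size buffers are emptied, so each rank
    # is left holding only its own shard of the reduced gradients.
    for buf in grad_buffers:
        buf.clear()
    return shards


# Two simulated ranks, each with a full flat gradient buffer of 4 elements.
full_grads = [
    [1.0, 2.0, 3.0, 4.0],      # rank 0
    [10.0, 20.0, 30.0, 40.0],  # rank 1
]
shards = zero2_grad_step(full_grads)
print(shards)      # [[11.0, 22.0], [33.0, 44.0]]
print(full_grads)  # [[], []]
```

If an implementation performs the reduce-scatter but keeps the full buffer allocated for reuse in the next iteration, the communication pattern matches this sketch while the gradient memory saving does not, which may be the distinction the question is getting at.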