[QUESTION] How is veScale ZeRO-2 implemented?

volcengine / veScale

A PyTorch Native LLM Training Framework

http://vescale.xyz

Apache License 2.0


Open starstream opened 2 weeks ago

starstream commented 2 weeks ago

How is ZeRO-2 implemented in veScale? Is Megatron's distributed optimizer the ZeRO-2 implementation?

starstream commented 2 weeks ago

The reduce-scatter of the grad buffer seems to work the same way when the distributed optimizer is used. Where are the gradients discarded?
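To make the question concrete, here is a minimal pure-Python sketch of the general ZeRO-2 pattern being asked about: gradients are reduce-scattered so that each rank ends up with only its own reduced shard, after which the full-size gradient buffer can be released. This is a simulation under stated assumptions, not veScale or Megatron code. The function `reduce_scatter` and the variables `full_grads` and `grad_shards` are hypothetical names, and no real communication library is used.

```python
# Pure-Python simulation of ZeRO-2 style gradient sharding.
# Hypothetical names throughout; not veScale or Megatron code.

def reduce_scatter(grad_buffers):
    """Sum flat gradient buffers across ranks; rank r keeps only shard r."""
    world_size = len(grad_buffers)
    numel = len(grad_buffers[0])
    assert numel % world_size == 0, "sketch assumes an evenly divisible buffer"
    shard = numel // world_size
    return [
        [sum(buf[i] for buf in grad_buffers)
         for i in range(r * shard, (r + 1) * shard)]
        for r in range(world_size)
    ]

# Two simulated ranks, each holding a full local gradient buffer of 4 elements.
full_grads = [
    [1.0, 2.0, 3.0, 4.0],      # rank 0
    [10.0, 20.0, 30.0, 40.0],  # rank 1
]

# After the reduce-scatter, rank r owns only the summed values of shard r.
grad_shards = reduce_scatter(full_grads)

# "Discarding" in the ZeRO-2 sense: the full-size buffers are no longer
# needed once each rank holds its reduced shard, so they can be freed.
full_grads = None
```

In this sketch the discard is the final step, where the reference to the full buffers is dropped. Where a real framework does this (or whether it keeps the full buffer alive and only uses a view of the local shard) is exactly what the question is asking.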