Open starstream opened 2 weeks ago
How is veScale ZeRO-2 implemented? Is Megatron's distributed optimizer ZeRO-2?
The reduce-scatter on the grad buffer seems to work the same way when the distributed optimizer is used. Where are the gradients discarded?
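For reference, the generic ZeRO-2 gradient step this question is about can be sketched as below. This is a single-process simulation in plain Python, not veScale or Megatron code; `reduce_scatter`, `full_grads`, and `shards` are hypothetical names used only for illustration. It assumes the flat gradient buffer divides evenly across ranks.

```python
def reduce_scatter(per_rank_grads):
    """Simulate reduce-scatter: sum the buffers of all ranks element-wise,
    then hand rank r only the r-th contiguous shard of the sum."""
    world_size = len(per_rank_grads)
    numel = len(per_rank_grads[0])
    assert numel % world_size == 0, "sketch assumes an evenly divisible buffer"
    shard_len = numel // world_size
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    return [summed[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


world_size = 4
numel = 8

# After backward, every rank holds a full-size local gradient buffer.
full_grads = [[float(rank + 1)] * numel for rank in range(world_size)]

# Each rank receives only the reduced shard it is responsible for.
shards = reduce_scatter(full_grads)

# The ZeRO-2 memory saving comes from releasing the full-size buffers here,
# so that each rank keeps only its own shard of the reduced gradients.
full_grads = None
```

In this sketch, the "discard" step is the last line: if the full-size buffer stays allocated after the reduce-scatter, gradient memory is not actually partitioned, which is the distinction the question is asking about.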