Describe the feature
A recent paper, "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection" (https://arxiv.org/pdf/2403.03507.pdf), demonstrates a remarkably memory-efficient approach to training large language models (LLMs): instead of keeping full-rank optimizer states, it projects gradients into a low-rank subspace before the optimizer update.
Can we integrate this memory-efficient technique into the Colossal-AI framework?
FYI
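For context, the core of GaLore is compact enough to sketch. Below is a minimal, hypothetical illustration of the gradient low-rank projection step, not Colossal-AI's actual API; the class name `GaLoreProjector` and the default hyperparameters are assumptions for illustration only.

```python
import torch


class GaLoreProjector:
    """Minimal sketch of GaLore's low-rank gradient projection (illustrative, not a Colossal-AI API)."""

    def __init__(self, rank: int = 4, update_gap: int = 200, scale: float = 0.25):
        self.rank = rank              # r: rank of the gradient subspace
        self.update_gap = update_gap  # T: recompute the subspace every T steps
        self.scale = scale            # alpha: scale applied when projecting back
        self.ortho = None             # P: (m x r) projection matrix
        self.step = 0

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        # Periodically refresh P from the top-r left singular vectors of the gradient.
        if self.ortho is None or self.step % self.update_gap == 0:
            u, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
            self.ortho = u[:, : self.rank].to(grad.dtype)
        self.step += 1
        return self.ortho.T @ grad  # R = P^T G, shape (r x n)

    def project_back(self, low_rank_update: torch.Tensor) -> torch.Tensor:
        # Map the optimizer's low-rank update back to full shape: G~ = alpha * P * R
        return self.scale * (self.ortho @ low_rank_update)
```

The memory saving comes from keeping the optimizer states (e.g. Adam's moment estimates) on the projected (r x n) gradient rather than the full (m x n) matrix, so an integration would mainly need to hook this projection into the optimizer step.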