jlamprou / Infini-Attention

Efficient Infinite Context Transformers with Infini-attention: PyTorch implementation + QwenMoE implementation + training script + 1M-context passkey retrieval
https://arxiv.org/abs/2404.07143

CUDA out of memory #4

Closed: riou-chen closed this 4 months ago

riou-chen commented 4 months ago

How can I train Qwen1.5-MoE-A2.7B on 8 A100 GPUs?

jlamprou commented 4 months ago

@riou-chen The pre-trained model may just barely fit on a single A100 40GB with batch size 1, bf16, and an 8-bit optimizer. Otherwise, if you have multiple A100s, I suggest either mapping the model across the GPUs or using FSDP with Accelerate. Lastly, you could create a QwenConfig with fewer parameters, but you would have to pretrain it on billions of tokens before getting good performance.
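
For reference, here is a minimal sketch of the single-GPU route (bf16 weights, batch size 1, 8-bit optimizer) and of mapping the model across GPUs. The checkpoint name is the stock Qwen1.5-MoE-A2.7B and the learning rate is a placeholder; the actual Infini-attention Qwen classes come from this repo's training script, so treat this as an illustration rather than the exact setup:

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: swap in the repo's Infini-attention QwenMoE model/config.
model_name = "Qwen/Qwen1.5-MoE-A2.7B"

# bf16 roughly halves weight memory vs. fp32; device_map="auto" spreads layers
# across all visible GPUs (naive model parallelism) instead of loading on one card.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# 8-bit AdamW keeps optimizer state quantized, cutting its memory roughly 4x vs. fp32 Adam.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5)  # lr is illustrative
```

And a sketch of the FSDP-with-Accelerate option, which shards parameters, gradients, and optimizer state across the 8 GPUs instead of replicating them. The plugin settings below are illustrative defaults, not tuned values:

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullOptimStateDictConfig,
    FullStateDictConfig,
)

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
)
accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=fsdp_plugin)

# For FSDP, load the model without device_map and let Accelerate shard it.
# model, optimizer, and train_loader are assumed to be defined as in the training script.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

# Launch with something like:
#   accelerate launch --num_processes 8 <your_training_script>.py
# (after running `accelerate config` and selecting FSDP), adjusting to the repo's script name.
```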