jquesnelle / yarn

YaRN: Efficient Context Window Extension of Large Language Models
MIT License

What is the recommended GPU setup for fine-tuning? #23

Closed fyang7 closed 9 months ago

fyang7 commented 9 months ago

I run into an OOM error with the default setup on 8x A100 using the train.sh script. Could you please share the GPU requirements for fine-tuning?

jquesnelle commented 9 months ago

We were able to train the 7b 64k model on an 8x A100 node -- all other models unfortunately require a multinode setup. We used 64 GPUs, but I expect 16 would suffice for all other models (7b 128k, 13b 64k, 13b 128k)
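
For reference, a hypothetical multi-node launch with the Hugging Face Accelerate launcher could look like the sketch below (8 nodes x 8 GPUs = 64 processes). finetune.py is the script referenced later in this thread; the IP address, port, and the exact spelling of the script's own flags are placeholders/assumptions, not taken from this repo's train.sh.

```bash
# Run once per node, incrementing --machine_rank from 0 to 7.
# The launcher flags are standard Accelerate options; 10.0.0.1:29500 is a placeholder.
accelerate launch \
  --num_machines 8 \
  --num_processes 64 \
  --machine_rank 0 \
  --main_process_ip 10.0.0.1 \
  --main_process_port 29500 \
  finetune.py \
  --batch-size 1 \
  --gradient-accumulate-every 8   # check finetune.py for the exact flag names
```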

fyang7 commented 9 months ago

Thanks a lot. To confirm, is the A100 for 7b 64k fine-tuning the 40 GB or the 80 GB variant?

bloc97 commented 9 months ago

It is 8x80GB for 64k context size

sadransh commented 9 months ago

Could you please clarify whether this discussion is about full-parameter fine-tuning or LoRA-based fine-tuning? @bloc97

YL-9 commented 2 months ago

> We were able to train the 7b 64k model on an 8x A100 node -- all other models unfortunately require a multinode setup. We used 64 GPUs, but I expect 16 would suffice for all other models (7b 128k, 13b 64k, 13b 128k)

I ran finetune.py on 2x A100 GPUs, and both GPUs initially loaded about 14 GB of 80 GB. After processing the first batch, memory usage went up to 77 GB of 80 GB, and then it hit an OOM error when starting the second batch. Is this normal?
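
One way to narrow this down is to log per-GPU memory once per second while training and line it up with the step logs; a jump between the first and second batch is often the optimizer states (e.g. Adam moments) being allocated on the first optimizer step, on top of activations. A minimal, generic monitoring sketch using standard nvidia-smi flags (not specific to this repo):

```bash
# Log timestamped per-GPU memory every second; compare against the training
# step logs to see whether the spike lands on the first optimizer step.
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total \
  --format=csv -l 1 > gpu_mem.log
```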

YL-9 commented 2 months ago

> It is 8x80GB for 64k context size

Can this configuration train with a total batch size of 64 (batch_size=1, num_processes=8, gradient_accumulate_every=8)? @bloc97

bloc97 commented 2 months ago

> It is 8x80GB for 64k context size

> Can this configuration train with a total batch size of 64 (batch_size=1, num_processes=8, gradient_accumulate_every=8)? @bloc97

Yes, and if you enable more modern attention partitioning schemes like RingAttention you can even do longer context.
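
For context, the effective batch size is just the product of the three knobs discussed above; a minimal single-node sketch (the flag spellings mirror the thread and may differ from finetune.py's exact argument names):

```bash
# Effective batch = batch_size x num_processes x gradient_accumulate_every
#                 = 1 x 8 x 8 = 64 sequences per optimizer step.
accelerate launch --num_processes 8 finetune.py \
  --batch-size 1 \
  --gradient-accumulate-every 8
```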

YL-9 commented 1 month ago

> It is 8x80GB for 64k context size

> Can this configuration train with a total batch size of 64 (batch_size=1, num_processes=8, gradient_accumulate_every=8)? @bloc97

> Yes, and if you enable more modern attention partitioning schemes like RingAttention you can even do longer context.

OK, thank you! But did you use --deepspeed, or other methods to reduce GPU memory usage? I used the default settings of train.sh on 4x A100 GPUs, with batch_size=1, num_processes=4, and gradient_accumulate_every=8, and this setup results in an OOM error. Could you share the detailed configuration? Thank you so much. @bloc97
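
For anyone hitting the same OOM, here is a minimal sketch of a ZeRO-3 + CPU-offload DeepSpeed setup, assuming the repo's Accelerate-based finetune.py works with a standard DeepSpeed config file. The JSON keys and the Accelerate launcher flags are standard DeepSpeed/Accelerate options, but whether they match this repo's --deepspeed path, and the script flag spellings, are assumptions.

```bash
# Hypothetical ZeRO-3 config with optimizer/parameter offload to CPU,
# which trades host RAM and speed for lower GPU memory.
cat > ds_config.json <<'EOF'
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0
}
EOF

# Point the Accelerate launcher at the config (standard launcher flags);
# the finetune.py flags below are the assumed spellings from this thread.
accelerate launch --num_processes 4 \
  --use_deepspeed --deepspeed_config_file ds_config.json \
  finetune.py --batch-size 1 --gradient-accumulate-every 8
```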