We were able to train the 7b 64k model on an 8x A100 node -- all other models unfortunately require a multinode setup. We used 64 GPUs, but I expect 16 would suffice for all other models (7b 128k, 13b 64k, 13b 128k)
Thanks a lot. To confirm, are the A100s 40GB or 80GB for the 7b 64k fine-tuning?
It is 8x80GB for 64k context size
Could you please clarify whether this discussion is about full-parameter tuning or LoRA-based tuning? @bloc97
I ran finetune.py on 2x A100 GPUs, and both initially loaded up to 14GB/80GB. After processing the first batch, memory usage rose to 77GB/80GB, and it hit OOM at the start of the second batch. Is this normal?
Can this configuration train with a total batch size of 64 (batch_size=1, num_processes=8, gradient_accumulate_every=8)? @bloc97
Yes, and if you enable more modern attention-partitioning schemes like RingAttention, you can train with even longer contexts.
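For reference, a minimal sketch of how those three knobs multiply into the total batch size of 64. The finetune.py flag names are assumed from the variable names in the question above and may not match the actual script:

```bash
# Hypothetical launch: per-GPU batch x num_processes x accumulation = 1 x 8 x 8 = 64.
accelerate launch --num_processes 8 finetune.py \
    --batch-size 1 \
    --gradient-accumulate-every 8
```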
OK, thank you! But did you use --deepspeed or other methods to reduce GPU memory usage? I used the default settings of train.sh on 4x A100 GPUs, with batch_size=1, num_processes=4, and gradient_accumulate_every=8, and this setup results in OOM. Could you provide the detailed configuration? Thank you so much. @bloc97
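One way to cut per-GPU memory is DeepSpeed ZeRO via HF Accelerate's launcher. A minimal sketch, not the repo's actual train.sh: the accelerate launch options are real Accelerate CLI flags, while the finetune.py flags are the same hypothetical names as in the sketch above:

```bash
# ZeRO stage 3 shards optimizer states, gradients, and parameters across GPUs;
# CPU offload trades training speed for extra memory headroom.
accelerate launch \
    --num_processes 4 \
    --use_deepspeed \
    --zero_stage 3 \
    --offload_optimizer_device cpu \
    --offload_param_device cpu \
    finetune.py \
    --batch-size 1 \
    --gradient-accumulate-every 8
```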
I ran into an OOM error with the default setup on 8x A100 using the train.sh script. Could you please share the GPU requirements for fine-tuning?