HireTheHero opened this issue 10 months ago
Have you solved this problem?
Are you asking whether I've solved this problem? No, I haven't. I also tried 4×V100, but that didn't work either.
It looks like you are training on an A100 40GB? If that's the case, you need to reduce the per_device_train_batch_size, e.g.:
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
To keep the global batch size at 128, you will have to update the gradient_accumulation_steps as well:
GLOBAL_BATCH_SIZE = NUM_GPUS * PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
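For a single GPU this works out to 1 * 8 * 16 = 128. As a minimal sketch of the adjusted launch, loosely following the conventions of LLaVA's training scripts (the script path, DeepSpeed config, and omitted flags below are illustrative placeholders, not the OP's actual command):

# Hypothetical single-GPU QLoRA launch; only the two batch-size flags
# reflect the suggestion above, everything else is a placeholder.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --lora_enable True \
    --bits 4 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 16
# (model/data paths and the remaining training flags omitted for brevity)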
Describe the issue
Issue:
After a few tens of batches, an OOM error occurs when I try to fine-tune LLaVA 1.5 on a single A100 with QLoRA and CPU offloading (see the config sketch at the end of this report).
Command:
Log:
Screenshots:
N/A
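For context on the setup described above: with DeepSpeed, "CPU offloading" usually means a ZeRO-3 config along the lines of the sketch below. This is an assumption based on the repository's bundled scripts/zero3_offload.json, not the OP's actual file. Note that offloading trades GPU memory for host RAM and PCIe traffic, so an OOM after tens of steps can still occur if activations or the per-device batch are too large.

# Sketch of a typical ZeRO-3 CPU-offload config (assumed, for illustration).
cat > zero3_offload_sketch.json <<'EOF'
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF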