Hi Haotian,
I hit an OOM when running "finetune.sh" from scripts/v1_5. I used a single node with A100-40G x8, without NVLink, to fine-tune the 7B LLaVA-1.5.
With the default settings, the estimated training time is ~24 hours. The slowdown is understandable (compared with ~10 hours on A100-40G x8 with NVLink), but after roughly 100 steps the OOM occurred.
The warning below pops up frequently during training:

pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
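In case it helps with debugging, here is a minimal sketch of the workaround the warning suggests, assuming the standard HuggingFace Trainer setup used by LLaVA (the callback class name and the insertion point are just my illustration, not something from the repo):

```python
# Sketch (my assumption, not tested on this repo): flush the CUDA allocator
# cache on every rank at the end of each step, as the DeepSpeed warning
# suggests, via a HuggingFace TrainerCallback.
from deepspeed.accelerator import get_accelerator
from transformers import TrainerCallback


class EmptyCacheCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        # Ask the active accelerator (CUDA here) to release cached blocks so
        # that all ranks flush their caches at the same point in the step.
        get_accelerator().empty_cache()
        return control


# e.g. in llava/train/train.py, after the trainer is constructed:
# trainer.add_callback(EmptyCacheCallback())
```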
Since fine-tuning the 7B model on A100-40G x8 with NVLink worked on your machine, I wonder whether the missing NVLink is causing this warning and the OOM; for example, the allocator cache might be drained more quickly over NVLink than over PCIe, which would reduce the memory pressure.
Have you tested fine-tuning the 7B model on a machine with A100-40G x8 but without NVLink?
The training script is attached below.
Thanks,
Zilun