Memory usage before OOM (during training):
Error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.47 GiB. GPU 0 has a total capacty of 14.74 GiB of which 202.12 MiB is free. Process 16375 has 14.54 GiB memory in use. Of the allocated memory 11.94 GiB is allocated by PyTorch, and 2.37 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Reminder
System Info
Platform: Kaggle 2xT4
llamafactory version: 0.8.4.dev0

Reproduction
Config:
Run
!llamafactory-cli train <path-to-config-file>
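The config file itself is not included in this report; for context, a minimal DeepSpeed ZeRO-2 config with CPU optimizer offload (consistent with "ZeRO 2 across 2 T4 GPUs and a CPU" — every value here is illustrative, not the reporter's actual settings) typically looks like:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "fp16": {
    "enabled": "auto"
  }
}
```

The `"auto"` values are filled in by the Hugging Face/DeepSpeed integration from the trainer arguments at launch time.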
Expected behavior
I expected the memory usage to be low, since I'm loading the model in 4-bit and training it with QLoRA using DeepSpeed ZeRO-2 across 2 T4 GPUs and the CPU.
Others
It took around 12 GB (6 GB per GPU) just to load the model in 4-bit!
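A possible explanation for the per-GPU number: DeepSpeed ZeRO-2 shards optimizer states and gradients, but not the parameters themselves, so each GPU holds a full copy of the quantized weights. A rough back-of-the-envelope sketch (the model size is not stated in this report; 13B is purely an illustrative assumption):

```python
def quantized_weights_gib(n_params_billion: float, bits: int = 4) -> float:
    """Rough size of just the quantized weights in GiB.

    Ignores activations, optimizer state, and quantization overhead
    (e.g. layers kept in fp16 by the quantizer).
    """
    return n_params_billion * 1e9 * bits / 8 / 2**30

# A hypothetical 13B model in 4-bit is about 6 GiB per full copy,
# which would match the ~6 GB/GPU observed here under ZeRO-2.
print(round(quantized_weights_gib(13), 2))
```

Only ZeRO-3 (or FSDP with full sharding) splits the parameters across ranks, and sharding a pre-quantized bitsandbytes model is not straightforward in any case.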
Memory usage before OOM (during training): see the error above.
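As the traceback itself suggests, allocator fragmentation can sometimes be worked around by setting `PYTORCH_CUDA_ALLOC_CONF` before launch. A minimal sketch (the value 128 is an illustrative starting point, not taken from the log):

```shell
# Cap the caching allocator's split size to reduce fragmentation, as
# hinted by the OOM message, then launch training from the same shell.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Note this only mitigates fragmentation; it does not reduce the ~12 GB baseline reported above.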
I also tried with FSDP:
Config file: