Describe the bug
Currently trying to train a LoRA with Llama-30b on 2 A100 40GB GPUs (+170GB of CPU RAM). I have set the model's parameters to use both GPUs, with CPU offload for overflow, and I am loading in 8-bit. However, it is currently only using one GPU, and it OOMs even with the smallest parameters (batch_size=1, LoRA rank=4, optimizer=SGD). The OOM happens during validation and saving.
Any idea why the load does not get distributed to the second GPU as well? Or any suggestions on how to reduce the load further when it stays on a single GPU?
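For context, this is roughly the kind of multi-GPU 8-bit load I am attempting, written as a plain transformers call (a minimal sketch only; the max_memory limits below are illustrative, not my exact settings):

from transformers import AutoModelForCausalLM

# Sketch of an 8-bit load spread across both GPUs with CPU offload for overflow.
# The memory limits here are illustrative assumptions, not the exact values I use.
model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-30b-hf",
    load_in_8bit=True,
    device_map="auto",
    max_memory={0: "38GiB", 1: "38GiB", "cpu": "160GiB"},
)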
Thank you!
Is there an existing issue for this?
[X] I have searched the existing issues
Reproduction
Load decapoda-research/llama-30b-hf
Start LoRA training using the alpaca-chat format with batch size=1, rank=4, optimizer=SGD (a rough sketch of the equivalent setup is shown below)
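For reference, a minimal sketch of what I understand that configuration to amount to in plain peft/transformers code. My actual run goes through the UI's training tab, so everything other than the rank, batch size, and optimizer is an illustrative assumption:

from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import Trainer, TrainingArguments

# `model` is the 8-bit model from the load sketch above; `train_dataset` is the
# tokenized alpaca-chat data (prepared elsewhere). Both are assumed here.
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    r=4,                                  # LoRA rank 4, as in the settings above
    lora_alpha=8,                         # illustrative
    target_modules=["q_proj", "v_proj"],  # illustrative target modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
lora_model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="loras/llama-30b-test",  # illustrative path
    per_device_train_batch_size=1,      # batch size 1
    optim="sgd",                        # plain SGD optimizer
    logging_steps=1,
    num_train_epochs=3,                 # illustrative
)
Trainer(model=lora_model, args=training_args, train_dataset=train_dataset).train()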
Screenshot
No response
Logs
INFO:Loading JSON datasets...
WARNING:Found cached dataset json (/root/.cache/huggingface/datasets/json/default-bf1f0442e53bdaf7/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 196.60it/s]
INFO:Getting model ready...
INFO:Prepping for training...
INFO:Creating LoRA model...
INFO:Starting training...
{'loss': 3.4991, 'learning_rate': 1.4999999999999999e-05, 'epoch': 0.0}
{'loss': 3.438, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.0}
Exception in thread Thread-9 (threaded_run):
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/modules/training.py", line 413, in threaded_run
trainer.train()
File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1918, in _inner_training_loop
self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer_callback.py", line 369, in on_step_begin
return self.call_event("on_step_begin", args, state, control)
File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer_callback.py", line 397, in call_event
result = getattr(callback, event)(
File "/modules/training.py", line 360, in on_step_begin
lora_model.save_pretrained(f"{lora_file_path}/checkpoint-{tracked.current_steps}/")
File "/root/miniconda3/lib/python3.10/site-packages/peft/peft_model.py", line 125, in save_pretrained
output_state_dict = get_peft_model_state_dict(
File "/root/miniconda3/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 32, in get_peft_model_state_dict
state_dict = model.state_dict()
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1445, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/root/miniconda3/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/root/miniconda3/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 100, in undo_layout
return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 39.56 GiB total capacity; 36.26 GiB already allocated; 4.56 MiB free; 37.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:Training complete, saving...
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 412, in run_predict
output = await app.get_blocks().process_api(
File "/root/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1299, in process_api
result = await self.call_function(
File "/root/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1035, in call_function
prediction = await anyio.to_thread.run_sync(
File "/root/miniconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/root/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/root/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/root/miniconda3/lib/python3.10/site-packages/gradio/utils.py", line 491, in async_iteration
return next(iterator)
File "/modules/training.py", line 449, in do_train
lora_model.save_pretrained(lora_file_path)
File "/root/miniconda3/lib/python3.10/site-packages/peft/peft_model.py", line 125, in save_pretrained
output_state_dict = get_peft_model_state_dict(
File "/root/miniconda3/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 32, in get_peft_model_state_dict
state_dict = model.state_dict()
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1445, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/root/miniconda3/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/root/miniconda3/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 100, in undo_layout
return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 39.56 GiB total capacity; 36.06 GiB already allocated; 4.56 MiB free; 37.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
System Info
2 x NVIDIA A100 40GB GPUs + 170GB of CPU RAM on Google Cloud
Running in Docker with CUDA 12.0
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:00:04.0 Off | 0 |
| N/A 34C P0 60W / 400W | 35109MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:00:05.0 Off | 0 |
| N/A 31C P0 54W / 400W | 3MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+