RickMeow opened 10 months ago
Yes, 3*8*A100 (40G) is enough for fine-tuning llama-2-70B
Thank you for your efficient and enthusiastic answers!
I've tried two different commands and I'm still getting OOM. Is there something wrong with my configuration, or does the way I'm using it need to be improved?
System info
The commands (passwordless SSH access is possible between all machines)
(1) Command 1
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=3 \
--node_rank=0 \
--master_addr=192.168.0.6 \
--master_port=9901 \
finetune_llama_with_qlora.py
# --node_rank is set to 0, 1, or 2 on each of the three nodes
(2) Command 2
WORLD_SIZE=24 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun \
--nproc_per_node=8 \
--master_addr=192.168.0.6 \
--master_port=9901 \
finetune_llama_with_qlora.py
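For reference, torchrun normally derives the world size from --nnodes × --nproc_per_node rather than from a manually exported WORLD_SIZE, so Command 2 as written may only start a single-node, 8-process job. A sketch of the multi-node form using only standard torchrun flags (same rendezvous values as above):
torchrun \
--nnodes=3 \
--nproc_per_node=8 \
--node_rank=0 \
--master_addr=192.168.0.6 \
--master_port=9901 \
finetune_llama_with_qlora.py
# run this on every node, with --node_rank set to 0, 1, or 2 respectively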
The code
I used finetune_llama_with_qlora.py from the "Efficient-Tuning-LLMs/examples/finetune_llm" folder for the fine-tuning process, adding the
ddp_find_unused_parameters=False
setting to assist DDP. The changes to the original code are as follows:
from typing import Dict
import torch
import transformers
……
if __name__ == '__main__':
    # Set model
    model_id = '/mnt/model/Llama-2-70b-hf'
    ……
    # Set data
    data = load_dataset('/mnt/git/Efficient-Tuning-LLMs/examples/finetune_llm/data')
    data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)
    ……
    trainer = Trainer(
        model=model,
        train_dataset=data['train'],
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=1,
            warmup_steps=2,
            max_steps=1000,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            # Set output
            output_dir='/mnt/output/70B/',
            optim='paged_adamw_8bit',
            # add ddp setting
            ddp_find_unused_parameters=False,
        ),
        ……
    )
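The traceback below shows every local rank quantizing checkpoint shards onto GPU 0 (each process reports roughly 4 GiB already allocated and only ~95 MiB free on GPU 0), which suggests the model is being loaded without a per-rank device map. A minimal sketch of the loading pattern commonly used for QLoRA + DDP; the LOCAL_RANK handling and the BitsAndBytesConfig values here are illustrative assumptions, not the repository's actual code:

import os
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Each DDP worker loads and quantizes the model onto its own GPU;
# without this, all 8 local processes pile onto cuda:0 and OOM at load time.
local_rank = int(os.environ.get('LOCAL_RANK', 0))

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    '/mnt/model/Llama-2-70b-hf',
    quantization_config=bnb_config,
    device_map={'': local_rank},  # pin the whole model to this rank's GPU
)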
The Terminal output
(Qlora) root@gzyd29:/mnt/git/Efficient-Tuning-LLMs/examples/finetune_llm# WORLD_SIZE=24 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun \
> --nproc_per_node=8 \
> --master_addr=192.168.0.6 \
> --master_port=9901 \
> finetune_llama_with_qlora.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-09-02 16:18:07,265] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,265] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,265] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,307] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,325] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,464] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,464] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,516] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /mnt/anaconda/envs/Qlora did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
[the same bitsandbytes BUG REPORT / CUDA SETUP block is printed once per local rank, 8 times in total]
Loading checkpoint shards:   7%|█▏        | 1/15 [00:13<03:03, 13.09s/it]
Traceback (most recent call last):
File "/mnt/git/Efficient-Tuning-LLMs/examples/finetune_llm/finetune_llama_with_qlora.py", line 72, in <module>
model = AutoModelForCausalLM.from_pretrained(
File "/mnt/git/transformers/src/transformers/models/auto/auto_factory.py", line 555, in from_pretrained
return model_class.from_pretrained(
File "/mnt/git/transformers/src/transformers/modeling_utils.py", line 3175, in from_pretrained
) = cls._load_pretrained_model(
File "/mnt/git/transformers/src/transformers/modeling_utils.py", line 3563, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/mnt/git/transformers/src/transformers/modeling_utils.py", line 753, in _load_state_dict_into_meta_model
set_module_quantized_tensor_to_device(
File "/mnt/git/transformers/src/transformers/integrations/bitsandbytes.py", line 98, in set_module_quantized_tensor_to_device
new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(device)
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 176, in to
return self.cuda(device)
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 154, in cuda
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics, quant_type=self.quant_type)
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/functional.py", line 760, in quantize_4bit
out = torch.zeros(((n+1)//2, 1), dtype=torch.uint8, device=A.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 39.45 GiB total capacity; 4.00 GiB already allocated; 95.25 MiB free; 4.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Loading checkpoint shards:   7%|█▏        | 1/15 [00:12<03:01, 12.97s/it]
[the remaining seven local ranks fail with the same traceback, each raising torch.cuda.OutOfMemoryError on GPU 0 ("Tried to allocate 112.00 MiB ... 95.25 MiB free"); their interleaved output is omitted here]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2258899) of binary: /mnt/anaconda/envs/Qlora/bin/python
Traceback (most recent call last):
File "/mnt/anaconda/envs/Qlora/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune_llama_with_qlora.py FAILED
------------------------------------------------------------
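The error message itself only suggests tuning the allocator, e.g.:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

but with only ~95 MiB reported free on GPU 0, fragmentation seems unlikely to be the real problem.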
Appreciate your great work!
Is it possible to fine-tune llama-2-70B on a 3*8*A100 (40G) configuration? Thanks!