RickMeow opened 10 months ago
Yes, 3*8*A100 (40G) is enough for fine-tuning llama-2-70B
Thank you for your efficient and enthusiastic answers!
I've tried two different commands and I'm still getting OOM. Is there something wrong with my configuration, or does the way I'm using it need to be improved?
System info
The commands (passwordless SSH access is possible between all machines)
(1) Command 1
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=3 \
--node_rank=0 \
--master_addr=192.168.0.6 \
--master_port=9901 \
finetune_llama_with_qlora.py
# --node_rank is set to 0, 1, or 2 on each of the three nodes
(2) Command 2
WORLD_SIZE=24 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun \
--nproc_per_node=8 \
--master_addr=192.168.0.6 \
--master_port=9901 \
finetune_llama_with_qlora.py
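For reference, torchrun normally derives the world size from --nnodes × --nproc_per_node rather than from a manually exported WORLD_SIZE, so Command 2 as written may only start a single-node, 8-process job. A sketch of the multi-node form using only standard torchrun flags (same rendezvous values as above):
torchrun \
--nnodes=3 \
--nproc_per_node=8 \
--node_rank=0 \
--master_addr=192.168.0.6 \
--master_port=9901 \
finetune_llama_with_qlora.py
# run this on every node, with --node_rank set to 0, 1, or 2 respectively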
The code
I used finetune_llama_with_qlora.py from the "Efficient-Tuning-LLMs/examples/finetune_llm" folder for the fine-tuning process, adding the
ddp_find_unused_parameters=False
setting to assist DDP. The changes to the original code are as follows:
from typing import Dict
import torch
import transformers
……
if __name__ == '__main__':
    # Set model
    model_id = '/mnt/model/Llama-2-70b-hf'
    ……
    # Set data
    data = load_dataset('/mnt/git/Efficient-Tuning-LLMs/examples/finetune_llm/data')
    data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)
    ……
    trainer = Trainer(
        model=model,
        train_dataset=data['train'],
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=1,
            warmup_steps=2,
            max_steps=1000,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            # Set output
            output_dir='/mnt/output/70B/',
            optim='paged_adamw_8bit',
            # add ddp setting
            ddp_find_unused_parameters=False,
        ),
        ……
    )
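The traceback below shows every local rank quantizing checkpoint shards onto GPU 0 (each process reports roughly 4 GiB already allocated and only ~95 MiB free on GPU 0), which suggests the model is being loaded without a per-rank device map. A minimal sketch of the loading pattern commonly used for QLoRA + DDP; the LOCAL_RANK handling and the BitsAndBytesConfig values here are illustrative assumptions, not the repository's actual code:

import os
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Each DDP worker loads and quantizes the model onto its own GPU;
# without this, all 8 local processes pile onto cuda:0 and OOM at load time.
local_rank = int(os.environ.get('LOCAL_RANK', 0))

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    '/mnt/model/Llama-2-70b-hf',
    quantization_config=bnb_config,
    device_map={'': local_rank},  # pin the whole model to this rank's GPU
)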
The Terminal output
(Qlora) root@gzyd29:/mnt/git/Efficient-Tuning-LLMs/examples/finetune_llm# WORLD_SIZE=24 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun \
> --nproc_per_node=8 \
> --master_addr=192.168.0.6 \
> --master_port=9901 \
> finetune_llama_with_qlora.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-09-02 16:18:07,265] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,265] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,265] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,307] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,325] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,464] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,464] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-02 16:18:07,516] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /mnt/anaconda/envs/Qlora did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
[the same bitsandbytes BUG REPORT / CUDA SETUP block is printed once per local rank, 8 times in total]
Loading checkpoint shards:   7%|█▏        | 1/15 [00:13<03:03, 13.09s/it]
Traceback (most recent call last):
File "/mnt/git/Efficient-Tuning-LLMs/examples/finetune_llm/finetune_llama_with_qlora.py", line 72, in <module>
model = AutoModelForCausalLM.from_pretrained(
File "/mnt/git/transformers/src/transformers/models/auto/auto_factory.py", line 555, in from_pretrained
return model_class.from_pretrained(
File "/mnt/git/transformers/src/transformers/modeling_utils.py", line 3175, in from_pretrained
) = cls._load_pretrained_model(
File "/mnt/git/transformers/src/transformers/modeling_utils.py", line 3563, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/mnt/git/transformers/src/transformers/modeling_utils.py", line 753, in _load_state_dict_into_meta_model
set_module_quantized_tensor_to_device(
File "/mnt/git/transformers/src/transformers/integrations/bitsandbytes.py", line 98, in set_module_quantized_tensor_to_device
new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(device)
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 176, in to
return self.cuda(device)
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 154, in cuda
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics, quant_type=self.quant_type)
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/bitsandbytes/functional.py", line 760, in quantize_4bit
out = torch.zeros(((n+1)//2, 1), dtype=torch.uint8, device=A.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 39.45 GiB total capacity; 4.00 GiB already allocated; 95.25 MiB free; 4.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Loading checkpoint shards:   7%|█▏        | 1/15 [00:12<03:01, 12.97s/it]
[the remaining seven local ranks fail with the same traceback, each raising torch.cuda.OutOfMemoryError on GPU 0 ("Tried to allocate 112.00 MiB ... 95.25 MiB free"); their interleaved output is omitted here]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2258899) of binary: /mnt/anaconda/envs/Qlora/bin/python
Traceback (most recent call last):
File "/mnt/anaconda/envs/Qlora/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/anaconda/envs/Qlora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune_llama_with_qlora.py FAILED
------------------------------------------------------------
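The error message itself only suggests tuning the allocator, e.g.:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

but with only ~95 MiB reported free on GPU 0, fragmentation seems unlikely to be the real problem.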
Appreciate your great work!
Is it possible to fine-tune llama-2-70B on a 3*8*A100 (40G) configuration? Thanks!