Closed by SeekPoint 1 year ago
It seems that your GPU's memory is insufficient. Can you tell me about your computer configuration?
RTX 3090 24GB
CPU mem is 32GB
Under the "examples" folder, I have added a minimal example for fine-tuning the llama7b model. Please feel free to try it again.
(base) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Guanaco/examples$ python3 finetune_llama_with_qlora.py --model_name_or_path /data-ssd-1t/hf_model/llama-7b-hf --data_path tatsu-lab/alpaca --output_dir work_dir_lora/ --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 100 --save_total_limit 5 --learning_rate 1e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --model_max_length 128 --logging_steps 1
===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
bin /home/ub2004/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/ub2004/anaconda3/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /home/ub2004/anaconda3 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/ub2004/anaconda3/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/ub2004/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 91%|██████████████████████████████████████████████████████████████████████████████████████████████████▏ | 30/33 [00:23<00:02, 1.27it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ub2004/llm_dev/Chinese-Guanaco/examples/finetune_llama_with_qlora.py:72 in accelerate
│
│ ❱ 698 │ │ │ set_module_tensor_to_device(model, param_name, param_device, set_module_kw │
│ 699 │ │ else: │
│ 700 │ │ │ if param.dtype == torch.int8 and param_name.replace("weight", "SCB") in stat │
│ 701 │ │ │ │ fp16_statistics = state_dict[param_name.replace("weight", "SCB")] │
│ │
│ /home/ub2004/anaconda3/lib/python3.10/site-packages/accelerate/utils/modeling.py:149 in │
│ set_module_tensor_to_device │
│ │
│ 146 │ │ if value is None: │
│ 147 │ │ │ new_value = old_value.to(device) │
│ 148 │ │ elif isinstance(value, torch.Tensor): │
│ ❱ 149 │ │ │ new_value = value.to(device) │
│ 150 │ │ else: │
│ 151 │ │ │ new_value = torch.tensor(value, device=device) │
│ 152 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.70 GiB total capacity; 23.04 GiB already allocated; 14.12 MiB free; 23.04 GiB reserved in
total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
(base) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Guanaco/examples$
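A rough way to sanity-check the out-of-memory error above is to estimate how much GPU memory the weights alone need at each precision. The sketch below is only back-of-the-envelope arithmetic: the parameter count (about 6.74e9 for LLaMA-7B) is an assumption, the 23.70 GiB capacity is taken from the error message, and activations, optimizer state, and CUDA overhead are ignored. Under those assumptions, fp32 weights alone would not fit on the card while fp16 or quantized weights would. That is consistent with (but does not prove) the weights being placed on the GPU at a higher precision than intended.

```python
# Weight-only memory estimate. Ignores activations, gradients, optimizer
# state, and CUDA context overhead, so real usage will be higher.
GIB = 1024 ** 3

LLAMA_7B_PARAMS = 6.74e9   # assumed approximate parameter count
GPU_CAPACITY_GIB = 23.70   # "total capacity" reported in the error message


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / GIB


for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    need = weight_memory_gib(LLAMA_7B_PARAMS, nbytes)
    verdict = "fits" if need < GPU_CAPACITY_GIB else "does not fit"
    print(f"{name:>5}: {need:6.2f} GiB -> {verdict}")
```

If the fp32 row is the only one that does not fit, it may be worth checking that the 8-bit or 4-bit loading path is actually active before tuning anything else.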
Try adding: --max_memory_MB 48000
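For context, here is a minimal sketch of what a flag like --max_memory_MB usually turns into. It assumes the training script forwards the value as a per-GPU cap through the `max_memory` argument of `from_pretrained`, as the reference QLoRA script does; whether this repository does exactly the same is an assumption. The helper name `build_max_memory` is hypothetical.

```python
def build_max_memory(max_memory_mb: int, n_gpus: int) -> dict:
    """Build a per-GPU memory cap keyed by device index.

    The {index: "<N>MB"} shape is the form accepted by the `max_memory`
    argument of `from_pretrained`; the helper itself is illustrative.
    """
    return {i: f"{max_memory_mb}MB" for i in range(n_gpus)}


# One 24 GB card with the suggested value, and with a tighter cap.
print(build_max_memory(48000, 1))
print(build_max_memory(22000, 1))
```

Note that a cap only limits how much the loader is allowed to place on each GPU. A value above the physical memory of the card (48000 MB on a 24 GB RTX 3090) cannot make more memory available, so a value somewhat below the real capacity may be the more useful setting here.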
I tried the minimal example you added under the "examples" folder, but I still get an error:
Traceback (most recent call last):
  File "/home/xdx/baichuan1/finetune/Efficient-Tuning-LLMs/baichuan7b_demo.py", line 23, in <module>
    main(load_in_8bit, model_path)
  File "/home/xdx/baichuan1/finetune/Efficient-Tuning-LLMs/baichuan7b_demo.py", line 8, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/xdx/miniconda3/envs/baichuan/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/xdx/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 658, in from_pretrained
    return super(BaichuanForCausalLM, cls).from_pretrained(pretrained_model_name_or_path, *model_args,
  File "/home/xdx/miniconda3/envs/baichuan/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2959, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/xdx/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 531, in __init__
    if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:
TypeError: 'BitsAndBytesConfig' object is not subscriptable
Here is the command I ran on Linux:
CUDA_VISIBLE_DEVICES=4,5 python baichuan7b_demo.py \
    --model_name_or_path ../../../../baichuan-inc/Baichuan2-7B-Chat \
    --dataset_cfg ./data/alpaca_zh_pcyn.yaml \
    --output_dir ../../../..oasst1-baichuan-7b \
    --num_train_epochs 4 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --save_strategy steps \
    --save_total_limit 5 \
    --save_steps 100 \
    --logging_strategy steps \
    --logging_steps 1 \
    --learning_rate 0.0002 \
    --warmup_ratio 0.03 \
    --weight_decay 0.0 \
    --lr_scheduler_type constant \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --max_new_tokens 32 \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --double_quant \
    --quant_type nf4 \
    --fp16 \
    --bits 4 \
    --gradient_checkpointing \
    --trust_remote_code \
    --do_train \
    --do_eval \
    --sample_generate \
    --data_seed 42 \
    --seed 0 \
    --max_memory_MB 48000
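The last frame of the TypeError traceback above is inside the model's own remote code (modeling_baichuan.py), which indexes `config.quantization_config` with `['load_in_4bit']` as if it were a dict, while the value it received is a `BitsAndBytesConfig` object. The sketch below shows one tolerant way to read that flag from either form. It is not the Baichuan authors' fix: `wants_4bit` is a hypothetical helper, and `FakeQuantConfig` is a stand-in used only so the example runs without transformers installed.

```python
from dataclasses import dataclass


@dataclass
class FakeQuantConfig:
    """Hypothetical stand-in for an object-style quantization config."""
    load_in_4bit: bool = False


def wants_4bit(quantization_config) -> bool:
    """Read `load_in_4bit` from a dict-style or object-style config."""
    if quantization_config is None:
        return False
    if isinstance(quantization_config, dict):
        return bool(quantization_config.get("load_in_4bit", False))
    # Object-style configs expose the flag as an attribute, not a key.
    return bool(getattr(quantization_config, "load_in_4bit", False))


print(wants_4bit({"load_in_4bit": True}))                # dict form
print(wants_4bit(FakeQuantConfig(load_in_4bit=True)))    # object form
```

Since the failing line lives in the cached remote code rather than in the training script, a mismatch between the installed transformers version and the version the model code was written for is one possible cause worth checking.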
(gh_Chinese-Guanaco) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Guanaco$ python3 qlora_int8_finetune.py --model_name_or_path /data-ssd-1t/hf_model/llama-7b-hf --data_path tatsu-lab/alpaca --output_dir work_dir_lora/ --num_train_epochs 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 500 --save_total_limit 5 --learning_rate 1e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --model_max_length 2048 --logging_steps 1
[2023-06-11 00:48:41,928] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA SETUP: Loading binary /home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Traceback (most recent call last):
  File "/home/ub2004/llm_dev/Chinese-Guanaco/qlora_int8_finetune.py", line 338, in <module>
    train(load_in_8bit=True)
  File "/home/ub2004/llm_dev/Chinese-Guanaco/qlora_int8_finetune.py", line 234, in train
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2819, in from_pretrained
    raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
`device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
(gh_Chinese-Guanaco) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Guanaco$
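The ValueError above asks for a custom `device_map` together with `load_in_8bit_fp32_cpu_offload=True`. As a minimal sketch of what such a map could look like, the helper below keeps the first few decoder layers on the GPU and sends the rest to the CPU. The module names (`model.embed_tokens`, `model.layers.N`, `model.norm`, `lm_head`) and the count of 32 layers assume the usual Hugging Face LLaMA-7B layout, and `llama_device_map` is a hypothetical helper, not part of any library.

```python
def llama_device_map(num_layers: int, gpu_layers: int, gpu: int = 0) -> dict:
    """Map the first `gpu_layers` decoder layers to `gpu`, the rest to CPU.

    Module names assume the standard Hugging Face LLaMA layout.
    """
    if not 0 <= gpu_layers <= num_layers:
        raise ValueError("gpu_layers must be between 0 and num_layers")
    device_map = {"model.embed_tokens": gpu}
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = gpu if i < gpu_layers else "cpu"
    # Keep the final norm and the output head on the GPU.
    device_map["model.norm"] = gpu
    device_map["lm_head"] = gpu
    return device_map


# Assumed 32 decoder layers for LLaMA-7B; 24 on GPU 0, 8 offloaded to CPU.
device_map = llama_device_map(num_layers=32, gpu_layers=24)
print(device_map["model.layers.23"], device_map["model.layers.24"])
# The map would then be passed as device_map=device_map to from_pretrained,
# together with the offload option named in the error message.
```

Offloading may not be the whole story in this run: the log also shows bitsandbytes loading libbitsandbytes_cpu.so and warning that it was compiled without GPU support, so the 8-bit GPU path may be unavailable in this environment regardless of the device map.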