Open Tizzzzy opened 2 weeks ago
Hi! The Meta Llama 3 8B fp16 weights require at least 15GB of GPU memory, but I noticed that your 3070 has only 8GB. You can try int8 quantization (--quantization) together with the LoRA method (--use_peft --peft_method lora). Please check this document for more details: https://github.com/meta-llama/llama-recipes/blob/main/recipes/finetuning/singlegpu_finetuning.md#how-to-run-it. Let me know if you have any questions!
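The 15GB figure follows from parameter count times bytes per parameter. A rough sketch, assuming about 8.03 billion parameters for Llama 3 8B (an assumption, not stated in this thread) and counting weights only; activations, LoRA adapters, optimizer state, and the CUDA context all add more on top:

```python
# Rough weight-only memory estimate; real GPU usage will be higher.
PARAMS = 8.03e9  # assumed parameter count for Meta-Llama-3-8B

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(n_params: float, dtype: str) -> float:
    """Return the size of the weights in GiB for the given precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_gib(PARAMS, dtype):.1f} GiB")
```

Under these assumptions fp16 lands near 15 GiB, int8 near 7.5 GiB, and int4 under 4 GiB, which is why an 8GB card is tight even with 8-bit quantization.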
Hi, I changed my command to this:
python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name meta-llama/Meta-Llama-3-8B --dataset "openbookqa" --custom_dataset.file "datasets/openbookqa_dataset.py" --batching_strategy "packing"
and I am still getting the error. My transformers version is 4.40.2 and my bitsandbytes version is 0.43.1.
(llama3) root@Dong:/mnt/c/Users/super/OneDrive/Desktop/research/llama-recipes# python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name meta-llama/Meta-Llama-3-8B --dataset "openbookqa" --custom_dataset.file "datasets/openbookqa_dataset.py" --batching_strategy "packing"
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/llama_recipes/finetuning.py", line 289, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_recipes/finetuning.py", line 125, in main
    model = LlamaForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3627, in from_pretrained
    hf_quantizer.validate_environment(device_map=device_map)
  File "/usr/local/lib/python3.10/dist-packages/transformers/quantizers/quantizer_bnb_8bit.py", line 86, in validate_environment
    raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the
quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules
in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to
`from_pretrained`. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
Do you know where it went wrong? How can I fix this problem?
Hi! I think this is still a GPU memory issue. In my previous tests, the 8-bit model still required at least 7.8GB of GPU memory, which is very close to your 8GB limit. Because we are hitting that limit, setting device_map to "auto" together with --quantization makes the loader try to offload part of the model to CPU memory, and the 8-bit quantizer rejects that placement with the ValueError shown above. You can either change load_in_8bit=True to load_in_4bit=True, which should reduce GPU memory usage to roughly 4GB, or follow this tutorial to create a quant_config that has llm_int8_enable_fp32_cpu_offload=True and pass it into the LlamaForCausalLM.from_pretrained() function. I do not have a 3070 on hand to test these solutions, so please let me know if you still encounter any problems.
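The two options can be sketched as below. This is an untested sketch, not the llama-recipes code: the helper names (bnb_config_kwargs, load_quantized) and the device map are hypothetical, the nf4 quantization type is an added assumption, and only BitsAndBytesConfig, its arguments, and from_pretrained come from transformers. Whether training works with part of the model offloaded to the CPU is also not verified here.

```python
from typing import Any, Dict


def bnb_config_kwargs(mode: str) -> Dict[str, Any]:
    """Keyword arguments for transformers.BitsAndBytesConfig.

    mode="4bit": quantize weights to 4 bits (smallest GPU footprint).
    mode="8bit_offload": stay in 8-bit, but allow modules placed on the
    CPU to be kept in fp32 instead of raising the ValueError.
    """
    if mode == "4bit":
        return {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}
    if mode == "8bit_offload":
        return {"load_in_8bit": True, "llm_int8_enable_fp32_cpu_offload": True}
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical device map for the offload case: the transformer body on
# GPU 0, the output head on the CPU (where it stays in fp32).
OFFLOAD_DEVICE_MAP = {"model": 0, "lm_head": "cpu"}


def load_quantized(model_name: str, mode: str = "4bit"):
    """Load a Llama model with the chosen quantization config."""
    # Imported lazily so the helpers above work without the GPU stack.
    from transformers import BitsAndBytesConfig, LlamaForCausalLM

    device_map = OFFLOAD_DEVICE_MAP if mode == "8bit_offload" else "auto"
    return LlamaForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(**bnb_config_kwargs(mode)),
        device_map=device_map,
    )
```

For example, load_quantized("meta-llama/Meta-Llama-3-8B", mode="4bit") would try the 4-bit path. Passing a BitsAndBytesConfig through quantization_config is also what the deprecation warning in the log asks for in place of the bare load_in_8bit argument.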
@Tizzzzy can you please follow this recipe: https://github.com/meta-llama/llama-recipes/blob/main/recipes/finetuning/singlegpu_finetuning.md? You would need to call recipes/finetuning/finetuning.py. This, paired with quantization, should help you run it. Please let us know if you still see the issue.
System Info
All other packages are the same as in requirements.txt.
My local machine has 32GB of RAM. My GPU information:
NVIDIA-SMI 550.54.15; Driver Version: 545.84; CUDA Version: 12.3; NVIDIA GeForce RTX 3070; 8192MiB
Information
🐛 Describe the bug
I am new to machine learning. I am trying to finetune Llama 3 on the Hugging Face dataset "openbookqa" on my local machine. I used this command to run it:
python -m llama_recipes.finetuning --dataset "openbookqa" --custom_dataset.file "datasets/openbookqa_dataset.py" --batching_strategy "packing"
It seems like the code first downloads the llama3-8b model, but during the download my command line prints "Killed" and the download stops. I don't know whether my RAM or my GPU is running out of memory, and I don't know how to fix it (maybe I can download a quantized version of llama3-8b? But I don't know how).
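On Linux, a bare "Killed" with no Python traceback usually means the kernel's out-of-memory killer stopped the process, which points at host RAM rather than GPU memory; a GPU shortage would normally surface as a CUDA out-of-memory exception instead. A minimal sketch for checking how much RAM the environment actually sees, assuming /proc/meminfo is available (the helper name meminfo_gib is hypothetical):

```python
def meminfo_gib(text: str) -> dict:
    """Parse /proc/meminfo text into {field: GiB}; sizes are listed in kB."""
    out = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Skip fields without a kB unit (for example HugePages counters).
        if len(parts) == 2 and parts[1] == "kB":
            out[name.strip()] = int(parts[0]) / 2**20  # kB -> GiB
    return out


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = meminfo_gib(f.read())
    print(f"MemTotal:     {info['MemTotal']:.1f} GiB")
    print(f"MemAvailable: {info['MemAvailable']:.1f} GiB")
```

If MemTotal is well below the machine's 32GB, the environment itself may be capping memory, and loading the fp16 checkpoint could exhaust it before the GPU is ever involved.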
Error logs
Expected behavior
I expect to be able to finetune Llama 3.