meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods, covering single- and multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A, and a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama3 for WhatsApp & Messenger.

Getting "Killed" when trying to finetune the model #511

Open Tizzzzy opened 2 weeks ago

Tizzzzy commented 2 weeks ago

System Info

  1. python: 3.10.12
  2. nvcc:
    nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2024 NVIDIA Corporation
    Built on Thu_Mar_28_02:18:24_PDT_2024
    Cuda compilation tools, release 12.4, V12.4.131
    Build cuda_12.4.r12.4/compiler.34097967_0
  3. peft: 0.10.0

All other packages are the same as in requirements.txt.

My local machine has 32 GB of RAM. My GPU information: NVIDIA-SMI 550.54.15; Driver Version: 545.84; CUDA Version: 12.3; NVIDIA GeForce RTX 3070; 8192 MiB.

Information

🐛 Describe the bug

I am new to machine learning. I am trying to finetune Llama 3 on the Hugging Face dataset "openbookqa" on my local machine. I ran this command: python -m llama_recipes.finetuning --dataset "openbookqa" --custom_dataset.file "datasets/openbookqa_dataset.py" --batching_strategy "packing".

It seems like the code first downloads the llama3-8b model, but then my process gets "Killed" and the run stops. I don't know whether my RAM or my GPU memory ran out, and I don't know how to fix it (maybe I could download a quantized version of llama3-8b, but I don't know how).

Error logs

(llama3) root@Dong:/mnt/c/Users/super/OneDrive/Desktop/research/llama-recipes# python -m llama_recipes.finetuning --dataset "openbookqa" --custom_dataset.file "datasets/openbookqa_dataset.py" --batching_strategy "packing"
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 654/654 [00:00<00:00, 1.63MB/s]
model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████| 23.9k/23.9k [00:00<00:00, 46.9MB/s]
model-00001-of-00004.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.98G/4.98G [02:25<00:00, 34.2MB/s]
Downloading shards:  25%|█████████████████████▌                                                                | 1/4 [02:25<07:16, 145.59s/it]
model-00002-of-00004.safetensors: 100%|██████████████████████████████████████████████████████████████████| 5.00G/5.00G [02:19<00:00, 35.8MB/s]
model-00003-of-00004.safetensors: 100%|██████████████████████████████████████████████████████████████████| 4.92G/4.92G [02:15<00:00, 36.4MB/s]
model-00004-of-00004.safetensors: 100%|██████████████████████████████████████████████████████████████████| 1.17G/1.17G [00:31<00:00, 37.0MB/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████| 4/4 [07:32<00:00, 113.05s/it]
Loading checkpoint shards:  25%|████████████████████    3.24s/it]
Killed

Expected behavior

I expect to be able to finetune Llama 3.

wukaixingxp commented 2 weeks ago

Hi! The Meta Llama3 8B fp16 weights require at least 15GB of GPU memory, but I noticed your 3070 only has 8GB. You can try int8 quantization (--quantization) and the LoRA method (--use_peft --peft_method lora). Please check this document for more details: https://github.com/meta-llama/llama-recipes/blob/main/recipes/finetuning/singlegpu_finetuning.md#how-to-run-it. Let me know if you have any questions!

Tizzzzy commented 2 weeks ago

Hi, I changed my command to python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name meta-llama/Meta-Llama-3-8B --dataset "openbookqa" --custom_dataset.file "datasets/openbookqa_dataset.py" --batching_strategy "packing" and I am still getting an error. My transformers version is 4.40.2 and my bitsandbytes version is 0.43.1.

(llama3) root@Dong:/mnt/c/Users/super/OneDrive/Desktop/research/llama-recipes# python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name meta-llama/Meta-Llama-3-8B --dataset "openbookqa" --custom_dataset.file "datasets/openbookqa_dataset.py" --batching_strategy "packing"
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/llama_recipes/finetuning.py", line 289, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_recipes/finetuning.py", line 125, in main
    model = LlamaForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3627, in from_pretrained
    hf_quantizer.validate_environment(device_map=device_map)
  File "/usr/local/lib/python3.10/dist-packages/transformers/quantizers/quantizer_bnb_8bit.py", line 86, in validate_environment
    raise ValueError(
ValueError:
                    Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the
                    quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules
                    in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to
                    `from_pretrained`. Check
                    https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
                    for more details.

Do you know where it went wrong? How can I fix this problem?

wukaixingxp commented 2 weeks ago

Hi! I think this is still a GPU memory issue: from my previous test, the 8-bit model still requires at least 7.8GB of GPU memory, which is very close to your 8GB limit. Because we are hitting the GPU memory limit, when device_map is set to "auto" and --quantization is used, it tries to offload part of the model into CPU memory and runs into that error. You can either change load_in_8bit=True to load_in_4bit=True, which should reduce GPU memory usage to around 4GB, or follow this tutorial to create a quant_config with llm_int8_enable_fp32_cpu_offload=True and pass it into the LlamaForCausalLM.from_pretrained() function. I do not have a 3070 at hand to test the solutions above, so please let me know if you still encounter any problems.
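A rough sketch of both options, using the plain transformers/bitsandbytes API (the exact flag wiring inside llama_recipes/finetuning.py may differ, so treat this as illustrative rather than a drop-in patch):

    import torch
    from transformers import BitsAndBytesConfig, LlamaForCausalLM

    model_id = "meta-llama/Meta-Llama-3-8B"

    # Option A: 4-bit quantization, which should bring GPU usage down to roughly 4-5GB.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )

    # Option B: stay on int8 but allow overflow modules to be kept on the CPU in fp32.
    # bnb_config = BitsAndBytesConfig(
    #     load_in_8bit=True,
    #     llm_int8_enable_fp32_cpu_offload=True,
    # )

    model = LlamaForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # the HF quantization docs may also suggest a custom device_map when offloading
        torch_dtype=torch.float16,
    )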

HamidShojanazeri commented 2 weeks ago

@Tizzzzy can you please follow this recipe: https://github.com/meta-llama/llama-recipes/blob/main/recipes/finetuning/singlegpu_finetuning.md? You would need to call recipes/finetuning/finetuning.py; that, paired with quantization, should help you run. Please let us know if you are still seeing the issue.
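For reference, an invocation of that recipe with the quantization and LoRA flags might look like the following (assuming the same custom dataset file as above; the full set of supported flags is documented in the linked singlegpu_finetuning.md):

python recipes/finetuning/finetuning.py --use_peft --peft_method lora --quantization --model_name meta-llama/Meta-Llama-3-8B --dataset "openbookqa" --custom_dataset.file "datasets/openbookqa_dataset.py" --batching_strategy "packing"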