meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama 3 with composable FSDP & PEFT methods, covering single- and multi-node GPUs. Supports default and custom datasets for applications such as summarization and Q&A. Also supports a number of inference solutions, such as HF TGI and vLLM, for local or cloud deployment, plus demo apps showcasing Meta Llama 3 for WhatsApp & Messenger.

Could not finetune llama 3 on multiple GPUs #556

Open 1155157110 opened 3 weeks ago

1155157110 commented 3 weeks ago

System Info

llama-recipes Version 0.0.2

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
(each GPU has 22 GiB of GPU memory)

Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Information

🐛 Describe the bug

I tried to fine-tune Llama 3 on my custom dataset. Training with a single GPU works as expected (except that more GPU memory would be needed to complete the fine-tuning):

CUDA_VISIBLE_DEVICES=0,1 && torchrun --nnodes 1 --nproc_per_node 1 recipes/finetuning/finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name ../Meta-Llama-3-8B/8B --fsdp_config.pure_bf16 --batch_size_training 1 --dataset custom_dataset --custom_dataset.file "examples/chatbot_dataset.py:get_preprocessed_custom" --output_dir ../Meta-Llama-3-8B/chatbot\(finetuned_on_15k\)/epoch1 --num_epochs 1 --save_model

However, training on multiple GPUs failed:

CUDA_VISIBLE_DEVICES=0,1 && torchrun --nnodes 1 --nproc_per_node 2 recipes/finetuning/finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name ../Meta-Llama-3-8B/8B --fsdp_config.pure_bf16 --batch_size_training 1 --dataset custom_dataset --custom_dataset.file "examples/chatbot_dataset.py:get_preprocessed_custom" --output_dir ../Meta-Llama-3-8B/chatbot\(finetuned_on_15k\)/epoch1 --num_epochs 1 --save_model

Error logs

(llama3_recipes) openwifi-lab2@openwifilab2-desktop:~/Desktop/llama/llama3/llama-recipes$ CUDA_VISIBLE_DEVICES=0,1 && torchrun --nnodes 1 --nproc_per_node 2 recipes/finetuning/finetuning.py \
    --enable_fsdp --use_peft --peft_method lora \
    --model_name ../Meta-Llama-3-8B/8B \
    --fsdp_config.pure_bf16 \
    --batch_size_training 1 \
    --dataset custom_dataset \
    --custom_dataset.file "examples/chatbot_dataset.py:get_preprocessed_custom" \
    --output_dir ../Meta-Llama-3-8B/chatbot\(finetuned_on_15k\)/epoch1 \
    --num_epochs 1 \
    --save_model
W0608 01:04:59.533000 139778692310848 torch/distributed/run.py:757] 
W0608 01:04:59.533000 139778692310848 torch/distributed/run.py:757] *****************************************
W0608 01:04:59.533000 139778692310848 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0608 01:04:59.533000 139778692310848 torch/distributed/run.py:757] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:17<00:00,  4.41s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--> Model ../Meta-Llama-3-8B/8B

--> ../Meta-Llama-3-8B/8B has 8030.261248 Million params

W0608 01:05:24.542000 139778692310848 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 16658 closing signal SIGTERM
E0608 01:05:25.809000 139778692310848 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 1 (pid: 16659) of binary: /home/openwifi-lab2/miniconda3/envs/llama3_recipes/bin/python3.12
Traceback (most recent call last):
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
recipes/finetuning/finetuning.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-08_01:05:24
  host      : openwifilab2-desktop
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 16659)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 16659
======================================================

Expected behavior

The fine-tuning script is expected to run properly on multiple GPUs.

wukaixingxp commented 3 weeks ago

Hi! Exit code -9 usually means a system out-of-memory error, as discussed here. In your case, worker 1 ran into OOM. I suggest you open another terminal, start an htop process viewer, and watch closely how your training process uses memory and how much system memory is left. Alternatively, you can use dmesg -T | egrep -i 'killed process' to get the error message mentioned in this issue. You can first try our official fine-tuning examples to see if your system is working. My guess is that your dataset may be too big to load into memory, and you should format your dataset class as an iterable-style dataset, as discussed here. Let me know if you have any additional questions.
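
For reference, here is a minimal sketch of the iterable-style idea (plain PyTorch, not the llama-recipes dataset API; the JSON-lines layout and the tokenizer argument are placeholders for illustration):

# Minimal sketch of an iterable-style dataset that streams samples from disk
# instead of loading the whole file into RAM (not the exact llama-recipes API).
# Assumes a JSON-lines file where every line has a "text" field; adjust as needed.
import json
from torch.utils.data import IterableDataset

class StreamingTextDataset(IterableDataset):
    def __init__(self, path, tokenizer, max_length=512):
        self.path = path
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __iter__(self):
        with open(self.path) as f:
            for line in f:  # one sample at a time, never the full dataset in memory
                sample = json.loads(line)
                enc = self.tokenizer(
                    sample["text"],
                    truncation=True,
                    max_length=self.max_length,
                )
                enc["labels"] = list(enc["input_ids"])
                yield enc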

1155157110 commented 3 weeks ago

Hi! Exit code -9 usually means a system out-of-memory error, as discussed here. In your case, worker 1 ran into OOM. I suggest you open another terminal, start an htop process viewer, and watch closely how your training process uses memory and how much system memory is left. Alternatively, you can use dmesg -T | egrep -i 'killed process' to get the error message mentioned in this issue. You can first try our official fine-tuning examples to see if your system is working. My guess is that your dataset may be too big to load into memory, and you should format your dataset class as an iterable-style dataset, as discussed here. Let me know if you have any additional questions.

Thanks for your reply. I tried dmesg -T | egrep -i 'killed process' and the output is

[Sun Jun  9 11:49:36 2024] Out of memory: Killed process 7815 (pt_main_thread) total-vm:45371652kB, anon-rss:31360488kB, file-rss:72116kB, shmem-rss:4kB, UID:1000 pgtables:64944kB oom_score_adj:0

So the script does run out of host memory, yet GPU utilization stays low throughout the 2-GPU run.

The dataset is only 68 MB, and it trains properly on Llama 2 with the same training arguments. Another point: when training with only 1 GPU, torchrun reports CUDA out of memory during training (after the dataset has been loaded), which is reasonable, since a single 22 GiB card cannot support the training on its own.

I have also monitored CPU and GPU usage when running the script on one GPU. CPU usage is not as high as when training with 2 GPUs: VIRT is about 30 GiB but RES is only a few GiB, while the GPU is almost fully utilized. That is what has been confusing me.

HamidShojanazeri commented 3 weeks ago

@1155157110 You would need to use the --quantization flag: you are training on 2080 Ti cards with 22 GB each, i.e. 44 GB in total, and that would not cut it without quantization. Please let us know if quantization lets you bypass the issue.
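
For a rough picture of what 8-bit loading plus LoRA looks like in plain transformers/peft (a sketch only, not the exact code path llama-recipes takes; the target modules and the single-device placement below are assumptions for illustration):

# Sketch: load the base model in 8-bit and attach LoRA adapters.
# Roughly what --quantization together with --use_peft amounts to,
# but not the llama-recipes implementation itself.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "../Meta-Llama-3-8B/8B",                       # path taken from the command above
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map={"": 0},                            # assumption: whole model on one device
)
model = prepare_model_for_kbit_training(model)     # prepares norms/embeddings for k-bit training
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],           # assumption: typical LoRA target modules
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                 # only the LoRA weights remain trainable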

1155157110 commented 3 weeks ago

@1155157110 You would need to use the --quantization flag: you are training on 2080 Ti cards with 22 GB each, i.e. 44 GB in total, and that would not cut it without quantization. Please let us know if quantization lets you bypass the issue.

Thanks for your reply! With the --quantization flag added, my training script is:

CUDA_VISIBLE_DEVICES=0,1 && torchrun --nnodes 1 --nproc_per_node 2 recipes/finetuning/finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name ../Meta-Llama-3-8B/8B --fsdp_config.pure_bf16 --batch_size_training 1 --dataset custom_dataset --custom_dataset.file "examples/chatbot_dataset.py:get_preprocessed_custom" --output_dir ../Meta-Llama-3-8B/chatbot\(finetuned_on_15k\)/epoch1 --num_epochs 1 --save_model --quantization

And I get the following error:

W0611 13:54:17.695000 140427661633344 torch/distributed/run.py:757] 
W0611 13:54:17.695000 140427661633344 torch/distributed/run.py:757] *****************************************
W0611 13:54:17.695000 140427661633344 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0611 13:54:17.695000 140427661633344 torch/distributed/run.py:757] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.23s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:17<00:00,  4.25s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/openwifi-lab2/Desktop/llama/llama3/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank1]:     fire.Fire(main)
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 143, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 477, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/Desktop/llama/llama3/llama-recipes/src/llama_recipes/finetuning.py", line 151, in main
[rank1]:     model.to(torch.bfloat16)
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/accelerate/big_modeling.py", line 456, in wrapper
[rank1]:     return fn(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2702, in to
[rank1]:     raise ValueError(
[rank1]: ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--> Model ../Meta-Llama-3-8B/8B

--> ../Meta-Llama-3-8B/8B has 1050.939392 Million params

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/openwifi-lab2/Desktop/llama/llama3/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank0]:     fire.Fire(main)
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/Desktop/llama/llama3/llama-recipes/src/llama_recipes/finetuning.py", line 151, in main
[rank0]:     model.to(torch.bfloat16)
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/accelerate/big_modeling.py", line 456, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2702, in to
[rank0]:     raise ValueError(
[rank0]: ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
E0611 13:54:42.701000 140427661633344 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 901740) of binary: /home/openwifi-lab2/miniconda3/envs/llama3_recipes/bin/python3.12
Traceback (most recent call last):
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
recipes/finetuning/finetuning.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-11_13:54:42
  host      : openwifilab2-desktop
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 901741)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-11_13:54:42
  host      : openwifilab2-desktop
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 901740)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I have used --fsdp_config.pure_bf16 in my script; however, it seems that llama-recipes passes load_in_8bit=True when loading the pretrained model? I have upgraded accelerate to the latest version (accelerate==0.31.0).
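
For context, the ValueError above comes from finetuning.py calling model.to(torch.bfloat16) on the 8-bit model; below is a minimal sketch of the kind of guard that would skip the cast for quantized models (just an illustration of the conflict, not a patch I have applied):

# Illustration only: .to() raises for bitsandbytes-quantized models, so a dtype
# cast such as the pure_bf16 path can only apply to non-quantized models.
import torch

def maybe_cast_to_bf16(model):
    quantized = getattr(model, "is_loaded_in_8bit", False) or getattr(
        model, "is_loaded_in_4bit", False
    )
    if not quantized:
        model = model.to(torch.bfloat16)
    return model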

LeonChengg commented 1 day ago

I ran into the same problem with my dataset. When I run my own code with _get_customdatasets:myfunctions, it works on a single GPU but does not work on multiple GPUs (NVIDIA RTX A6000). I can run the example dataset, OpenAssistant/oasst1, but my own dataset, which is smaller, does not work on multiple GPUs.

wukaixingxp commented 1 day ago

Hi! --quantization will set load_in_8bit=True. Can you try --quantization without --fsdp_config.pure_bf16 and let me know if there is any problem? @1155157110 @LeonChengg

1155157110 commented 51 minutes ago

Hi! --quantization will set load_in_8bit=True. Can you try --quantization without --fsdp_config.pure_bf16 and let me know if there is any problem? @1155157110 @LeonChengg

Thanks for your reply! However, the training script with --quantization and without --fsdp_config.pure_bf16 produces the following errors:

$ CUDA_VISIBLE_DEVICES=0,1 && torchrun --nnodes 1 --nproc_per_node 2 recipes/finetuning/finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name ../Meta-Llama-3-8B/8B --batch_size_training 1 --dataset custom_dataset --custom_dataset.file "examples/chatbot_dataset.py:get_preprocessed_custom" --output_dir ../Meta-Llama-3-8B/chatbot\(finetuned_on_15k\)/epoch1 --num_epochs 1 --save_model --quantization
W0703 15:19:50.068000 140496818579264 torch/distributed/run.py:757] 
W0703 15:19:50.068000 140496818579264 torch/distributed/run.py:757] *****************************************
W0703 15:19:50.068000 140496818579264 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0703 15:19:50.068000 140496818579264 torch/distributed/run.py:757] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.17s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.19s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.0424
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/openwifi-lab2/Desktop/llama/llama3/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank1]:     fire.Fire(main)
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 143, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 477, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/Desktop/llama/llama3/llama-recipes/src/llama_recipes/finetuning.py", line 186, in main
[rank1]:     model = FSDP(
[rank1]:             ^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
[rank1]:     _auto_wrap(
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
[rank1]:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank1]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]:                                         ^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank1]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]:                                         ^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank1]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]:                                         ^^^^^^^^^^^^^^^^
[rank1]:   [Previous line repeated 6 more times]
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
[rank1]:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
[rank1]:     return wrapper_cls(module, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
[rank1]:     _init_param_handle_from_module(
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/_init_utils.py", line 582, in _init_param_handle_from_module
[rank1]:     state.compute_device = _get_compute_device(
[rank1]:                            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/_init_utils.py", line 1045, in _get_compute_device
[rank1]:     raise ValueError(
[rank1]: ValueError: Inconsistent compute device and `device_id` on rank 1: cuda:0 vs cuda:1
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--> Model ../Meta-Llama-3-8B/8B

--> ../Meta-Llama-3-8B/8B has 1050.939392 Million params

trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.0424
bFloat16 enabled for mixed precision - using bfSixteen policy
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/openwifi-lab2/Desktop/llama/llama3/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank0]:     fire.Fire(main)
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/Desktop/llama/llama3/llama-recipes/src/llama_recipes/finetuning.py", line 186, in main
[rank0]:     model = FSDP(
[rank0]:             ^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
[rank0]:     _auto_wrap(
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
[rank0]:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:                                         ^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:                                         ^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:                                         ^^^^^^^^^^^^^^^^
[rank0]:   [Previous line repeated 2 more times]
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
[rank0]:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
[rank0]:     return wrapper_cls(module, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
[rank0]:     _init_param_handle_from_module(
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
[rank0]:     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
[rank0]:     handle = FlatParamHandle(
[rank0]:              ^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
[rank0]:     self._init_flat_param_and_metadata(
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
[rank0]:     ) = self._validate_tensors_to_flatten(params)
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/fsdp/_flat_param.py", line 768, in _validate_tensors_to_flatten
[rank0]:     raise ValueError("Cannot flatten integer dtype tensors")
[rank0]: ValueError: Cannot flatten integer dtype tensors
E0703 15:20:15.074000 140496818579264 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2692379) of binary: /home/openwifi-lab2/miniconda3/envs/llama3_recipes/bin/python3.12
Traceback (most recent call last):
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/openwifi-lab2/miniconda3/envs/llama3_recipes/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
recipes/finetuning/finetuning.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-03_15:20:15
  host      : openwifilab2-desktop
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2692380)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_15:20:15
  host      : openwifilab2-desktop
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2692379)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
1155157110 commented 44 minutes ago

I found similar errors (ValueError: Cannot flatten integer dtype tensors #240) in the llama-recipes issues. I tried their solution (turning off load_in_8bit), but training still does not work: with load_in_8bit off, the training process uses up all the GPU memory of one card, leaving the other one unused.