meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama 3 with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default and custom datasets for applications such as summarization and Q&A, along with a number of candidate inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama 3 for WhatsApp & Messenger.

ValueError: Cannot flatten integer dtype tensors #240

Closed · humza-sami closed this 1 week ago

humza-sami commented 10 months ago

System Info

PyTorch 2.1.0+cu121, 4x A4000 GPUs

Information

🐛 Describe the bug

I am trying to run the examples/finetuning.py script without any changes, but it's giving me the following error.

Command:

torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --model_name ../CodeLlama-7b-Instruct/hug --use_peft --peft_method lora --use_fp16 --output_dir ../output --enable_fsdp

Results:

Error logs

trainable params: 4,194,304 || all params: 6,742,740,992 || trainable%: 0.06220473254091146                                                                                                                        
Traceback (most recent call last):                                                                                                                                                                                 
  File "examples/finetuning.py", line 8, in <module>                                                                                                                                                               
    fire.Fire(main)                                                                                                                                                                                                
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/fire/core.py", line 141, in Fire                                                                                                             
    component_trace = _Fire(component, args, parsed_flag_args, context, name)                                                                                                                                      
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire                                                                                                            
    component, remaining_args = _CallAndUpdateTrace(                                                                                                                                                               
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace                                                                                              
    component = fn(*varargs, **kwargs)                                                                                                                                                                             
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/llama_recipes/finetuning.py", line 144, in main                                                                                              
    model = FSDP(                                                                                                                                                                                                  
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 463, in __init__                                                                
    _auto_wrap(                                                                                                                                                                                                    
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap                                                                              
    _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]                                                                                                                              
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 537, in _recursive_wrap                                                                                
    wrapped_child, num_wrapped_params = _recursive_wrap(                                                                                                                                                           
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 537, in _recursive_wrap                                                                                
    wrapped_child, num_wrapped_params = _recursive_wrap(                                                                                                                                                           
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 537, in _recursive_wrap                                                                                
    wrapped_child, num_wrapped_params = _recursive_wrap(                                                                                                                                                           
  [Previous line repeated 2 more times]                                                                                                                                                                            
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 555, in _recursive_wrap                                                                                
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel                                                                                                                                                  
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 484, in _wrap                                                                                          
    return wrapper_cls(module, **kwargs)                                                                                                                                                                           
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 487, in __init__                                                                
    _init_param_handle_from_module(                                                                                                                                                                                
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 519, in _init_param_handle_from_module                                                          
    _init_param_handle_from_params(state, managed_params, fully_sharded_module)                                                                                                                                    
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 531, in _init_param_handle_from_params                                                          
    handle = FlatParamHandle(                                                                                                                                                                                      
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 537, in __init__                                                                                 
    self._init_flat_param_and_metadata(                                                                                                                                                                            
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 585, in _init_flat_param_and_metadata                                                            
    ) = self._validate_tensors_to_flatten(params)
  File "/root/primisai/vast_ai_envoirment/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 720, in _validate_tensors_to_flatten
    raise ValueError("Cannot flatten integer dtype tensors")
ValueError: Cannot flatten integer dtype tensors

Expected behavior

I expect model training to run without any issues, since I have not changed anything.

humza-sami commented 10 months ago

@HamidShojanazeri Please check

HamidShojanazeri commented 10 months ago

@Humza1996 will take a look; however, we haven't tested Code Llama fine-tuning in the recipes yet, so I'm not sure if it would work out of the box.

lihkinVerma commented 10 months ago

Facing the same error.

vTuanpham commented 9 months ago

Seems to be related to bitsandbytes. Turn off load_in_4bit or load_in_8bit and it seems to work correctly.
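
A rough sketch of what that change looks like (my own snippet, not from the recipes; the model path is just the one from the command above, and your setup may differ):

```python
# Sketch of the workaround: load the model in a floating-point dtype instead of a
# bitsandbytes-quantized one. Quantized (load_in_8bit / load_in_4bit) weights are
# stored as integer tensors, which is exactly what trips FSDP's
# "Cannot flatten integer dtype tensors" check when it builds the flat parameter.
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "../CodeLlama-7b-Instruct/hug",  # model path taken from the command in this issue
    # load_in_8bit=True,             # leave quantized loading disabled when wrapping with FSDP
    torch_dtype=torch.bfloat16,      # plain floating-point weights flatten fine under FSDP
)
```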

Flemington8 commented 3 months ago

Facing the same error.

wukaixingxp commented 2 months ago

Hi! It seems that FSDP works with QLoRA now. While we are working to add more documentation about this soon, for now, please check the example script here.
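
Roughly, the key ingredient (as far as I understand it, and assuming a recent transformers + bitsandbytes with FSDP/QLoRA support) is storing the 4-bit weights in a floating-point container so FSDP never has to flatten integer tensors. A sketch, not the linked example script:

```python
# Sketch only: assumes a recent transformers/bitsandbytes that support FSDP + QLoRA.
# bnb_4bit_quant_storage keeps the packed 4-bit weights inside bf16 storage, so FSDP
# only ever sees floating-point parameters when it builds the flat parameter.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # float storage dtype -> flattenable by FSDP
)

model = AutoModelForCausalLM.from_pretrained(
    "../CodeLlama-7b-Instruct/hug",  # model path from this thread; any HF causal LM should work
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
```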

HamidShojanazeri commented 2 months ago

@Flemington7, Code Llama has not been tested, but for one thing, I wonder if you are running into the same issue with --pure_bf16? BTW, just to note: if you are looking at code assistant/generation applications, Llama 3 by itself is very performant in that space, so you won't need Code Llama for this case. Infilling and code completion still require Code Llama.
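
For reference, those flags map onto the training config that examples/finetuning.py builds via fire. A rough sketch of the equivalent settings (field names assumed to mirror the CLI flags used in this thread):

```python
# Rough sketch only: config field names are assumed to mirror the CLI flags above.
from llama_recipes.configs import train_config as TRAIN_CONFIG

cfg = TRAIN_CONFIG()
cfg.model_name = "../CodeLlama-7b-Instruct/hug"
cfg.enable_fsdp = True
cfg.use_peft = True
cfg.peft_method = "lora"
cfg.use_fp16 = False   # drop fp16 mixed precision
cfg.pure_bf16 = True   # keep parameters in bf16 so FSDP only flattens floating-point tensors
```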

ghost commented 2 months ago

> Hi! It seems that FSDP works with QLoRA now. While we are working to add more documentation about this soon, for now, please check the example script here.

I got the same issue even though I am using the script run_peft_qlora_fsdp.sh. Which parameter do I need to change to make it work?

init27 commented 1 week ago

Closing as this was solved here. Please feel free to re-open if you have any questions or a different issue.

Thanks!