huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Accelerate test fails: Exception: Could not find the transformer layer class to wrap in the model #2872

Closed MikaSie closed 2 months ago

MikaSie commented 3 months ago

System Info

- `Accelerate` version: 0.31.0
- Platform: Linux-5.4.0-171-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /workspace/Thesis/venv/bin/accelerate
- Python version: 3.10.13
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 31.28 GB
- GPU type: NVIDIA A40
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - fsdp_config: {'fsdp_activation_checkpointing': False, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': 'LlamaDecoderLayer, LlamaMLP', 'fsdp_use_orig_params': False}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Reproduction

Packages

transformers==4.41.2
peft==0.11.1
datasets==2.20.0
accelerate==0.31.0
evaluate==0.4.1
bitsandbytes==0.43.1
huggingface_hub==0.23.4
trl==0.9.4

Code

quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_storage=torch.bfloat16,
            )

device_index = Accelerator().process_index
device_map = {"": device_index}

model = AutoModelForCausalLM.from_pretrained(
            'meta-llama/Meta-Llama-3-8B', 
            device_map=device_map,
            quantization_config=quantization_config,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
            use_cache=False 
            )

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')
tokenizer.pad_token = tokenizer.eos_token

lora_config = LoraConfig(
            r= 8,
            lora_alpha=16,
            lora_dropout=0.1,
            target_modules = ["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
            task_type= 'CAUSAL_LM',
            bias= 'none',

        )

training_args = TrainingArguments(
            output_dir = os.path.join('results', model_id, 'output'),
            num_train_epochs = 40,
            per_device_train_batch_size = 1,
            per_device_eval_batch_size = 1, 
            gradient_accumulation_steps = True,
            warmup_ratio = args.warmup_ratio,
            weight_decay = args.weight_decay,
            logging_dir = os.path.join('results', model_id, 'logs'),
            remove_unused_columns = False,        
            load_best_model_at_end = True,
            metric_for_best_model = True,
            save_strategy= "epoch",
            save_total_limit= 2,
            evaluation_strategy = "epoch",
            label_names=["labels"],
            report_to = "wandb",
            logging_strategy = "epoch",
            run_name = model_id,
            eval_accumulation_steps = 1,
            hub_model_id = f"{model_id}",
            gradient_checkpointing= True,
            fp16= args.fp16,
            bf16= args.bf16,
            gradient_checkpointing_kwargs= {'use_reentrant': True},
        )

trainer = SFTTrainer(
            model = model, 
            tokenizer = tokenizer, 
            args = training_args,
            train_dataset = dataset["train"],
            eval_dataset = dataset["validation"],
            max_seq_length = context_length_abstractive_model, #8192 
            callbacks = [EarlyStoppingCallback(early_stopping_patience = args.early_stopping_patience)],
            peft_config = lora_config,
            packing= True
            )

if getattr(trainer.accelerator.state, "fsdp_plugin", None):
    from peft.utils.other import fsdp_auto_wrap_policy
    fsdp_plugin = trainer.accelerator.state.fsdp_plugin
    fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(trainer.model)

trainer.train()
print("Training done")
if trainer.is_fsdp_enabled:
     trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

trainer.save_model(output_dir = os.path.join('results', model_id, 'model'))
Commands run:

accelerate launch training.py --bf16
accelerate test

Expected behavior

tldr;

I'm encountering an error when running accelerate test. This happens both with and without setting fsdp_transformer_layer_cls_to_wrap to LlamaDecoderLayer, LlamaMLP. The same issue seems to happen when I run my training script:

Exception: Could not find the transformer layer class to wrap in the model.

ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details

Introduction

Hi! I'm trying to fine-tune Llama 3 8B on a summarization dataset of about 1500 instances. The dataset contains long documents, often over 8K tokens. I want to use FSDP + QLoRA to fine-tune Llama 3 8B.

I'm following these guides as inspiration: the bitsandbytes guide, the Phil Schmid guide, the Hugging Face Accelerate guide, the Hugging Face PEFT guide, and the Hugging Face bitsandbytes guide.

First issue:

At first, I didn't set fsdp_transformer_layer_cls_to_wrap, as it isn't defined in the mentioned guides. With this setup my training script starts training, but I'm then unable to save the model or push it to the Hub afterwards. I have had a full training run before and could see the loss decreasing. The YAML file looks as follows in this scenario:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

When running:

accelerate launch training.py --bf16

I get the following output:

[rank0]:W0619 14:54:29.194000 140411100997440 torch/distributed/fsdp/_state_dict_utils.py:622] Did not find param with FQN _fsdp_wrapped_module.base_model.model.model.layers.31._fsdp_wrapped_module.post_attention_layernorm.weight, skipping it. The weight will not be filled if you expect it to be.

{'train_runtime': 75.9576, 'train_samples_per_second': 0.211, 'train_steps_per_second': 0.013, 'train_loss': 14.044451713562012, 'epoch': 0.01}                                                                                                                                              
Training done
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:13<00:00, 44.10s/it]
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Traceback (most recent call last):                                                                                                                                                                                                                                                           
  File "/workspace/Thesis/training.py", line 709, in <module>                                                                                                                                                                                                                                
    trainer.train()                                                                                                                                                                                                                                                                          
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 440, in train                                                                                                                                                                                  
    output = super().train(*args, **kwargs)                                                                                                                                                                                                                                                  
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train                                                                                                                                                                                    
    return inner_training_loop(                                                                                                                                                                                                                                                              
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2427, in _inner_training_loop                                                                                                                                                                     
    self.control = self.callback_handler.on_train_end(args, self.state, self.control)                                                                                                                                                                                                        
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer_callback.py", line 464, in on_train_end                                                                                                                                                                     
    return self.call_event("on_train_end", args, state, control)                                                                                                                                                                                                                             
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer_callback.py", line 508, in call_event                                                                                                                                                                       
    result = getattr(callback, event)(                                                                                                                                                                                                                                                       
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 850, in on_train_end                                                                                                                                                       
    fake_trainer = Trainer(args=args, model=model, tokenizer=tokenizer)                                                                                                                                                                                                                      
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer.py", line 476, in __init__                                                                                                                                                                                  
    raise ValueError(                                                                                                                                                                                                                                                                        
ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details       

There is a warning, a print statement and finally the ValueError. The warning is repeated many times and seems to come from here.

It seems like the model isn't wrapped properly or that something is wrong with the quantization, as the ValueError indicates. But I don't think it has anything to do with quantization because when I print my model I see the following:

PeftModelForCausalLM(                                                                                                                                                                                                                                                         
  (base_model): LoraModel(                                                                                                                                                                                                                                                                   
    (model): LlamaForCausalLM(                                                                                                                                                                                                                                                               
      (model): LlamaModel(                                                                                                                                                                                                                                                                   
        (embed_tokens): Embedding(128256, 4096)                                                                                                                                                                                                                                              
        (layers): ModuleList(                                                                                                                                                                                                                                                                
          (0-31): 32 x LlamaDecoderLayer(                                                                                                                                                                                                                                                    
            (self_attn): LlamaFlashAttention2(                                                                                                                                                                                                                                               
              (q_proj): lora.Linear4bit(                                                                                                                                                                                                                                                     
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)                                                                                                                                                                                                    
                (lora_dropout): ModuleDict(                                                                                                                                                                                                                                                  
                  (default): Dropout(p=0.1, inplace=False)                                                                                                                                                                                                                                   
                )                                                                                                                                                                                                                                                                            
                (lora_A): ModuleDict(                                                                                                                                                                                                                                                        
                  (default): Linear(in_features=4096, out_features=8, bias=False)                                                                                                                                                                                                            
                )                                                                                                                                                                                                                                                                            
                (lora_B): ModuleDict(                                                                                                                                                                                                                                                        
                  (default): Linear(in_features=8, out_features=4096, bias=False)                                                                                                                                                                                                            
                )                                                                                                                                                                                                                                                                            
                (lora_embedding_A): ParameterDict()                                                                                                                                                                                                                                          
                (lora_embedding_B): ParameterDict()                                                                                                                                                                                                                                          
              )                                                                                                                                                                                                                                                                              
              (k_proj): lora.Linear4bit(                                                                                                                                                                                                                                                     
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)                                                                                                                                                                                                    
                (lora_dropout): ModuleDict(                                                                                                                                                                                                                                                  
                  (default): Dropout(p=0.1, inplace=False)                                                                                                                                                                                                                                   
                )                                                                                                                                                                                                                                                                            
                (lora_A): ModuleDict(                                                                                                                                                                                                                                                        
                  (default): Linear(in_features=4096, out_features=8, bias=False)                                                                                                                                                                                                            
                )                                                                                                                                                                                                                                                                            
                (lora_B): ModuleDict(                                                                                                                                                                                                                                                        
                  (default): Linear(in_features=8, out_features=1024, bias=False)                                                                                                                                                                                                            
                )                                                                                                                                                                                                                                                                            
                (lora_embedding_A): ParameterDict()                                                                                                                                                                                                                                          
                (lora_embedding_B): ParameterDict()                                                                                                                                                                                                                                          
              )                                                                                                                                                                                                                                                                              
              (v_proj): lora.Linear4bit(                                                                                                                                                                                                                                                     
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)                                                                                                                                                                                                    
                (lora_dropout): ModuleDict(                                                                                                                                                                                                                                                  
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=14336, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (up_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=14336, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (down_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=14336, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=14336, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (act_fn): SiLU()
            )
            (input_layernorm): LlamaRMSNorm()
            (post_attention_layernorm): LlamaRMSNorm()
          )
        )
        (norm): LlamaRMSNorm()
      )
      (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
    )
  )
)

Second issue:

I couldn't really figure out what the issue was, so after a while I decided to check whether my accelerate config was set up properly by running

accelerate test

And I get the following error:

stderr: [rank2]: Traceback (most recent call last):                                                                                                                                                                                                                                          
stderr: [rank2]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 826, in <module>                                                                                                                                            
stderr: [rank2]:     main()                                                                                                                                                                                                                                                                  
stderr: [rank2]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 812, in main                                                                                                                                                
stderr: [rank2]:     training_check(use_seedable_sampler=False)                                                                                                                                                                                                                              
stderr: [rank2]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 435, in training_check                                                                                                                                      
stderr: [rank2]:     train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)                                                                                                                                                                                            
stderr: [rank2]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1299, in prepare                                                                                                                                                               
stderr: [rank2]:     result = tuple(                                                                                                                                                                                                                                                         
stderr: [rank2]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1300, in <genexpr>                                                                                                                                                             
stderr: [rank2]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)                                                                                                                                                                   
stderr: [rank2]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1176, in _prepare_one                                                                                                                                                          
stderr: [rank2]:     return self.prepare_model(obj, device_placement=device_placement)                                                                                                                                                                                                       
stderr: [rank2]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1450, in prepare_model                                                                                                                                                         
stderr: [rank2]:     self.state.fsdp_plugin.set_auto_wrap_policy(model)                                                                                                                                                                                                                      
stderr: [rank2]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/utils/dataclasses.py", line 1189, in set_auto_wrap_policy                                                                                                                                            
stderr: [rank2]:     raise Exception("Could not find the transformer layer class to wrap in the model.")                                                                                                                                                                                     
stderr: [rank2]: Exception: Could not find the transformer layer class to wrap in the model.          

I thought the issue was that I hadn't set fsdp_transformer_layer_cls_to_wrap. After checking out https://github.com/huggingface/accelerate/pull/1947 and https://github.com/tatsu-lab/stanford_alpaca/issues/58, I set fsdp_transformer_layer_cls_to_wrap to LlamaDecoderLayer, LlamaMLP (like in the system info above). I thought this would solve the issue, but when I run:

accelerate launch training.py --bf16

This error occurs:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/workspace/Thesis/training.py", line 682, in <module>
[rank2]:     fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(trainer.model)
[rank2]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/peft/utils/other.py", line 428, in fsdp_auto_wrap_policy
[rank2]:     raise Exception("Could not find the transformer layer class to wrap in the model.")
[rank2]: Exception: Could not find the transformer layer class to wrap in the model.

This is again the same issue that occurs when running accelerate test with and without setting fsdp_transformer_layer_cls_to_wrap:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 826, in <module>
[rank3]:     main()
[rank3]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 812, in main
[rank3]:     training_check(use_seedable_sampler=False)
[rank3]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 435, in training_check
[rank3]:     train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
[rank3]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1299, in prepare
[rank3]:     result = tuple(
[rank3]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1300, in <genexpr>
[rank3]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1176, in _prepare_one
[rank3]:     return self.prepare_model(obj, device_placement=device_placement)
[rank3]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1450, in prepare_model
[rank3]:     self.state.fsdp_plugin.set_auto_wrap_policy(model)
[rank3]:   File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/utils/dataclasses.py", line 1189, in set_auto_wrap_policy
[rank3]:     raise Exception("Could not find the transformer layer class to wrap in the model.")
[rank3]: Exception: Could not find the transformer layer class to wrap in the model.

As you can see, the issue is that the transformer layer classes cannot be found.

I think that this has something to do with how the model is wrapped in FSDP but I'm not sure what is going on.

Please let me know if you have any ideas, as I'm quite stuck at the moment! I also find it odd that accelerate test fails both with and without fsdp_transformer_layer_cls_to_wrap set.

MikaSie commented 3 months ago

@muellerzr I have found an issue with accelerate test, and it seems like the problem with my training script has something to do with transformers:

I saw that fsdp_transformer_layer_cls_to_wrap is deprecated. It also seems like FSDP_TRANSFORMER_CLS_TO_WRAP isn't set in this file. See the code snippet:

def set_auto_wrap_policy(self, model):
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy, transformer_auto_wrap_policy

    default_transformer_cls_names_to_wrap = (
        ",".join(model._no_split_modules) if getattr(model, "_no_split_modules", None) is not None else ""
    )
    if self.auto_wrap_policy is None:
        auto_wrap_policy = os.environ.get("FSDP_AUTO_WRAP_POLICY", "NO_WRAP")
        if auto_wrap_policy == FSDP_AUTO_WRAP_POLICY[0]:
            transformer_cls_names_to_wrap = os.environ.get(
                "FSDP_TRANSFORMER_CLS_TO_WRAP", default_transformer_cls_names_to_wrap
            ).split(",")
            transformer_cls_to_wrap = set()
            for layer_class in transformer_cls_names_to_wrap:
                transformer_cls = get_module_class_from_name(model, layer_class)
                if transformer_cls is None:
                    raise Exception("Could not find the transformer layer class to wrap in the model.")
                else:
                    transformer_cls_to_wrap.add(transformer_cls)

Maybe it would be good to remove FSDP_TRANSFORMER_CLS_TO_WRAP? It isn't imported from .constants, while FSDP_AUTO_WRAP_POLICY is.

I ran through the code and it seems that set_auto_wrap_policy works fine when a suitable model is given to it. I tried giving it Llama 3 as the model, and LlamaDecoderLayer is properly returned. So it seems that the issue in my personal training script is not with accelerate but with transformers.
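For illustration, here is a rough sketch of such a check; a tiny randomly initialized Llama config stands in for the real 8B checkpoint (purely an assumption to keep the snippet lightweight), and the class-name lookup that get_module_class_from_name performs is inlined:

from transformers import LlamaConfig, LlamaForCausalLM

# Tiny random Llama so the check runs without downloading meta-llama/Meta-Llama-3-8B.
config = LlamaConfig(
    hidden_size=64, intermediate_size=128, num_hidden_layers=2,
    num_attention_heads=4, num_key_value_heads=4, vocab_size=128,
)
model = LlamaForCausalLM(config)

print(model._no_split_modules)  # ['LlamaDecoderLayer'] -> the default class name to wrap

# Same idea as get_module_class_from_name: resolve a module class by its name.
target = next(
    (m.__class__ for m in model.modules() if m.__class__.__name__ == "LlamaDecoderLayer"),
    None,
)
print(target)  # <class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>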

But then the issue with accelerate test still exists. It seems to fail because the Regression model that is given to accelerator.prepare() doesn't have any _no_split_modules, as can be seen here.

It might be good to fix this, as running accelerate test won't work properly this way. It gives the false impression that the accelerate config isn't set up properly while in fact it is (probably) set up just fine! I think this only happens when FSDP is used, since that is what searches for _no_split_modules. One idea would be to give the regression model a basic BertLayer as its _no_split_modules so the test passes.
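For illustration, a minimal sketch of why the test model trips the exception; the toy module below is only a stand-in for accelerate's RegressionModel and, like it, has no _no_split_modules attribute:

import torch.nn as nn

class ToyRegressionModel(nn.Module):  # stand-in for accelerate's test RegressionModel
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

model = ToyRegressionModel()

# This mirrors the fallback in set_auto_wrap_policy quoted above.
default_transformer_cls_names_to_wrap = (
    ",".join(model._no_split_modules) if getattr(model, "_no_split_modules", None) is not None else ""
)
print(default_transformer_cls_names_to_wrap.split(","))
# ['']: no module class is ever named '', so get_module_class_from_name returns None
# and "Could not find the transformer layer class to wrap in the model." is raised.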

BenjaminBossan commented 3 months ago

Thanks for this detailed report. Debugging this type of issue can be really difficult, props for trying out a bunch of different things.

At first glance, I can't spot any obvious mistakes in your script. The model repr also looks good. From my own testing (same model and LoRA config), I can tell that the PEFT examples for FSDP QLoRA that you linked work for me (they use TRANSFORMER_BASED_WRAP). Could you check and confirm that they work for you too, or are you already encountering issues with that example?

One difference that I spotted is that I used "bnb_4bit_compute_dtype": "float32", "bnb_4bit_quant_storage": "float32", but that's because the machine I used has no support for bf16. Given the error you report, that's unlikely to be the reason.

One idea to debug this issue further: Could you edit your local accelerate code in this line and print the values for model and transformer_cls_names_to_wrap?

MikaSie commented 3 months ago

Thanks for your response!

I have just tested the Hugging Face bitsandbytes guide. This script seems to work (and wow, it is much better than mine haha). With Llama 3 8B, everything can be saved locally, which is exactly where my own script fails, in a callback at the end of trainer.train().

I think I have found where the issue is in my script. I believe it's in PEFT: for some reason the model isn't seen as a PEFT model. When I call _is_peft_model (from transformers.trainer) on my model, it returns False.

This causes my script to error on this line in the trainer.py file.

For some reason my model, which is given to the SFTTrainer alongside the LoraConfig, isn't truly a PEFT model. I have added this snippet in the train.py file of the guide:

trainer.accelerator.print(f"{trainer.model}")
trainer.model.print_trainable_parameters()
print(_is_peft_model(trainer.model))

This returns True!! So it seems that my model can't be saved because it isn't a PEFT model.

I'm not sure yet why my model isn't a PEFT model but I'm guessing this has something to do with how I instantiate the model in my script:

device_index = Accelerator().process_index
device_map = {"": device_index}

model = AutoModelForCausalLM.from_pretrained(
            'meta-llama/Meta-Llama-3-8B', 
            device_map=device_map,
            quantization_config=quantization_config,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
            use_cache=False 
            )

I'm guessing that the device_map, or more specifically Accelerator().process_index, breaks something. I noticed that none of the guides set device_map. But if I run my script without Accelerator().process_index, it can't run, as I get this error:

ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device()}` or `device_map={'':torch.xpu.current_device()}`

I had encountered this before and decided to prevent this ValueError by using this and this.
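For completeness, a hedged sketch of the alternative that the error message itself suggests; note that torch.cuda.current_device() only differs per rank once each process has selected its GPU (which creating the Accelerator does in a distributed run), so this is essentially the same thing as using process_index:

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

accelerator = Accelerator()  # in a distributed run this also selects this process's CUDA device
device_map = {"": torch.cuda.current_device()}  # the form suggested by the error message;
# on a single node this matches {"": accelerator.process_index}

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    device_map=device_map,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)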

I find it odd that Accelerator().process_index is required, as it isn't used in any of the guides I referenced. I also noticed that my nvidia-smi looks like this:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:01:00.0 Off |                    0 |
|  0%   28C    P0              68W / 300W |   3768MiB / 46068MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     On  | 00000000:02:00.0 Off |                    0 |
|  0%   28C    P0              67W / 300W |   2968MiB / 46068MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                     On  | 00000000:03:00.0 Off |                    0 |
|  0%   29C    P0              68W / 300W |   2968MiB / 46068MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A40                     On  | 00000000:04:00.0 Off |                    0 |
|  0%   29C    P0              70W / 300W |   2968MiB / 46068MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

When I compared it to the nvidia-smi output from the Hugging Face guide, I saw that the GPUs were perfectly balanced; i.e. every GPU had the exact same memory usage. This could of course also be caused by something else in my script that is placed on GPU 0.

Concerning the accelerate test issue: I'll try to do that asap. Currently, I'm finishing my master's thesis and really need to get Llama 3 working haha! Once I get the model training, I'll see what that print statement returns and get back to you.

BenjaminBossan commented 3 months ago

This script seems to work (and wow, it is much better than mine haha). With Llama 3 8B, everything can be saved locally, which is exactly where my own script fails, in a callback at the end of trainer.train().

Glad that you got it to work.

I have just tested the Hugging Face bitsandbytes guide

I have added this snippet in the train.py file of the guide

I'm confused, first you reference the bnb guide but then you mention the PEFT guide. Which one was the one that worked? This might be of interest for future users who encounter the same issue.

For some reason my model, which is given to the SFTTrainer alongside the LoraConfig, isn't truly a PEFT model.

How did you test that? The model is not a PEFT model right after you call AutoModelForCausalLM.from_pretrained(...), but that is expected. Since you pass the LoraConfig to SFTTrainer, it should be transformed into a PEFT model under the hood.

have added this snippet [...] This returns True!!

And when you run the same snippet in your script, does it return False?

I noticed that none of the guides set device_map. But if I run my script without Accelerator().process_index, it can't run, as I get this error:

Hmm, I can't really help you with that, hopefully @muellerzr can tell us more once he's back at work.

Concerning the accelerate test issue: I'll try to do that asap. Currently, I'm finishing my master's thesis and really need to get Llama 3 working haha! Once I get the model training, I'll see what that print statement returns and get back to you.

Don't worry too much about that. As long as you get your model to train, we're good.

MikaSie commented 3 months ago

Hi!

I've added the print statements on the line you pointed to. The print statements are as follows:

def set_auto_wrap_policy(self, model):
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy, transformer_auto_wrap_policy

    default_transformer_cls_names_to_wrap = (
        ",".join(model._no_split_modules) if getattr(model, "_no_split_modules", None) is not None else ""
    )
    if self.auto_wrap_policy is None:
        auto_wrap_policy = os.environ.get("FSDP_AUTO_WRAP_POLICY", "NO_WRAP")
        if auto_wrap_policy == FSDP_AUTO_WRAP_POLICY[0]:
            transformer_cls_names_to_wrap = os.environ.get(
                "FSDP_TRANSFORMER_CLS_TO_WRAP", default_transformer_cls_names_to_wrap
            ).split(",")
            transformer_cls_to_wrap = set()
            print(f"transformer_cls_names_to_wrap: {transformer_cls_names_to_wrap}")
            print(f"Model: {model}")

            for layer_class in transformer_cls_names_to_wrap:
                transformer_cls = get_module_class_from_name(model, layer_class)
                if transformer_cls is None:
                    raise Exception("Could not find the transformer layer class to wrap in the model.")
                else:
                    transformer_cls_to_wrap.add(transformer_cls)

With the following output when I run accelerate test:

stdout: Model: RegressionModel()
stdout: transformer_cls_names_to_wrap: ['']

So it seems that the class-name list contains only an empty string, which means transformer_cls is set to None and the error is raised.

MikaSie commented 3 months ago

I'm confused, first you reference the bnb guide but then you mention the PEFT guide. Which one was the one that worked? This might be of interest for future users who encounter the same issue.

I used the bitsandbytes guide, which actually uses the PEFT example repo. It seems that both guides work, as they reference the same example in the PEFT repo.

I added a print statement here in train.py:

trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=peft_config,
        packing=data_args.packing,
        dataset_kwargs={
            "append_concat_token": data_args.append_concat_token,
            "add_special_tokens": data_args.add_special_tokens,
        },
        dataset_text_field=data_args.dataset_text_field,
        max_seq_length=data_args.max_seq_length,
    )
trainer.accelerator.print(f"{trainer.model}")
trainer.model.print_trainable_parameters()
print(_is_peft_model(trainer.model))

This is the example from the PEFT repo which the BNB guide references.

In my own script I added the following print statement:

trainer = SFTTrainer(
            model = model, 
            tokenizer = tokenizer, 
            args = training_args,
            train_dataset = dataset["train"],
            eval_dataset = dataset["validation"].select(range(4)), #For testing purposes
            max_seq_length = context_length_abstractive_model, #8192
            callbacks = [EarlyStoppingCallback(early_stopping_patience = args.early_stopping_patience)],
            peft_config = lora_config,
            packing= True,
            )

if getattr(trainer.accelerator.state, "fsdp_plugin", None):
       from peft.utils.other import fsdp_auto_wrap_policy
       print('Changing auto wrap policy for FSDP')
       fsdp_plugin = trainer.accelerator.state.fsdp_plugin
       fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(trainer.model)

print(f"PEFT model: {_is_peft_model(trainer.model)}")

This also prints True. I made a small mistake earlier: I checked the status of model instead of trainer.model. model returns False, while trainer.model returns True. So wrapping Llama as a PEFT model does work!

I'm a bit lost as to what goes wrong when saving the model. For some reason it seems that trainer.model isn't recognised as a PEFT model, which is contradictory, as print(f"PEFT model: {_is_peft_model(trainer.model)}") returns True.

This causes the ValueError to be raised here.

MikaSie commented 3 months ago

Here I am again haha!

But I have good news: I found the issue with my training script. The wandb callback causes the ValueError because it ends up with the model set to None, and a None model is of course not a PeftModel.

When I removed the wandb reporting from my training_args and did not set the environment variables, such as:

os.environ["WANDB_PROJECT"] = "PROJECT_NAME"
os.environ["WANDB_LOG_MODEL"] = "end"

Then I am able to save the model!! Let me show you the issue. Here is the full traceback:

Traceback (most recent call last):                                                                                                                                                                                                                                                           
  File "/workspace/Thesis/training.py", line 709, in <module>                                                                                                                                                                                                                                
    trainer.train()                                                                                                                                                                                                                                                                          
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 440, in train                                                                                                                                                                                  
    output = super().train(*args, **kwargs)                                                                                                                                                                                                                                                  
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train                                                                                                                                                                                    
    return inner_training_loop(                                                                                                                                                                                                                                                              
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2427, in _inner_training_loop                                                                                                                                                                     
    self.control = self.callback_handler.on_train_end(args, self.state, self.control)                                                                                                                                                                                                        
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer_callback.py", line 464, in on_train_end                                                                                                                                                                     
    return self.call_event("on_train_end", args, state, control)                                                                                                                                                                                                                             
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer_callback.py", line 508, in call_event                                                                                                                                                                       
    result = getattr(callback, event)(                                                                                                                                                                                                                                                       
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 850, in on_train_end                                                                                                                                                       
    fake_trainer = Trainer(args=args, model=model, tokenizer=tokenizer)                                                                                                                                                                                                                      
  File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer.py", line 476, in __init__                                                                                                                                                                                  
    raise ValueError(                                                                                                                                                                                                                                                                        
ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details       

You can see that trainer_callback.py is called when the model is done training:

File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer_callback.py", line 464, in on_train_end                                                                                                                                                                     
    return self.call_event("on_train_end", args, state, control)

Then, eventually, integration_utils.py is called and a fake_trainer is created here. But as you can see in the code snippet below, model is set to None, because the model isn't given as an argument in call_event("on_train_end", args, state, control). So a new Trainer is instantiated with the model set to None:

def on_train_end(self, args, state, control, model=None, tokenizer=None, **kwargs):
        if self._wandb is None:
            return
        if self._log_model in ("end", "checkpoint") and self._initialized and state.is_world_process_zero:
            from ..trainer import Trainer

            fake_trainer = Trainer(args=args, model=model, tokenizer=tokenizer)

This brings us back to the trainer.py file. Because model is None, the following if-statement is true and the ValueError is raised:

        if _is_quantized_and_base_model and not _is_peft_model(model):
            raise ValueError(
                "You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of"
                " the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft"
                " for more details"
            )

I'm not completely sure, as I haven't debugged the issue yet, but I think that if we just pass the model in self.call_event("on_train_end", args, state, control), this issue won't occur!

This also explains why the guides work: none of them use wandb! It could be a good idea to mention this in the guides, but fixing it would be even better :)! Let me know what you think of this issue and my proposed solution; we could open a PR.

Please also let me know what we should do about the accelerate test bug, as I'm pretty sure I found the issue there as well. I'm not sure what the best solution is, since I don't know whether adding a placeholder _no_split_modules to the regression model would work with the rest of the test cases!

BenjaminBossan commented 3 months ago

I used the bitsandbytes guide, which actually uses the PEFT example repo. It seems that both guides work, as they reference the same example in the PEFT repo.

Oh I see, I didn't know that.

The wandb callback causes the ValueError because it ends up with the model set to None, and a None model is of course not a PeftModel.

Thanks a lot for this detailed investigation. I agree that this is certainly not the intended behavior. I'm not an expert on Trainer, so I cannot say for sure what the best approach to mitigate the error would be. To me, it sounds a bit wild that we create a new "fake" trainer instance just so that wandb can persist the model and do some other stuff. This looks brittle to me.

As you suggest, one fix could be to ensure that the model is passed correctly to the fake trainer instance, so that the checks in Trainer.__init__ pass successfully, but to me this sounds more like patching up an anti-pattern. I'll tentatively ping @LysandreJik, as he reviewed the PR that introduced this.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.