unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Load Unsloth-FT-Merged-Model with AutoModel Attribute Error #896

Open carstendraschner opened 1 month ago

carstendraschner commented 1 month ago

Hello :)

We used the default Unsloth Colab pipeline to fine-tune a Llama 3.1 8B model and replicated it as a notebook in an Azure environment: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=QmUBVEnvCDJv The fine-tuning worked and was verified via inference using FastLanguageModel. We then merged the model to 16-bit, stored it locally, and tried to load and run it with the plain AutoModel classes.
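
For context, the merge step followed the Unsloth saving API, roughly as in the sketch below (the output directory name is a placeholder and the exact arguments in our notebook may differ slightly):

# `model` and `tokenizer` are the objects produced by the Unsloth fine-tuning run.
# Merge the LoRA adapters into the base weights and save a 16-bit checkpoint
# that plain transformers should be able to load on its own.
model.save_pretrained_merged(
    "merged_model",              # placeholder for the local output directory
    tokenizer,
    save_method = "merged_16bit",
)

Loading and running the merged checkpoint then looked like this: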

from transformers import AutoModelForCausalLM, AutoTokenizer

# model_path points to the locally stored 16-bit merged checkpoint;
# `template` is the prompt template used during fine-tuning (not shown here).
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
model_inputs = tokenizer(
    template.format(
        "",
        "Who is A. Dumbledore?",
        "",
    ),
    return_tensors="pt",
).to("cuda")
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

But we get: AttributeError: 'LlamaForCausalLM' object has no attribute 'max_seq_length'. Could you help? Full stack trace:


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[46], line 1
----> 1 generated_ids = model.generate(**model_inputs)
      2 tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

File ~/code/genai-ml/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/code/genai-ml/.venv/lib/python3.10/site-packages/transformers/generation/utils.py:2024, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   2016     input_ids, model_kwargs = self._expand_inputs_for_generation(
   2017         input_ids=input_ids,
   2018         expand_size=generation_config.num_return_sequences,
   2019         is_encoder_decoder=self.config.is_encoder_decoder,
   2020         **model_kwargs,
   2021     )
   2023     # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2024     result = self._sample(
   2025         input_ids,
   2026         logits_processor=prepared_logits_processor,
   2027         logits_warper=prepared_logits_warper,
   2028         stopping_criteria=prepared_stopping_criteria,
   2029         generation_config=generation_config,
   2030         synced_gpus=synced_gpus,
   2031         streamer=streamer,
   2032         **model_kwargs,
   2033     )
   2035 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
   2036     # 11. prepare logits warper
   2037     prepared_logits_warper = (
   2038         self._get_logits_warper(generation_config, device=input_ids.device)
   2039         if generation_config.do_sample
   2040         else None
   2041     )

File ~/code/genai-ml/.venv/lib/python3.10/site-packages/transformers/generation/utils.py:2982, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, logits_warper, **model_kwargs)
   2979 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
   2981 # forward pass to get next token
-> 2982 outputs = self(**model_inputs, return_dict=True)
   2984 if synced_gpus and this_peer_finished:
   2985     continue  # don't waste resources running the code we don't need

File ~/code/genai-ml/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/code/genai-ml/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/code/genai-ml/.venv/lib/python3.10/site-packages/unsloth/models/llama.py:864, in CausalLM_fast_forward.<locals>._CausalLM_fast_forward(self, input_ids, causal_mask, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, *args, **kwargs)
    847 def _CausalLM_fast_forward(
    848     self,
    849     input_ids: torch.LongTensor = None,
   (...)
    860     *args, **kwargs,
    861 ) -> Union[Tuple, CausalLMOutputWithPast]:
    863     if past_key_values is not None:
--> 864         outputs = fast_forward_inference(
    865             self,
    866             input_ids,
    867             past_key_values,
    868             position_ids = position_ids,
    869             attention_mask = attention_mask,
    870         )
    871     else:
    872         causal_mask = xformers.attn_bias.LowerTriangularMask()

File ~/code/genai-ml/.venv/lib/python3.10/site-packages/unsloth/models/llama.py:797, in LlamaModel_fast_forward_inference(self, input_ids, past_key_values, position_ids, attention_mask)
    790 def LlamaModel_fast_forward_inference(
    791     self,
    792     input_ids,
   (...)
    795     attention_mask = None,
    796 ):
--> 797     input_ids = input_ids[:,:self.max_seq_length]
    798     hidden_states = self.model.embed_tokens(input_ids)
    799     hidden_states = hidden_states.to(self.config.torch_dtype)

File ~/code/genai-ml/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1709, in Module.__getattr__(self, name)
   1707     if name in modules:
   1708         return modules[name]
-> 1709 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'LlamaForCausalLM' object has no attribute 'max_seq_length'
carstendraschner commented 1 month ago

This is also reproducible with the default Unsloth Colab notebook, which contains this step:

# I highly do NOT suggest - use Unsloth if possible
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    load_in_4bit = load_in_4bit,
    low_cpu_mem_usage = True
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")

If I add the code from above to generate output:

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

I also get:

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is a famous tall tower in Paris?

### Input:

### Response:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-6090b4ce7ab2> in <cell line: 12>()
     10 from transformers import TextStreamer
     11 text_streamer = TextStreamer(tokenizer)
---> 12 _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

9 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in __getattr__(self, name)
   1707             if name in modules:
   1708                 return modules[name]
-> 1709         raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
   1710 
   1711     def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:

AttributeError: 'LlamaForCausalLM' object has no attribute 'max_seq_length'

This might help you to reproduce, even though it is based on the LoRA adapter rather than the merged model. Kindest regards

carstendraschner commented 1 month ago

I was also wondering why:

File ~/code/genai-ml/.venv/lib/python3.10/site-packages/unsloth/models/llama.py:864, in CausalLM_fast_forward.<locals>._CausalLM_fast_forward(self, input_ids, causal_mask, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, *args, **kwargs)

appears in the initial stack trace, since it should not be calling into Unsloth at all, right?
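
(My guess, not yet confirmed: once Unsloth has been imported or used anywhere in the process, it patches the transformers Llama classes in place, so even a model loaded via AutoModelForCausalLM ends up on Unsloth's fast-forward path. A quick, hypothetical way to check whether such a patch is active:)

from transformers.models.llama.modeling_llama import LlamaForCausalLM

# Inspect which module currently provides the class-level forward method.
# If it points into unsloth.models.llama rather than
# transformers.models.llama.modeling_llama, the patch is active in this process.
print(LlamaForCausalLM.forward.__module__)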

carstendraschner commented 1 month ago

Interesting: when I use the stored files in an environment where Unsloth is not installed, model loading and inference work:

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    awq_model_path, device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(awq_model_path, padding_side="left")
model_inputs = tokenizer("""
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

<|eot_id|><|start_header_id|>user<|end_header_id|>
Who is A. Dumbledore?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
""", return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]
danielhanchen commented 1 month ago

Oh wait - when you load a merged model locally with plain transformers, do not call or import Unsloth anywhere in that code! It patches over everything, which causes these issues.
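
If Unsloth does have to stay in the same process, one option (a sketch of the usual Unsloth loading pattern, not necessarily the only fix) is to load the merged folder through FastLanguageModel instead of AutoModelForCausalLM, so the patched code paths get the attributes they expect:

from unsloth import FastLanguageModel

# Load the locally merged 16-bit checkpoint through Unsloth itself;
# `model_path` is the same local directory used above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast generation path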

carstendraschner commented 1 month ago

Alright, thank you very much @danielhanchen for this feedback. This also matches my experience. Maybe this needs to be clarified in the tutorial Colab notebooks, where it currently looks acceptable to still use AutoModel after importing and using Unsloth.

Overall, thanks for the quick reply and for building Unsloth in general ;) Regards, Carsten

danielhanchen commented 1 month ago

Good point on that!