beam search does not work for gemma2b

world2vec commented 2 months ago

Env: torch 2.4, CUDA 12.4, unsloth main. The code below raises an error:

from unsloth import FastLanguageModel
import torch

model_id="unsloth/gemma-2-2b-it-bnb-4bit"
model, tokenizer = FastLanguageModel.from_pretrained(model_id, dtype=torch.float16, use_cache=False,
                                                         max_seq_length=1024, load_in_4bit=True)
FastLanguageModel.for_inference(model)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, num_beams=2, max_new_tokens=10)

Error:

NotImplementedError: Make sure that a `_reorder_cache` function is correctly implemented in transformers.models.gemma2.modeling_gemma2 to enable beam search for <class 'transformers.models.gemma2.modeling_gemma2.Gemma2ForCausalLM'>
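
For context, beam search keeps several candidate sequences per prompt and, after each step, reshuffles which candidates survive. The key/value cache has to be reshuffled the same way, which is the job of the `_reorder_cache` hook named in the error. The sketch below illustrates the idea with plain Python lists instead of tensors. `reorder_cache`, `LayerCache`, and the toy data are hypothetical names chosen for illustration; this is not the transformers or Unsloth implementation.

```python
from typing import List, Tuple

# Toy stand-in for a KV cache: one (keys, values) pair per layer,
# where keys and values each hold one entry per beam (the batch dimension).
LayerCache = Tuple[List[str], List[str]]


def reorder_cache(past: List[LayerCache], beam_idx: List[int]) -> List[LayerCache]:
    """For every layer, pick the cache rows of the beams that survived."""
    return [
        ([keys[i] for i in beam_idx], [values[i] for i in beam_idx])
        for keys, values in past
    ]


# Two layers, three beams. Suppose old beam 0 was dropped and old beam 2
# was kept twice, so new beams 0, 1, 2 continue old beams 2, 1, 2.
past = [
    (["k0_a", "k0_b", "k0_c"], ["v0_a", "v0_b", "v0_c"]),
    (["k1_a", "k1_b", "k1_c"], ["v1_a", "v1_b", "v1_c"]),
]
reordered = reorder_cache(past, beam_idx=[2, 1, 2])
print(reordered[0][0])  # ['k0_c', 'k0_b', 'k0_c']
```

With real tensors the same selection is usually an index-select along the batch dimension. If a model class or its patched cache does not provide this step, beam search has no way to keep the cache aligned with the surviving beams.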

With the Hugging Face code there is no error:

from unsloth import FastLanguageModel
import torch

model_id="unsloth/gemma-2-2b-it-bnb-4bit"
model, tokenizer = FastLanguageModel.from_pretrained(model_id, dtype=torch.float16, use_cache=False,
                                                         max_seq_length=1024, load_in_4bit=True)
FastLanguageModel.for_inference(model)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, num_beams=2, max_new_tokens=10)
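
Note that the snippet above, as posted, is identical to the failing Unsloth example, so the actual Hugging Face code is not shown. A minimal sketch of what a plain transformers equivalent might look like follows. It assumes that `bitsandbytes` and `accelerate` are installed and that the checkpoint ships its own 4-bit quantization config. The names `generate_with_transformers`, `MODEL_ID`, and `GENERATE_KWARGS` are hypothetical, and this is not the reporter's code and has not been verified here.

```python
MODEL_ID = "unsloth/gemma-2-2b-it-bnb-4bit"
# Same generation settings as in the report: beam search with two beams.
GENERATE_KWARGS = {"num_beams": 2, "max_new_tokens": 10}


def generate_with_transformers(prompt: str, model_id: str = MODEL_ID) -> str:
    """Load the model through transformers only (no Unsloth) and run beam search."""
    # Imported lazily so this module can be loaded without the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Assumption: the quantization settings are read from the checkpoint config.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, **GENERATE_KWARGS)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Example call (downloads the model and needs a CUDA device):
# print(generate_with_transformers("Write me a poem about Machine Learning."))
```

Running both variants with the same `GENERATE_KWARGS` isolates whether the failure comes from the Unsloth inference patches or from transformers itself.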

danielhanchen commented 2 months ago

Will check this out!

practicingman commented 2 months ago

This happens with unsloth/Meta-Llama-3.1-8B too. When I add use_cache=False to model.generate, it raises:

RuntimeError: The size of tensor a (32) must match the size of tensor b (1300) at non-singleton dimension 1

danielhanchen commented 2 months ago

Hmm ok will reinvestigate!

anderleich commented 1 week ago

Any news on this? I get the same error.

shimmyshimmer commented 5 days ago

Still investigating this!

unslothai / unsloth #923