unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Cache only has 0 layers, attempted to access layer with index 0 #702

Open carlosvillu opened 3 weeks ago

carlosvillu commented 3 weeks ago

I'm encountering a KeyError when trying to train Phi-3 using the unsloth library. The error occurs during the generation step with model.generate. Below are the details of the code and the error traceback.

Steps to Reproduce:

  1. Run the following code:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = False)
tokenizer.batch_decode(outputs)
Error Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[11], line 21
---> 21 outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = False)
     22 tokenizer.batch_decode(outputs)

File ~/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/unsloth/models/llama.py:988, in _wrap_fast_inference.._fast_generate(*args, **kwargs)
--> 988 output = generate(*args, **kwargs)

File ~/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/peft/peft_model.py:1491, in PeftModelForCausalLM.generate(self, *args, **kwargs)
-> 1491 outputs = self.base_model.generate(*args, **kwargs)

File ~/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/transformers/generation/utils.py:1914, in GenerationMixin.generate(...)
-> 1914 result = self._sample(input_ids, logits_processor=prepared_logits_processor, ..., **model_kwargs)

File ~/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/transformers/generation/utils.py:2651, in GenerationMixin._sample(...)
-> 2651 outputs = self(**model_inputs, return_dict=True, output_attentions=output_attentions, output_hidden_states=output_hidden_states)

File ~/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/accelerate/hooks.py:166, in add_hook_to_module..new_forward(module, *args, **kwargs)
--> 166 output = module._old_forward(*args, **kwargs)

File ~/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/unsloth/models/mistral.py:205, in MistralForCausalLM_fast_forward(...)
--> 205 outputs = LlamaModel_fast_forward_inference(self, input_ids, past_key_values, position_ids = position_ids, attention_mask = attention_mask)

File ~/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/unsloth/models/llama.py:733, in LlamaModel_fast_forward_inference(self, input_ids, past_key_values, position_ids, attention_mask)
--> 733 seq_len = past_key_values[0][0].shape[-2]

File ~/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/transformers/cache_utils.py:314, in DynamicCache.__getitem__(self, layer_idx)
--> 314 raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")

KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'

Environment:

Additional Information: The error seems to be related to the dynamic cache handling within the transformers library. The model is trying to access a layer index in the cache that doesn't exist.
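
For reference, the failure can be reproduced in isolation with a freshly created cache; a minimal sketch, assuming transformers ~4.41 where DynamicCache lives in transformers.cache_utils (this only illustrates the cache behaviour in the traceback, not the unsloth code path):

# Minimal sketch of the cache behaviour behind the error (assumes transformers ~4.41,
# where DynamicCache is in transformers.cache_utils); illustration only, not unsloth code.
from transformers.cache_utils import DynamicCache

cache = DynamicCache()        # a freshly created cache has no layers yet
print(len(cache))             # 0
try:
    key, value = cache[0]     # same access pattern as past_key_values[0][0] in the traceback
except KeyError as err:
    print(err)                # Cache only has 0 layers, attempted to access layer with index 0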

Expected Behavior: The model should generate the continuation of the Fibonacci sequence without encountering a KeyError.

AliButtar commented 3 weeks ago

I also just ran into this exact same issue. The model I am using is

https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B

I have taken care of applying the proper chat templates. Training ran successfully, but this issue comes up during inference.

jocastrocUnal commented 3 weeks ago

Same issue here, with the model "llama-3-8b-Instruct-bnb-4bit".

kmahorker commented 2 weeks ago

I found a temporary fix: install a previous version of transformers. I believe 4.38.0 is the minimum transformers version required by unsloth.

pip install transformers==4.38.0

We should probably file an issue about this with the Hugging Face folks.

yuki-2025 commented 2 weeks ago

@kmahorker's transformers==4.38.0 pin is working for me, thanks!

carlosvillu commented 2 weeks ago

Hi @kmahorker,

Your proposal works for me too.

I will open an issue in the transformers repository. Maybe that can help us.

Thanks :)

danielhanchen commented 2 weeks ago

Hey everyone - many apologies for the horribly late reply - my bro and I both relocated to SF recently, so I just got back to GitHub issues!!

Ok interesting - I tried Colab and Phi and Llama work fine - is this for inference only (i.e. after training you save it, then you do inference on it)? I shall investigate!

vTuanpham commented 2 weeks ago

Hey @danielhanchen, just to let you know the bug is gone in transformers==4.41.2. That might help narrow it down, as I saw a push relating to the cache in 4.42.1.

usatenko commented 2 weeks ago

Neither transformers 4.38.0 nor 4.41.2 works with unsloth/tinyllama-bnb-4bit. I am trying to use it for inference on a T4 and still get KeyError: 'Cache only has 0 layers, attempted to access layer with index 0' with:

!pip install -U --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
!pip install -U "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -U torch==2.3.0 datasets transformers[torch]==4.41.2

AashishKumar-3002 commented 2 weeks ago

Hey @danielhanchen, I am running on an A100 and I am getting the same error during inference:

I am using PyTorch 2.2.0 with CUDA 12.1: KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'

I tried the setup below:

%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

I tried different transformers versions and tried without flash-attn, but the result is the same. Also, a quick note: it works on the free Colab version.

I am using it to fine-tune cognitivecomputations/dolphin-2.9.3-llama-3-8b on the Alpaca dataset. I need help fixing this urgently.

danielhanchen commented 2 weeks ago

I'll try my best to solve this! Much apologies on the issues!

usatenko commented 2 weeks ago

FYI, I managed to get it to work with transformers 4.41.2; the libraries had not been reloaded after the version change.

AashishKumar-3002 commented 2 weeks ago

Hi @usatenko, can you explain further? I am using a Jupyter notebook on RunPod. If I understand correctly, you are suggesting restarting the kernel, which I already did.

AashishKumar-3002 commented 2 weeks ago

@danielhanchen thanks for the quick response, please lmk when it's fixed

usatenko commented 2 weeks ago

I also run it outside of Colab: in Hugging Face Spaces with Python 3.9 on an NVIDIA T4. The final set of libraries I used is:

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
!pip install torch datasets transformers[torch]==4.41.2

The only thing I did was fully remove the packages, install them from scratch, and restart the kernel. Check that the order of the pip commands matches my snippet; it may be important.
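
If it helps, a rough sketch of the full sequence, under the assumption that you are in a notebook and can restart the kernel afterwards (adjust versions to your environment):

# Rough sketch of a clean reinstall in the order above (adjust versions to your setup;
# restart the kernel afterwards so the re-installed packages are actually imported).
!pip uninstall -y unsloth transformers xformers trl peft accelerate bitsandbytes
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
!pip install torch datasets "transformers[torch]==4.41.2"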

AashishKumar-3002 commented 2 weeks ago

I am a bit more constrained: I have Python 3.10 and access to Ampere-class GPUs like the A6000, A400, H100, and A100 (PCIe and SXM versions), and my CUDA is 12.1. I tried a T4 GPU and it works there, but not on the Ampere ones.

usatenko commented 2 weeks ago

so, on Ampere you still get "Cache only has 0 layers"?

AashishKumar-3002 commented 2 weeks ago

Yes! I tried a bunch of methods but am still getting it. Do you have a workaround?

usatenko commented 2 weeks ago

Unfortunately no, I do not use this hardware. Make sure you loaded the proper version (not the latest one) of the transformers library:

import transformers
print(transformers.__version__)

danielhanchen commented 2 weeks ago

Hey everyone! It should finally work! Please update Unsloth via the commands below (if you're on a local machine; on Colab / Kaggle there's no need to update, just refresh):

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

I'm going to assume most of you either used the new transformers version or the nightly branch of Unsloth? :)

Anyways so sorry on the delay!

ChenKy23 commented 2 weeks ago

Hi @danielhanchen! It seems that the latest update makes the model's output unpredictable. The following is my implementation using gemma-2b:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = args.model_checkpoint, # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 1024,
    dtype = None,
    load_in_4bit = False,
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

tokenizer.padding_side = "left"

model.eval()

for prompts in test_data:
    input_prompts = tokenizer(prompts, padding=True, truncation=False, return_tensors='pt')
    input_ids = input_prompts['input_ids'].to('cuda')
    attention_mask = input_prompts['attention_mask'].to('cuda')
    output_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=512, do_sample = False, use_cache = True)
    output_texts = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

I tried it on both the fine-tuned checkpoint and the original model, but it gives me some unpredictable results, while the output before updating is normal.

It appears that it does not support batch generation?
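
For what it's worth, one way to narrow this down is to compare a prompt decoded alone against the same prompt decoded inside a padded batch with greedy settings; a rough sketch, assuming the setup above and that prompts is a flat list of prompt strings:

# Rough sanity check (assumes model/tokenizer configured as above and `prompts` is a
# list of prompt strings): greedy decoding of one prompt alone vs. inside a padded batch
# should produce the same text if batched inference is handled correctly.
import torch

def greedy_decode(texts):
    enc = tokenizer(texts, padding=True, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(**enc, max_new_tokens=64, do_sample=False, use_cache=True)
    return tokenizer.batch_decode(out, skip_special_tokens=True)

alone   = greedy_decode([prompts[0]])[0]
batched = greedy_decode(prompts)[0]
print(alone == batched)   # False points at a batched-inference / padding problem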

AashishKumar-3002 commented 2 weeks ago

Hey @danielhanchen, thanks for the quick update. It worked for me.

danielhanchen commented 2 weeks ago

@ChenKy23 Weird, I'll investigate batched inference.

usatenko commented 2 weeks ago

Should the transformers version remain pinned to the old one, or should it work with the new one as well? It still does not work with:

#%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
!pip install torch==2.3.0 datasets transformers[torch]==4.42.3 wandb

danielhanchen commented 1 week ago

@usatenko With the new transformers - are you certain? You first need to uninstall Unsloth and then install it again, since pip sometimes doesn't want to reinstall it.

M4NIACK commented 1 week ago

The issue still persists on Colab.

usatenko commented 1 week ago

I fully rebuilt the environment, so it was clean and needed no uninstall.

danielhanchen commented 1 week ago

@M4NIACK Did you run https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing - does that Colab work?

You also must call FastLanguageModel.for_inference(model) before doing inference

jonberliner commented 1 week ago

@danielhanchen

Is there a way around this requirement, even if it means slower inference? I'm using the model in an evaluation loop and need to continue training after generation. Alternatively, is there a way to revert it back to training mode after calling for_inference?

sword-ace commented 1 week ago

I found that after adding FastLanguageModel.for_inference(model), the 'Cache only has 0 layers, attempted to access layer with index 0' issue is just gone, like magic!

danielhanchen commented 1 week ago

Oh forgot to mention you MUST use FastLanguageModel.for_inference(model)

@jonberliner You can use model.for_training afterwards
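
For the eval-loop case, the toggle could look roughly like this; a sketch that assumes FastLanguageModel.for_training(model) as the counterpart to for_inference, and that model, input_ids, and trainer come from your own training code:

# Rough sketch of toggling between inference and training mode inside an eval loop
# (assumes FastLanguageModel.for_training(model) as the counterpart to for_inference;
# `model`, `input_ids`, and `trainer` are taken from the surrounding training code).
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)    # fast generation path for evaluation
outputs = model.generate(input_ids = input_ids, max_new_tokens = 64, use_cache = True)

FastLanguageModel.for_training(model)     # switch back before resuming training
# trainer.train(resume_from_checkpoint = True)  # hypothetical: continue training here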

sword-ace commented 1 week ago

Gotcha :D
