carlosvillu opened this issue 3 weeks ago
I also just ran into this exact same issue. The model I am using is
https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B
I have taken care of applying the proper chat templates. Training ran successfully, but the issue appears during inference.
Same issue here, with the model "llama-3-8b-Instruct-bnb-4bit".
Found a temporary fix by installing a previous version of transformers
I believe 4.38.0 is the min required transformers version for unsloth.
pip install transformers==4.38.0
We should probably file an issue about this with the Hugging Face folks.
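If it helps while the pin is the workaround, here is a small sketch for failing fast when the installed transformers version drifts from the pin. The helper names are mine (hypothetical, not part of unsloth), and the 4.38.0 pin is just the workaround from this thread, not an official compatibility statement:

```python
# Sketch: fail fast if the installed transformers version drifts from the pin.
# Helper names are hypothetical; 4.38.0 is the workaround pin from this thread.

def parse_version(version: str) -> tuple:
    """Turn a release string like '4.38.0' into (4, 38, 0) for comparison.
    Pre-release suffixes (e.g. '4.38.0rc1') are not handled here."""
    return tuple(int(part) for part in version.split(".")[:3])

def matches_pin(installed: str, pinned: str = "4.38.0") -> bool:
    """True if the installed version is exactly the pinned release."""
    return parse_version(installed) == parse_version(pinned)
```

At startup you could then check `matches_pin(transformers.__version__)` before loading the model, and bail out with a clear message instead of hitting the cache KeyError mid-inference.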
That fix is working for me - thanks!
Hi @kmahorker,
Your proposal works for me too.
I will open an issue in the transformers repository. Maybe that can help us.
Thanks :)
Hey everyone - apologies for the horribly late reply - my bro and I both relocated to SF recently, so I'm just getting back to GitHub issues!!
Ok interesting - I tried Phi and Llama on Colab and they work fine - is this for inference only (i.e. after training you save the model, then run inference on it)? I shall investigate!
Hey @danielhanchen, just to let you know the bug is gone in transformers==4.41.2. That might help narrow down the bug, as I saw a push relating to cache handling in 4.42.1.
Neither transformers 4.38.0 nor 4.41.2 works with unsloth/tinyllama-bnb-4bit. I am trying to use it for inference on a T4 and get: KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
!pip install -U --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
!pip install -U "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -U torch==2.3.0 datasets transformers[torch]==4.41.2
Hey @danielhanchen, I am running on an A100 and getting the same error during inference.
I am using PyTorch 2.2.0 with CUDA 12.1: KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
I tried the setup below:

%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass
I tried different transformers versions and tried without flash-attn, but the result is the same. Quick note: it works on the free Colab tier.
I am using it to fine-tune cognitivecomputations/dolphin-2.9.3-llama-3-8b on the Alpaca dataset. I need help fixing this urgently.
I'll try my best to solve this! Apologies for the issues!
FYI, I managed to get it to work with transformers 4.41.2. The libraries were not reloaded after the version change.
Hi @usatenko. Can you explain further? I am using a Jupyter notebook on RunPod. If I understand correctly, you are asking me to restart the kernel, which I did. Can you elaborate?
@danielhanchen thanks for the quick response, please lmk when it's fixed
I also ran it outside of Colab - in a Hugging Face Space with Python 3.9 on an NVIDIA T4. The final set of libs I used:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
!pip install torch datasets transformers[torch]==4.41.2
The only thing I did was fully remove the packages, install them from scratch, and restart the kernel. Check that the order of the pip commands matches my snippet - it may be important.
I am a bit constrained: I have Python 3.10 and access to Ampere GPUs like the A6000, A400, H100, and A100 (both PCIe and SXM versions), and my CUDA is 12.1. I tried a T4 GPU and it works there, but not on the Ampere ones.
so, on Ampere you still get "Cache only has 0 layers"?
Yes! I tried a bunch of methods but am still getting it. Do you have a workaround?
Unfortunately no - I do not use this hardware. Make sure you loaded the proper version (not the latest one) of the transformers lib:
import transformers
print(transformers.__version__)
Hey everyone! It should function finally! Please update Unsloth via (if you're on a local machine - Colab / Kaggle no need to update, just refresh)
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
I'm going to assume most of you either used the new transformers version or used the nightly branch of Unsloth? :)
Anyways, so sorry about the delay!
Hi @danielhanchen! It seems that the latest update makes the model's output unpredictable. The following is my implementation using gemma-2b:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = args.model_checkpoint, # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 1024,
    dtype = None,
    load_in_4bit = False,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
tokenizer.padding_side = "left"
model.eval()

for prompts in test_data:
    input_prompts = tokenizer(prompts, padding=True, truncation=False, return_tensors='pt')
    input_ids = input_prompts['input_ids'].to('cuda')
    attention_mask = input_prompts['attention_mask'].to('cuda')
    output_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=512, do_sample=False, use_cache=True)
    output_texts = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
I tried it on both the fine-tuned checkpoint and the original model, but both give unpredictable results, while the output before updating was normal.
Does it not support batched generation?
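For what it's worth, the reason `tokenizer.padding_side = "left"` matters for batched generation with decoder-only models is that generation continues from each row's last position - if padding sits on the right, the model continues from pad tokens. A toy sketch (pure Python, no tokenizer involved) of what left padding plus the matching attention mask looks like:

```python
def left_pad(batch, pad_id=0):
    """Left-pad variable-length token-id lists to a common width and build
    the matching attention masks (0 over padding, 1 over real tokens), so
    the last position of every row is a real token - the position that
    decoder-only generation continues from."""
    width = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        pad = [pad_id] * (width - len(seq))
        input_ids.append(pad + list(seq))
        attention_mask.append([0] * len(pad) + [1] * len(seq))
    return input_ids, attention_mask
```

For example, `left_pad([[5, 6, 7], [8]])` returns `([[5, 6, 7], [0, 0, 8]], [[1, 1, 1], [0, 0, 1]])` - the short row is padded on the left and masked out there.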
Hey @danielhanchen, thanks for the quick update. It worked for me.
@ChenKy23 Weird I'll investigate batched inference
Should the transformers version remain old, or should it work with the new one? It still does not work with:
#%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
!pip install torch==2.3.0 datasets transformers[torch]==4.42.3 wandb
@usatenko New transformers - are you certain? You first need to uninstall Unsloth and then reinstall it, since pip sometimes refuses to upgrade it.
The issue still persists on Colab.
I fully rebuilt the environment, so it was clean and needed no uninstall
@M4NIACK Did you run https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing - does that Colab work?
You also must call FastLanguageModel.for_inference(model) before doing inference.
@danielhanchen
Is there a way around this requirement, even if it means slower inference? I'm using the model in an evaluation loop and need to continue training after generation. Alternatively, is there a way to revert it to training mode after calling for_inference?
I found that after adding FastLanguageModel.for_inference(model), the 'Cache only has 0 layers, attempted to access layer with index 0' issue is just gone - like magic!
Oh forgot to mention you MUST use FastLanguageModel.for_inference(model)
@jonberliner You can use model.for_training afterwards
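For the evaluation-loop case, one option - a sketch assuming, as Daniel describes, that `FastLanguageModel.for_inference(model)` and `model.for_training()` are cheap in-place toggles - is to wrap the switch in a context manager so you can't forget to flip back. The helper below is deliberately generic (you pass the two toggle callables in), so the pattern itself runs without a GPU or unsloth installed:

```python
from contextlib import contextmanager

@contextmanager
def eval_then_resume(model, to_inference, to_training):
    """Switch `model` into inference mode for the duration of the block,
    then switch it back to training mode - even if generation raises."""
    to_inference(model)
    try:
        yield model
    finally:
        to_training(model)
```

With Unsloth this would be wired up roughly as `with eval_then_resume(model, FastLanguageModel.for_inference, lambda m: m.for_training()): ...generate...` (hypothetical wiring - adjust to the actual API on your version).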
Gotcha :D
I'm encountering a KeyError when trying to train Phi-3 using the unsloth library. The error occurs during the generation step with model.generate. Below are the details of the code and the error traceback.
Steps to Reproduce:
Error Traceback:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[11], line 21
     11 messages = [
     12     {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
     13 ]
     14 inputs = tokenizer.apply_chat_template(
     15     messages,
     16     tokenize = True,
     17     add_generation_prompt = True, # Must add for generation
     18     return_tensors = "pt",
     19 ).to("cuda")
---> 21 outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = False)
     22 tokenizer.batch_decode(outputs)

File ~/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator...

Environment:
Additional Information: The error seems to be related to the dynamic cache handling within the transformers library. The model is trying to access a layer index in the cache that doesn't exist.
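To make that failure mode concrete, here is a toy stand-in (NOT the real transformers DynamicCache - just an illustration) showing how a cache that starts with zero layers raises exactly the kind of KeyError from the traceback when code indexes a layer before anything has been written to it:

```python
class MiniDynamicCache:
    """Toy illustration of the failure mode - not transformers' DynamicCache.
    The cache starts empty, so reading layer 0 before anything has been
    written raises a KeyError like the one in the traceback above."""

    def __init__(self):
        self._layers = {}

    def __len__(self):
        return len(self._layers)

    def __getitem__(self, layer_idx):
        if layer_idx not in self._layers:
            raise KeyError(
                f"Cache only has {len(self._layers)} layers, "
                f"attempted to access layer with index {layer_idx}"
            )
        return self._layers[layer_idx]

    def update(self, layer_idx, key_value):
        # The real class appends/concatenates keys and values per layer;
        # a plain assignment is enough for this illustration.
        self._layers[layer_idx] = key_value
```

`MiniDynamicCache()[0]` raises immediately, while a read after `update(0, ...)` succeeds - which is presumably why a mismatch between unsloth's patched generate path and newer transformers cache handling (one side reading before the other has populated) produces this error.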
Expected Behavior: The model should generate the continuation of the Fibonacci sequence without encountering a KeyError.