ccdv-ai opened this issue 3 months ago
@danielhanchen OK, this is something with SentencePiece. Some models are missing the tokenizer's `.model` file, so the fast tokenizer can be loaded but not the slow one, and there is no easy way to recover the file.
Because of this, the `load_correct_tokenizer` function fails; a quick probe of the asymmetry is sketched below.
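A minimal way to see the fast/slow asymmetry described above, assuming any repo that ships `tokenizer.json` but no `tokenizer.model` (the repo name here is a placeholder, not a specific model from this report):

```python
from transformers import AutoTokenizer

repo = "some-org/fast-only-model"  # placeholder: ships tokenizer.json but no tokenizer.model

fast_tok = AutoTokenizer.from_pretrained(repo, use_fast=True)   # loads fine from tokenizer.json
try:
    slow_tok = AutoTokenizer.from_pretrained(repo, use_fast=False)
except Exception as e:
    # The slow (SentencePiece) tokenizer needs tokenizer.model, which is absent
    print("slow tokenizer failed:", e)
```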
Possible fix:
@ccdv-ai Yes! Was working on a fix, but sadly startup life is all-consuming :( Will get back to this! :)
@ccdv-ai Fixed! Local machines will need updating via

```bash
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
```

Colab / Kaggle should be fine.
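To confirm the reinstall took effect, one can print the installed version (a sketch using only the standard library, assuming the package is installed under the name `unsloth` as in the pip command above):

```python
from importlib.metadata import version

print(version("unsloth"))  # should show a build newer than the one that failed
```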
@danielhanchen Almost!
```python
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)
```
```
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-5-2a9309eef342> in <cell line: 3>()
      1 from unsloth.chat_templates import get_chat_template
      2 
----> 3 tokenizer = get_chat_template(
      4     tokenizer,
      5     chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth

1 frames
/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in get_chat_template(tokenizer, chat_template, mapping, map_eos_token)
    377         # Must fix the sentence piece tokenizer since there's no tokenizer.model file!
    378         token_mapping = { old_eos_token : stop_word, }
--> 379         tokenizer = fix_sentencepiece_tokenizer(tokenizer, new_tokenizer, token_mapping,)
    380     pass
    381 

/usr/local/lib/python3.10/dist-packages/unsloth/tokenizer_utils.py in fix_sentencepiece_tokenizer(old_tokenizer, new_tokenizer, token_mapping, temporary_location)
    220 
    221     tokenizer_file = sentencepiece_model_pb2.ModelProto()
--> 222     tokenizer_file.ParseFromString(open(f"{temporary_location}/tokenizer.model", "rb").read())
    223 
    224     # Now save the new tokenizer

FileNotFoundError: [Errno 2] No such file or directory: '_unsloth_sentencepiece_temp/tokenizer.model'
```
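The root cause is visible in the last frame: `fix_sentencepiece_tokenizer` unconditionally parses `{temporary_location}/tokenizer.model`, which is never written when the model ships only a fast tokenizer. A minimal sketch of a guard, using the signature from the traceback (the early-return fallback is an assumption for illustration, not unsloth's actual fix):

```python
import os
from sentencepiece import sentencepiece_model_pb2

def fix_sentencepiece_tokenizer(old_tokenizer, new_tokenizer, token_mapping,
                                temporary_location = "_unsloth_sentencepiece_temp"):
    model_path = f"{temporary_location}/tokenizer.model"
    if not os.path.isfile(model_path):
        # Fast-only tokenizer: there is no SentencePiece model to patch,
        # so return the new tokenizer unchanged instead of crashing
        return new_tokenizer

    tokenizer_file = sentencepiece_model_pb2.ModelProto()
    with open(model_path, "rb") as f:
        tokenizer_file.ParseFromString(f.read())
    # ... remap tokens per token_mapping and save, as in the original function
    return new_tokenizer
```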
@ccdv-ai Oh no, OK, I'll check again! Sorry!
@danielhanchen It looks like this issue also happens with llama-3. The conversational notebooks currently cannot be run if the tokenizer is BPE/SentencePiece.
```
--> 222     tokenizer_file.ParseFromString(open(f"{temporary_location}/tokenizer.model", "rb").read())
```
@ccdv-ai Yes, sadly - I've been stuck on llama-3, so yes, it is an issue :(( Extreme apologies for the horrible delay.
@ccdv-ai Maybe I fixed this in the latest bug fix - could you check?
@danielhanchen Yes, looks like it's fixed! Thank you!
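For anyone landing here later: once the fix is installed, the snippet above runs, and the ShareGPT-style mapping lets `apply_chat_template` consume `from`/`value` records directly. A small sketch with a toy conversation (the conversation itself is illustrative):

```python
# Toy ShareGPT-format conversation; the keys match the mapping passed to get_chat_template
convo = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt",   "value": "Hi there!"},
]

text = tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = True)
print(text)  # formatted ChatML-style string, with the EOS token handled per map_eos_token
```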
Trying to run the Colab notebook using a small model:
The model is loaded, but it fails to load the tokenizer: