spgerlach opened 2 days ago
Hi, I'd love to help solve this if possible. I'm new to fixing issues, but here's what I found:
One note: the Llama tokenizer script uses `SPIECE_UNDERLINE` to handle spaces, which doesn't apply to Llama 3.1 since its tokenizer doesn't use that token. I wasn't sure how to fix this but wanted to point it out (should this be a separate issue?).
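For context, here is a minimal sketch of what that meta symbol does in SentencePiece-style tokenizers (the helper names are illustrative, not actual transformers internals). Llama 3.1's byte-level vocabulary keeps real space bytes inside the tokens, so there is no equivalent replacement step for it:

```python
SPIECE_UNDERLINE = "▁"  # U+2581, the SentencePiece whitespace meta symbol

def sp_pretokenize(text: str) -> str:
    """Mimic SentencePiece space handling: spaces become the meta symbol."""
    return text.replace(" ", SPIECE_UNDERLINE)

def sp_detokenize(pieces: list[str]) -> str:
    """Join pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(SPIECE_UNDERLINE, " ")

print(sp_pretokenize("Hello world"))        # Hello▁world
print(sp_detokenize(["▁Hello", "▁world"]))  # " Hello world" (leading space)
```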
It seems that, in order to use Llama 3.1 with `LlamaTokenizer`, the `tokenizer.model` file needs to be converted from a base64-encoded vocabulary to a SentencePiece model. I drafted a PR that does this within the `get_spm_processor` method in `tokenization_llama.py`. The change adds logic to decode each line of the vocabulary file from base64, which lets the tokenizer handle base64-encoded files.
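To illustrate the decoding step: Llama 3.1 ships `tokenizer.model` as a tiktoken-style file where each line is a base64-encoded token followed by its rank. A minimal sketch of parsing that format (the function name is mine, not the PR's actual code):

```python
import base64

def decode_tiktoken_vocab(lines):
    """Decode lines of the form '<base64 token> <rank>' into a bytes->rank dict.

    This is the tiktoken-style format Llama 3.1 uses for tokenizer.model; the
    SentencePiece loader expects a protobuf model instead, hence the mismatch.
    """
    vocab = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        token_b64, rank = line.split()
        vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab

# Example: one line encoding the token b"hello" with rank 0
sample = [base64.b64encode(b"hello").decode() + " 0"]
print(decode_tiktoken_vocab(sample))  # {b'hello': 0}
```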
System Info
When initializing `LlamaTokenizer` from the Transformers library, the object returned by `from_pretrained` is a `bool` rather than a tokenizer instance. This issue persists across different environments and Python versions.
Steps to Reproduce:

1. Install the required libraries: `pip install transformers torch sentencepiece`
2. Use the following script to initialize the tokenizer:

```python
from transformers.models.llama import LlamaTokenizer

model_path = "C:/Users/spger/.llama/checkpoints/Llama3.1-70B"

try:
    tokenizer = LlamaTokenizer.from_pretrained(model_path, use_fast=True, legacy=False)
    print("Tokenizer initialized successfully.")
    print("Tokenizer type:", type(tokenizer))
except Exception as e:
    print("Error initializing tokenizer:", e)
```
Observed Output: The tokenizer type is `<class 'bool'>` instead of the expected tokenizer class.
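While debugging, a small guard (a hypothetical helper, not part of transformers) can make this failure mode explicit instead of letting a `bool` propagate:

```python
def ensure_tokenizer(obj):
    """Raise immediately if from_pretrained handed back a bool instead of a tokenizer."""
    if isinstance(obj, bool):
        raise TypeError(
            f"from_pretrained returned {obj!r} instead of a tokenizer instance; "
            "check that tokenizer.model is in a format the slow tokenizer supports"
        )
    return obj
```

Usage: `tokenizer = ensure_tokenizer(LlamaTokenizer.from_pretrained(model_path))`.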
transformers version: 4.46.3
Platform: Windows-10-10.0.26100-SP0
Python version: 3.11.9
Huggingface_hub version: 0.26.3
Safetensors version: 0.4.5
Accelerate version: not installed
Accelerate config: not found
PyTorch version (GPU?): 2.5.1+cpu (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: No
Additional Details: Other tokenizers like AutoTokenizer for GPT-2 and BERT initialize correctly.
Who can help?
@ArthurZucker @itazap
Information
Tasks
Reproduction
Steps to reproduce the behavior:

1. Install the required libraries: `pip install transformers torch sentencepiece`
2. Run the following script:

```python
from transformers.models.llama import LlamaTokenizer

model_path = "C:/Users/spger/.llama/checkpoints/Llama3.1-70B"

try:
    tokenizer = LlamaTokenizer.from_pretrained(model_path, use_fast=True, legacy=False)
    print("Tokenizer initialized successfully.")
    print("Tokenizer type:", type(tokenizer))
except Exception as e:
    print("Error initializing tokenizer:", e)
```
Expected behavior
The tokenizer should initialize as an instance of `LlamaTokenizer` (or its fast equivalent), with `type(tokenizer)` reporting that class rather than `bool`.