huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

LlamaTokenizer being recognized as a bool #35037

Open spgerlach opened 2 days ago

spgerlach commented 2 days ago

System Info

When initializing `LlamaTokenizer` from the Transformers library, `from_pretrained` returns a `bool` instead of a tokenizer object. This issue persists across different environments and Python versions.

Steps to Reproduce:

  1. Install the required libraries:

    pip install transformers torch sentencepiece

  2. Use the following script to initialize the tokenizer:

    from transformers.models.llama import LlamaTokenizer

    model_path = "C:/Users/spger/.llama/checkpoints/Llama3.1-70B"

    try:
        tokenizer = LlamaTokenizer.from_pretrained(model_path, use_fast=True, legacy=False)
        print("Tokenizer initialized successfully.")
        print("Tokenizer type:", type(tokenizer))
    except Exception as e:
        print("Error initializing tokenizer:", e)

Observed Output: The tokenizer type is <class 'bool'> instead of the expected tokenizer class.
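Note that the try block prints the type without raising, so `from_pretrained` is actually returning a bool here rather than failing. One way to see why this particular checkpoint trips up the slow tokenizer is to inspect the tokenizer.model file: Llama 3.x checkpoints downloaded with Meta's llama CLI reportedly ship a tiktoken-style rank file (one base64-encoded token plus rank per line) instead of the SentencePiece protobuf that `LlamaTokenizer` expects. A minimal diagnostic sketch, assuming that file layout (the check itself is mine, not from this thread):

    import base64
    from pathlib import Path

    # Diagnostic sketch (hypothetical, not part of transformers): peek at the
    # first line of tokenizer.model to guess its format.
    model_file = Path("C:/Users/spger/.llama/checkpoints/Llama3.1-70B") / "tokenizer.model"

    with model_file.open("rb") as f:
        first_line = f.readline().strip()

    try:
        # tiktoken-style rank files are plain text: "<base64 token> <rank>" per line.
        token_b64, rank = first_line.rsplit(b" ", 1)
        base64.b64decode(token_b64, validate=True)
        int(rank.decode("ascii"))
        print("Looks like a tiktoken-style base64 rank file (Llama 3.x format).")
    except ValueError:
        print("Not a base64 rank file; possibly a SentencePiece protobuf.")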

System Info: transformers version: 4.46.3

Platform: Windows-10-10.0.26100-SP0

Python version: 3.11.9

Huggingface_hub version: 0.26.3

Safetensors version: 0.4.5

Accelerate version: not installed

Accelerate config: not found

PyTorch version (GPU?): 2.5.1+cpu (False)

Tensorflow version (GPU?): not installed (NA)

Flax version (CPU?/GPU?/TPU?): not installed (NA)

Jax version: not installed

JaxLib version: not installed

Using distributed or parallel set-up in script?: No

Additional Details: Other tokenizers like AutoTokenizer for GPT-2 and BERT initialize correctly.
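As another cross-check (my suggestion, not something tried in this thread): the Hub distribution of Llama 3.1 ships a `tokenizer.json` for the fast tokenizer, so loading through `AutoTokenizer` from the converted repo avoids the SentencePiece code path entirely. A minimal sketch, assuming access to the gated `meta-llama/Llama-3.1-70B` repo:

    # Sketch only: requires `huggingface-cli login` and approved access to the
    # gated meta-llama repo.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B")
    print(type(tokenizer))  # expected: a PreTrainedTokenizerFast subclass
    print(tokenizer("Hello, world!")["input_ids"])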

Who can help?

@ArthurZucker @itazap

Information

Tasks

Reproduction

    from transformers.models.llama import LlamaTokenizer

    model_path = "C:/Users/spger/.llama/checkpoints/Llama3.1-70B"

    try:
        tokenizer = LlamaTokenizer.from_pretrained(model_path, use_fast=True, legacy=False)
        print("Tokenizer initialized successfully.")
        print("Tokenizer type:", type(tokenizer))
    except Exception as e:
        print("Error initializing tokenizer:", e)

Steps to reproduce the behavior:

  1. Install the required libraries:
    
    pip install transformers torch sentencepiece
  2. Run the provided script.
  3. Observe that the tokenizer is of type bool.

Expected behavior

I expect the `LlamaTokenizer` to be correctly initialized and recognized as a `LlamaTokenizer` object instead of a `bool`.
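In the meantime, a defensive guard (my own illustration, not from the report) makes the failure explicit instead of letting a bool propagate into downstream code:

    # Illustrative guard: fail fast if from_pretrained returns a non-tokenizer.
    from transformers import PreTrainedTokenizerBase
    from transformers.models.llama import LlamaTokenizer

    model_path = "C:/Users/spger/.llama/checkpoints/Llama3.1-70B"
    tokenizer = LlamaTokenizer.from_pretrained(model_path, use_fast=True, legacy=False)
    if not isinstance(tokenizer, PreTrainedTokenizerBase):
        raise TypeError(f"from_pretrained returned {type(tokenizer)!r}, expected a tokenizer")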
aka133 commented 1 day ago

Hi, I'd love to help solve this if possible. I'm new to fixing issues, but here is what I found:

It seems that in order to use Llama 3.1 with the `LlamaTokenizer`, the tokenizer.model file needs to be converted from a base64 vocabulary to a SentencePiece model. I drafted a PR that does that within the `get_spm_processor` method in the `tokenization_llama.py` file. The changes add logic to decode each line of the vocabulary file from base64, so the tokenizer can handle base64-encoded files.

Also, the Llama tokenizer script uses `SPIECE_UNDERLINE` to handle spaces, which is not applicable to Llama 3.1 since it doesn't use this token. I wasn't exactly sure how to fix this but wanted to point it out (should this be a separate issue?).
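A rough sketch of the base64 decoding step described above (the helper name and placement are illustrative, not the PR's actual code):

    import base64

    # Illustrative only: parse a Llama 3.x tokenizer.model, where each line is
    # "<base64-encoded token> <rank>", into a token-bytes -> rank mapping.
    def load_base64_vocab(vocab_file):
        ranks = {}
        with open(vocab_file, "rb") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                token_b64, rank = line.rsplit(b" ", 1)
                ranks[base64.b64decode(token_b64)] = int(rank.decode("ascii"))
        return ranks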