huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

LlamaTokenizer being recognized as a bool #35037

Open spgerlach opened 2 days ago

spgerlach commented 2 days ago

System Info

When initializing `LlamaTokenizer` from the Transformers library, `from_pretrained` returns a `bool` instead of a tokenizer object. This issue persists across different environments and Python versions.

Steps to Reproduce:

  1. Install the required libraries:

    pip install transformers torch sentencepiece

  2. Use the following script to initialize the tokenizer:

    from transformers.models.llama import LlamaTokenizer

    model_path = "C:/Users/spger/.llama/checkpoints/Llama3.1-70B"

    try:
        tokenizer = LlamaTokenizer.from_pretrained(model_path, use_fast=True, legacy=False)
        print("Tokenizer initialized successfully.")
        print("Tokenizer type:", type(tokenizer))
    except Exception as e:
        print("Error initializing tokenizer:", e)

Observed Output: The tokenizer type is <class 'bool'> instead of the expected tokenizer class.
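Note that the try block prints the type without raising, so `from_pretrained` is actually returning a bool here rather than failing. One way to see why this particular checkpoint trips up the slow tokenizer is to inspect the tokenizer.model file: Llama 3.x checkpoints downloaded with Meta's llama CLI reportedly ship a tiktoken-style rank file (one base64-encoded token plus rank per line) instead of the SentencePiece protobuf that `LlamaTokenizer` expects. A minimal diagnostic sketch, assuming that file layout (the check itself is mine, not from this thread):

    import base64
    from pathlib import Path

    # Diagnostic sketch (hypothetical, not part of transformers): peek at the
    # first line of tokenizer.model to guess its format.
    model_file = Path("C:/Users/spger/.llama/checkpoints/Llama3.1-70B") / "tokenizer.model"

    with model_file.open("rb") as f:
        first_line = f.readline().strip()

    try:
        # tiktoken-style rank files are plain text: "<base64 token> <rank>" per line.
        token_b64, rank = first_line.rsplit(b" ", 1)
        base64.b64decode(token_b64, validate=True)
        int(rank.decode("ascii"))
        print("Looks like a tiktoken-style base64 rank file (Llama 3.x format).")
    except ValueError:
        print("Not a base64 rank file; possibly a SentencePiece protobuf.")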

System Info: transformers version: 4.46.3

Platform: Windows-10-10.0.26100-SP0

Python version: 3.11.9

Huggingface_hub version: 0.26.3

Safetensors version: 0.4.5

Accelerate version: not installed

Accelerate config: not found

PyTorch version (GPU?): 2.5.1+cpu (False)

Tensorflow version (GPU?): not installed (NA)

Flax version (CPU?/GPU?/TPU?): not installed (NA)

Jax version: not installed

JaxLib version: not installed

Using distributed or parallel set-up in script?: No

Additional Details: Other tokenizers like AutoTokenizer for GPT-2 and BERT initialize correctly.
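As another cross-check (my suggestion, not something tried in this thread): the Hub distribution of Llama 3.1 ships a `tokenizer.json` for the fast tokenizer, so loading through `AutoTokenizer` from the converted repo avoids the SentencePiece code path entirely. A minimal sketch, assuming access to the gated `meta-llama/Llama-3.1-70B` repo:

    # Sketch only: requires `huggingface-cli login` and approved access to the
    # gated meta-llama repo.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B")
    print(type(tokenizer))  # expected: a PreTrainedTokenizerFast subclass
    print(tokenizer("Hello, world!")["input_ids"])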

Who can help?

@ArthurZucker @itazap

Information

Tasks

Reproduction

    from transformers.models.llama import LlamaTokenizer

    model_path = "C:/Users/spger/.llama/checkpoints/Llama3.1-70B"

    try:
        tokenizer = LlamaTokenizer.from_pretrained(model_path, use_fast=True, legacy=False)
        print("Tokenizer initialized successfully.")
        print("Tokenizer type:", type(tokenizer))
    except Exception as e:
        print("Error initializing tokenizer:", e)

Steps to reproduce the behavior:

  1. Install the required libraries:
    
    pip install transformers torch sentencepiece
  2. Run the provided script.
  3. Observe that the tokenizer is of type bool.

Expected behavior

I expect the `LlamaTokenizer` to be correctly initialized and recognized as a `LlamaTokenizer` object instead of a `bool`.
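In the meantime, a defensive guard (my own illustration, not from the report) makes the failure explicit instead of letting a bool propagate into downstream code:

    # Illustrative guard: fail fast if from_pretrained returns a non-tokenizer.
    from transformers import PreTrainedTokenizerBase
    from transformers.models.llama import LlamaTokenizer

    model_path = "C:/Users/spger/.llama/checkpoints/Llama3.1-70B"
    tokenizer = LlamaTokenizer.from_pretrained(model_path, use_fast=True, legacy=False)
    if not isinstance(tokenizer, PreTrainedTokenizerBase):
        raise TypeError(f"from_pretrained returned {type(tokenizer)!r}, expected a tokenizer")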
aka133 commented 1 day ago

Hi, I'd love to help solve this if possible. I'm new to fixing issues, but here is what I found:

It seems that in order to use Llama 3.1 with the `LlamaTokenizer`, the tokenizer.model file needs to be converted from a base64 vocabulary to a SentencePiece model. I drafted a PR that does that within the `get_spm_processor` method in the `tokenization_llama.py` file. The changes add logic to decode each line of the vocabulary file from base64, so the tokenizer can handle base64-encoded files.

Also, the Llama tokenizer script uses `SPIECE_UNDERLINE` to handle spaces, which is not applicable to Llama 3.1 since it doesn't use this token. I wasn't exactly sure how to fix this but wanted to point it out (should this be a separate issue?).
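A rough sketch of the base64 decoding step described above (the helper name and placement are illustrative, not the PR's actual code):

    import base64

    # Illustrative only: parse a Llama 3.x tokenizer.model, where each line is
    # "<base64-encoded token> <rank>", into a token-bytes -> rank mapping.
    def load_base64_vocab(vocab_file):
        ranks = {}
        with open(vocab_file, "rb") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                token_b64, rank = line.rsplit(b" ", 1)
                ranks[base64.b64decode(token_b64)] = int(rank.decode("ascii"))
        return ranks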