huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Accessing Tokenizer's model_max_length config #1451

Closed · dopc closed this 4 months ago

dopc commented 4 months ago

Hey, hope you are well.

I want to use Cohere's tokenizer from the Hugging Face Hub and get the model_max_length value from its tokenizer_config.json file. Is there any way to do that?

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained(identifier="Cohere/Cohere-embed-multilingual-light-v3.0")

# Neither of these attributes exists on tokenizers.Tokenizer:
# tokenizer.max_len
# tokenizer.model_max_length
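
As an aside, the tokenizers library loads tokenizer.json and does not parse tokenizer_config.json, so the value is not exposed on the Tokenizer object. One way to read it without going through transformers is to fetch the config file directly; a minimal sketch, assuming the repo ships a tokenizer_config.json that sets model_max_length and that huggingface_hub is installed:

import json

from huggingface_hub import hf_hub_download

# Download tokenizer_config.json from the Hub and read the field directly.
config_path = hf_hub_download(
    repo_id="Cohere/Cohere-embed-multilingual-light-v3.0",
    filename="tokenizer_config.json",
)
with open(config_path) as f:
    tokenizer_config = json.load(f)

print(tokenizer_config.get("model_max_length"))  # None if the key is absent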
dopc commented 4 months ago

I found that the method below gives access to it:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-light-v3.0")
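
For completeness, a minimal sketch of reading the value once the tokenizer is loaded this way; AutoTokenizer populates model_max_length from tokenizer_config.json:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-light-v3.0")

# model_max_length is filled in from tokenizer_config.json at load time;
# if the file does not set it, transformers falls back to a very large sentinel value.
print(tokenizer.model_max_length)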