vivo-ai-lab / BlueLM

BlueLM(蓝心大模型): Open large language models developed by vivo AI Lab
https://developers.vivo.com/product/ai/bluelm

Why do the paper and config.json give vocab_size as 100096, while calling the tokenizer in code shows a vocabulary of only 100004 entries? #23

Open ouening opened 6 months ago

ouening commented 6 months ago

Model: vivo-ai/BlueLM-7B-Chat-32K
URL: https://huggingface.co/vivo-ai/BlueLM-7B-Chat-32K/tree/main
Debugging code:

from transformers import AutoTokenizer

model_name = "vivo-ai/BlueLM-7B-Chat-32K"
# trust_remote_code=True is needed because the repo ships its own tokenizer class
tokenizer2 = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
vocab2 = tokenizer2.get_vocab()  # full token-to-id mapping
print(len(vocab2))  # 100004
print(tokenizer2.vocab_size)  # 100000

The 4 extra entries are special tokens:

"[|Human|]:", "[|AI|]:", "[SEH]", "[SEA]"