jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

Why is tokenizer.model_max_length set to 1000000000000000019884624838656? #23

Closed kevinhu closed 12 months ago

kevinhu commented 1 year ago

See https://huggingface.co/PY007/TinyLlama-1.1B-step-50K-105b/blob/main/tokenizer_config.json#L22
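For context, the value can be reproduced by loading the tokenizer directly (a minimal sketch, assuming the `transformers` library is installed):

```python
from transformers import AutoTokenizer

# Load the tokenizer from the checkpoint referenced above and
# inspect the suspicious default.
tokenizer = AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-step-50K-105b")
print(tokenizer.model_max_length)  # 1000000000000000019884624838656
```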

ChuXNobody commented 1 year ago

It's not a model training parameter; it's a setting tied to the training data, used to get the max token length.

jzhang38 commented 12 months ago

It is also present in the Llama 2 config: https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/tokenizer_config.json#L22. I guess it is there for compatibility with other HuggingFace tokenizers. You can safely ignore it.
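To add a bit of detail: that odd-looking number is just `int(1e30)` after double-precision rounding, which `transformers` uses as its `VERY_LARGE_INTEGER` sentinel for `model_max_length` whenever a tokenizer config doesn't declare a real limit. If you want a concrete cap, you can override it at load time (a minimal sketch; 2048 is assumed here to match TinyLlama's training context, adjust as needed):

```python
from transformers import AutoTokenizer

# The sentinel is int(1e30): 1e30 is a float, and the nearest
# representable double is 1000000000000000019884624838656.
print(int(1e30))  # 1000000000000000019884624838656

# Override the sentinel with an explicit context length
# (2048 assumed here; pick whatever fits your use case).
tokenizer = AutoTokenizer.from_pretrained(
    "PY007/TinyLlama-1.1B-step-50K-105b",
    model_max_length=2048,
)
print(tokenizer.model_max_length)  # 2048
```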