mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

How can I use custom tokenizer for pretraining? #443

Closed: mayankjobanputra closed this issue 1 year ago

mayankjobanputra commented 1 year ago

❓ Question

I want to train a custom tokenizer (just like the GPTNeoX tokenizer) using the same training script that GPTNeoX provides. My questions are as follows:

  1. Do I need to change anything other than tokenizer_name in the YAML (see the sketch after this list)?
  2. If I make the vocabulary smaller, I assume I should also change vocab_size under the model config?
  3. Is there anything else I need to change?
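
For concreteness, the changes I have in mind look roughly like this (a sketch only; I am assuming the layout of the repo's example configs, and the path and vocab size are placeholders):

tokenizer:
  name: /path/to/custom-tokenizer    # instead of e.g. EleutherAI/gpt-neox-20b

model:
  name: mpt_causal_lm
  vocab_size: 32000    # match the custom tokenizer's vocabulary size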

P.S. Thanks for the amazing repository.
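
Edit: for context, the custom tokenizer itself can be trained with the Hugging Face tokenizers library. A minimal byte-level BPE sketch (not the exact GPT-NeoX script; the corpus path, vocab size, and special tokens are placeholders):

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE, the same family of tokenizer GPT-NeoX uses
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=['<|endoftext|>'])
tokenizer.train(files=['corpus.txt'], trainer=trainer)

# Produces a single JSON file that Tokenizer.from_file can load later
tokenizer.save('custom-tokenizer.json')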

mayankjobanputra commented 1 year ago

In case anyone is wondering the same: besides the YAML changes mentioned above, you will have to change how the tokenizer is built in llmfoundry/utils/builders.py so that it loads the custom tokenizer from file, e.g.:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# tokenizer_name here is the path to the saved tokenizer.json file
tokenizer = Tokenizer.from_file(tokenizer_name)
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
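
For completeness, here is a minimal sketch of how this could be wired into the tokenizer-building code (illustrative only, not the exact build_tokenizer code in builders.py; the .json path check is just my convention for telling a local tokenizer file apart from a Hugging Face Hub name):

import os

from tokenizers import Tokenizer
from transformers import AutoTokenizer, PreTrainedTokenizerFast

def build_tokenizer(tokenizer_name: str):
    # Hypothetical helper: load a locally trained tokenizer.json if one is
    # given, otherwise fall back to the usual Hub lookup.
    if tokenizer_name.endswith('.json') and os.path.isfile(tokenizer_name):
        inner = Tokenizer.from_file(tokenizer_name)
        return PreTrainedTokenizerFast(tokenizer_object=inner)
    return AutoTokenizer.from_pretrained(tokenizer_name)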