mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

How can I use custom tokenizer for pretraining? #443

Closed: mayankjobanputra closed this issue 1 year ago

mayankjobanputra commented 1 year ago

❓ Question

I want to train a custom tokenizer (just like the GPTNeoX tokenizer) using the same training script that GPTNeoX provides. My questions are as follows:

  1. Do I need to change anything other than tokenizer_name in the YAML (see the sketch after this list)?
  2. If I make the vocabulary smaller, I assume I should also change vocab_size under the model config?
  3. Is there anything else I need to change?
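
For concreteness, the changes I have in mind look roughly like this (a sketch only; I am assuming the layout of the repo's example configs, and the path and vocab size are placeholders):

tokenizer:
  name: /path/to/custom-tokenizer    # instead of e.g. EleutherAI/gpt-neox-20b

model:
  name: mpt_causal_lm
  vocab_size: 32000    # match the custom tokenizer's vocabulary size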

P.S. Thanks for the amazing repository.
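
Edit: for context, the custom tokenizer itself can be trained with the Hugging Face tokenizers library. A minimal byte-level BPE sketch (not the exact GPT-NeoX script; the corpus path, vocab size, and special tokens are placeholders):

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE, the same family of tokenizer GPT-NeoX uses
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=['<|endoftext|>'])
tokenizer.train(files=['corpus.txt'], trainer=trainer)

# Produces a single JSON file that Tokenizer.from_file can load later
tokenizer.save('custom-tokenizer.json')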

mayankjobanputra commented 1 year ago

In case anyone is wondering the same: besides the YAML changes mentioned above, you will have to change how the tokenizer is built in llmfoundry/utils/builders.py so that it loads the custom tokenizer from file, e.g.:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# tokenizer_name here is the path to the saved tokenizer.json file
tokenizer = Tokenizer.from_file(tokenizer_name)
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
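
For completeness, here is a minimal sketch of how this could be wired into the tokenizer-building code (illustrative only, not the exact build_tokenizer code in builders.py; the .json path check is just my convention for telling a local tokenizer file apart from a Hugging Face Hub name):

import os

from tokenizers import Tokenizer
from transformers import AutoTokenizer, PreTrainedTokenizerFast

def build_tokenizer(tokenizer_name: str):
    # Hypothetical helper: load a locally trained tokenizer.json if one is
    # given, otherwise fall back to the usual Hub lookup.
    if tokenizer_name.endswith('.json') and os.path.isfile(tokenizer_name):
        inner = Tokenizer.from_file(tokenizer_name)
        return PreTrainedTokenizerFast(tokenizer_object=inner)
    return AutoTokenizer.from_pretrained(tokenizer_name)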