jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

Reference for pretraining other small language models #165

Open · kmn1024 opened this issue 4 months ago

kmn1024 commented 4 months ago

The README mentions this codebase can act as a "reference for enthusiasts keen on pretraining language models under 5 billion parameters". I'm wondering if you could give a brief guide on how to do so, assuming we start from a transformers config and tokenizer. Something like:

{
  "architectures": [
    "..."
  ],
...
  "model_type": "...",
  "num_hidden_layers": 12,
...
}

Is a lot of work required to change the codebase to support this?

jzhang38 commented 3 months ago

https://github.com/jzhang38/TinyLlama/blob/main/lit_gpt/config.py

You can pick one of the configs defined there or create your own.
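For a Llama-style model, that mostly amounts to adding one more entry to the config list in that file. As a rough sketch (not taken from the repo), a transformers-style config like the one above maps onto the lit_gpt-style Config fields roughly as follows; the entry name and hyperparameter values are placeholders, and the exact field names should be double-checked against config.py:

# Hypothetical entry in the style of the `configs` list in lit_gpt/config.py.
# Comments show the corresponding Hugging Face transformers config keys.
custom_config = dict(
    name="my_small_llama",        # placeholder name, used to select the config in the training script
    block_size=2048,              # transformers: max_position_embeddings
    vocab_size=32000,             # transformers: vocab_size
    padding_multiple=64,          # vocab size is padded up to a multiple of this
    n_layer=12,                   # transformers: num_hidden_layers
    n_head=12,                    # transformers: num_attention_heads
    n_embd=768,                   # transformers: hidden_size
    n_query_groups=12,            # transformers: num_key_value_heads (== n_head means standard MHA)
    rotary_percentage=1.0,        # full rotary embeddings, as in Llama
    parallel_residual=False,      # Llama uses sequential (non-parallel) residuals
    bias=False,                   # Llama has no biases in its linear layers
    _norm_class="FusedRMSNorm",   # or "RMSNorm", depending on which norm classes config.py exposes
    norm_eps=1e-5,                # transformers: rms_norm_eps
    _mlp_class="LLaMAMLP",        # SwiGLU-style MLP
    intermediate_size=2048,       # transformers: intermediate_size
)

Appending an entry like this and selecting it by name in the pretraining script (the same way the existing tiny_LLaMA configs are selected) should cover most Llama-like architectures under 5B parameters; anything that deviates from the Llama block structure would also need changes in lit_gpt/model.py.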