jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0
7.3k stars 425 forks

Training Run - New Tokenizer #185

Open dustinwloring1988 opened 1 month ago

dustinwloring1988 commented 1 month ago

Hello, I was attempting to recreate this, but with the tokenizer from Llama 3 (tiktoken) plus a few changes. I would be OK with training a tiktoken tokenizer from scratch if needed, but I could not find the code to do so. I was trying to add Fill-In-the-Middle (FIM) tokens and then train on two different kinds of pretraining datasets: one for next-token prediction and one for FIM. I figured a small model of this size would be great for testing.
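For what it's worth, the data-side half of this is straightforward to sketch. Below is a minimal pure-Python example of turning an ordinary document into a PSM-format (prefix-suffix-middle) FIM training sample, in the style of Bavarian et al.'s FIM paper. The token strings (`<|fim_prefix|>` etc.) are an assumption here; whatever strings you choose would also need to be registered as special tokens in the tokenizer itself (e.g. via the `special_tokens` argument when constructing a `tiktoken.Encoding`), which is a separate step.

```python
import random

# Hypothetical FIM sentinel strings (an assumption; use whatever
# strings you actually add to the tokenizer's special-token vocab).
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"

def to_fim_sample(text: str, rng: random.Random) -> str:
    """Rewrite a document into a PSM-format FIM sample.

    Two random cut points split the text into prefix / middle / suffix;
    the sample is laid out so the model sees the prefix and suffix first
    and learns to generate the middle after the <|fim_middle|> sentinel.
    """
    # Two distinct cut points in [0, len(text)], sorted so i <= j.
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

In a pretraining pipeline you would apply this transform to some fraction of documents in the FIM dataset (the FIM paper uses a fixed "FIM rate") and leave the rest as plain next-token-prediction text, then tokenize the result as usual.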

If anyone has more info on either approach, I would appreciate it. Also, if there is a better training document for this project, I would be interested in a link.

cduk commented 1 month ago

What changes did you plan to make to the tokenizer?