jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0
7.3k stars 425 forks

Training Run - New Tokenizer #185

Open dustinwloring1988 opened 1 month ago

dustinwloring1988 commented 1 month ago

Hello, I was attempting to recreate this, but with the tokenizer from Llama 3 (tiktoken) plus a few changes. I would be OK with training a tiktoken tokenizer from scratch if needed, but I could not find the code to do so. I was trying to add Fill-In-the-Middle (FIM) tokens and then train on two different kinds of pretraining datasets: one for next-token prediction and one for FIM. I figured a small model of this size would be great for testing.
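For what it's worth, the data-side half of this is straightforward to sketch. Below is a minimal pure-Python example of turning an ordinary document into a PSM-format (prefix-suffix-middle) FIM training sample, in the style of Bavarian et al.'s FIM paper. The token strings (`<|fim_prefix|>` etc.) are an assumption here; whatever strings you choose would also need to be registered as special tokens in the tokenizer itself (e.g. via the `special_tokens` argument when constructing a `tiktoken.Encoding`), which is a separate step.

```python
import random

# Hypothetical FIM sentinel strings (an assumption; use whatever
# strings you actually add to the tokenizer's special-token vocab).
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"

def to_fim_sample(text: str, rng: random.Random) -> str:
    """Rewrite a document into a PSM-format FIM sample.

    Two random cut points split the text into prefix / middle / suffix;
    the sample is laid out so the model sees the prefix and suffix first
    and learns to generate the middle after the <|fim_middle|> sentinel.
    """
    # Two distinct cut points in [0, len(text)], sorted so i <= j.
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

In a pretraining pipeline you would apply this transform to some fraction of documents in the FIM dataset (the FIM paper uses a fixed "FIM rate") and leave the rest as plain next-token-prediction text, then tokenize the result as usual.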

If anyone has more info on either approach, I would appreciate it. Also, if there is a better training document for this project, I would be interested in a link.

cduk commented 1 month ago

What changes did you plan to make to the tokenizer?