shyamsn97 / mario-gpt

[NeurIPS 2023] Generating Mario Levels with GPT2. Code for the paper "MarioGPT: Open-Ended Text2Level Generation through Large Language Models" https://arxiv.org/abs/2302.05981
https://huggingface.co/shyamsn97/Mario-GPT2-700-context-length
MIT License

Fine-tuned tokenizer #29

Closed: ChinJianGin closed this issue 5 months ago

ChinJianGin commented 8 months ago

Hello! Your project is awesome, and I'm delighted to work with such a fantastic project. When I cloned your project and trained it myself, I tried to save the tokenizer the same way I save the model, but the resulting tokenizer_config.json and tokenizer.json don't match your config. I don't know how to resize the vocab from 50256 to 256 and set the "endoftext" token to id = 0. Could you give me some tips on how to fine-tune the tokenizer? I'm asking because when I run the trained model, I have to use your tokenizer; if I use the tokenizer that I saved, decoding goes wrong.

This is my tokenizer_config.json file: [screenshot]

This is my tokenizer.json file: [screenshots]

shyamsn97 commented 5 months ago

Hey! You can see an example for training in the train notebook, but here’s where I get the tokenizer ready: https://github.com/shyamsn97/mario-gpt/blob/main/mario_gpt/dataset.py#L68
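
The key step on that line is retraining the base GPT-2 tokenizer on character-split level strings with `train_new_from_iterator`, which shrinks the vocab down to the tile characters and keeps `<|endoftext|>` as a special token, so it lands at id 0. A minimal sketch of that idea (the level strings and the target vocab size of 256 here are just placeholders, not the actual dataset):

```python
from transformers import AutoTokenizer

# Base GPT-2 tokenizer (the fast version supports retraining).
base_tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# Placeholder corpus: level rows as strings of tile characters.
string_levels = [
    "--------------",
    "----?----E----",
    "XXXXXXXXXXXXXX",
]

def characterize(strings):
    # Split each level string into single characters so every
    # tile becomes its own token when the tokenizer is retrained.
    return [list(s) for s in strings]

# Retrain on the character-level corpus with a small target vocab.
# The new tokenizer keeps <|endoftext|> (it gets id 0) and only learns
# the handful of tile characters instead of GPT-2's ~50k merges.
tokenizer = base_tokenizer.train_new_from_iterator(
    characterize(string_levels), vocab_size=256
)

print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))  # 0
tokenizer.save_pretrained("my-mario-tokenizer")  # writes tokenizer.json etc.
```

Saving with `save_pretrained` after retraining like this (rather than saving the untouched GPT-2 tokenizer) should give you a tokenizer.json and tokenizer_config.json that line up with the ones on the Hub.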

ChinJianGin commented 5 months ago

Oh! I got it! Thank you very much for the help. Have a nice day!