Closed: ChinJianGin closed this issue 5 months ago
Hey! You can see an example for training in the train notebook, but here’s where I get the tokenizer ready: https://github.com/shyamsn97/mario-gpt/blob/main/mario_gpt/dataset.py#L68
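For anyone who finds this thread later, here is a minimal sketch of the general idea, assuming the standard Hugging Face `train_new_from_iterator` route for shrinking a GPT-2 style vocabulary down to the handful of level characters. It is not a copy of the code in `mario_gpt/dataset.py`, so check that file for the real setup; the model name, sample strings, and save path below are placeholders.

```python
# Minimal sketch, not the actual mario-gpt code: shrink a GPT-2 style
# tokenizer to a tiny character-level vocab and save it with the model.
from transformers import AutoTokenizer

# Base tokenizer; "distilgpt2" is an assumption, use whatever your model uses.
base_tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# `level_strings` stands in for the raw level text from the training data.
level_strings = ["--------XX--", "----?Q----SS"]  # placeholder examples

# Retrain the fast tokenizer on the level text with a small target vocab.
# The original special tokens (e.g. "<|endoftext|>") are kept, and the BPE
# trainer places special tokens first, which is how one can end up at id 0.
small_tokenizer = base_tokenizer.train_new_from_iterator(
    level_strings,
    vocab_size=256,
)

print(small_tokenizer.convert_tokens_to_ids("<|endoftext|>"))

# Save next to the model so encode/decode is reproducible when you reload it.
small_tokenizer.save_pretrained("my_mario_tokenizer")
```

Saving the retrained tokenizer alongside the model and reloading both with `from_pretrained` keeps the ids used at inference time identical to those seen during training.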
Oh! I got it! Thank you very much for your help. Have a nice day!
Hello, your project is awesome, and I'm delighted to have found such a fantastic project. When I cloned your project and trained it myself, I tried to save the tokenizer the same way I save the model, but I found that the resulting tokenizer_config.json and tokenizer.json don't match your config. I don't know how to resize the vocab from 50256 to 256 and set the "endoftext" token to id = 0. Could you give me some tips on how to fine-tune the tokenizer? I'm asking because when I run the trained model I have to use your tokenizer; if I use the tokenizer that I saved, decoding goes wrong (see the round-trip check sketched below the config files).
This is my tokenizer_config.json file
This is my tokenizer.json file
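For reference, here is a small, generic sanity check that a saved tokenizer really matches what the model was trained with: reload it and confirm that encoding followed by decoding reproduces the level text. The save directory and sample string are made up for illustration.

```python
# Sketch: round-trip check for a saved tokenizer directory (path is assumed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_mario_tokenizer")

sample = "--------XX--"  # placeholder level text
ids = tokenizer(sample)["input_ids"]
decoded = tokenizer.decode(ids)

print(ids)
print(decoded == sample)  # should be True for a character-level tokenizer
```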