shyamsn97 / mario-gpt

[NeurIPS 2023] Generating Mario Levels with GPT2. Code for the paper "MarioGPT: Open-Ended Text2Level Generation through Large Language Models" https://arxiv.org/abs/2302.05981
https://huggingface.co/shyamsn97/Mario-GPT2-700-context-length
MIT License

Fine-tuned tokenizer #29

Closed: ChinJianGin closed this issue 5 months ago

ChinJianGin commented 8 months ago

Hello! Your project is awesome, and I'm delighted to work with such a fantastic project. When I cloned your project and trained it myself, I tried to save the tokenizer the same way I save the model, but the resulting tokenizer_config.json and tokenizer.json don't match your config. I don't know how to resize the vocab from 50256 to 256 and set the "endoftext" token to id = 0. Could you give me some tips on how to fine-tune the tokenizer? I'm asking because when I run the trained model, I have to use your tokenizer; if I use the tokenizer that I saved, decoding goes wrong.

This is my tokenizer_config.json file: [screenshot]

This is my tokenizer.json file: [screenshots]

shyamsn97 commented 5 months ago

Hey! You can see an example for training in the train notebook, but here’s where I get the tokenizer ready: https://github.com/shyamsn97/mario-gpt/blob/main/mario_gpt/dataset.py#L68
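
The key step on that line is retraining the base GPT-2 tokenizer on character-split level strings with `train_new_from_iterator`, which shrinks the vocab down to the tile characters and keeps `<|endoftext|>` as a special token, so it lands at id 0. A minimal sketch of that idea (the level strings and the target vocab size of 256 here are just placeholders, not the actual dataset):

```python
from transformers import AutoTokenizer

# Base GPT-2 tokenizer (the fast version supports retraining).
base_tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# Placeholder corpus: level rows as strings of tile characters.
string_levels = [
    "--------------",
    "----?----E----",
    "XXXXXXXXXXXXXX",
]

def characterize(strings):
    # Split each level string into single characters so every
    # tile becomes its own token when the tokenizer is retrained.
    return [list(s) for s in strings]

# Retrain on the character-level corpus with a small target vocab.
# The new tokenizer keeps <|endoftext|> (it gets id 0) and only learns
# the handful of tile characters instead of GPT-2's ~50k merges.
tokenizer = base_tokenizer.train_new_from_iterator(
    characterize(string_levels), vocab_size=256
)

print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))  # 0
tokenizer.save_pretrained("my-mario-tokenizer")  # writes tokenizer.json etc.
```

Saving with `save_pretrained` after retraining like this (rather than saving the untouched GPT-2 tokenizer) should give you a tokenizer.json and tokenizer_config.json that line up with the ones on the Hub.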

ChinJianGin commented 5 months ago

Oh! I got it! Thank you very much for the help. Have a nice day!