Trained from base model output characters seem to be wrong

shyamsn97 / mario-gpt

[Neurips 2023] Generating Mario Levels with GPT2. Code for the paper "MarioGPT: Open-Ended Text2Level Generation through Large Language Models" https://arxiv.org/abs/2302.05981

https://huggingface.co/shyamsn97/Mario-GPT2-700-context-length

MIT License

1.11k stars 103 forks source link

Trained from base model output characters seem to be wrong #26

Closed jamessha closed 1 year ago

jamessha commented 1 year ago

Hi, very interesting project! I'm trying to reproduce the results training from base and running into a problem. Using the training notebook with default parameters on 20k steps, the model is converging to a loss of ~0.05. I'm getting reasonable looking outputs sampling from this trained model but the characters look wrong:

Screenshot 2023-10-18 at 2 19 03 PM

Any ideas on what's going wrong here?

chenxd1996 commented 1 year ago

I have the same issue.

shyamsn97 commented 1 year ago

Hey! Do you have your full code for generating? You need to create the dataset actually to "train" the tokenizer.

shyamsn97 commented 1 year ago


import torch
from mario_gpt import MarioDataset, MarioLM
from mario_gpt.utils import view_level, convert_level_to_png, join_list_of_list, characterize
mario_lm = MarioLM(lm_path="path_to_trained")

dataset = MarioDataset(mario_lm.tokenizer)

# now the tokenizer should be good
view_level(dataset.input_ids[:700], mario_lm.tokenizer)

I've been meaning to change this behavior, but for now this could help I think

jamessha commented 1 year ago

Thanks for the response, turns out the offending line was mario_lm = MarioLM(lm_path=lm_path, tokenizer_path='distilgpt2') I'm not totally sure why I thought this was a good idea 🙃, using either the upstream tokenizer or saving the tokenizer after training works.

I also tried your suggestion, it works but you also need to manually set the lm tokenizer after doing so: mario_lm.tokenizer = dataset.tokenizer

Appreciate the help!