Closed: jamessha closed this issue 1 year ago.
I have the same issue.
Hey! Do you have your full code for generating? You actually need to create the dataset to "train" the tokenizer.
import torch
from mario_gpt import MarioDataset, MarioLM
from mario_gpt.utils import view_level, convert_level_to_png, join_list_of_list, characterize

mario_lm = MarioLM(lm_path="path_to_trained")
# constructing the dataset is what "trains" the tokenizer, i.e. extends it with the level vocabulary
dataset = MarioDataset(mario_lm.tokenizer)
# now the tokenizer should be good
view_level(dataset.input_ids[:700], mario_lm.tokenizer)
I've been meaning to change this behavior, but for now this should help, I think.
Thanks for the response! It turns out the offending line was:
mario_lm = MarioLM(lm_path=lm_path, tokenizer_path='distilgpt2')
I'm not totally sure why I thought this was a good idea 🙃. Using either the upstream tokenizer or saving the tokenizer after training works (rough sketch below).
I also tried your suggestion; it works, but you also need to manually set the LM's tokenizer afterwards:
mario_lm.tokenizer = dataset.tokenizer
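In case it's useful to anyone else hitting this, here's roughly what the two fixes look like (lm_path is the same placeholder as above, and I'm assuming MarioLM's tokenizer is a standard Hugging Face tokenizer, so save_pretrained and loading from a local directory work):

from mario_gpt import MarioLM

# Fix 1: drop tokenizer_path entirely and let MarioLM fall back to the upstream MarioGPT tokenizer
mario_lm = MarioLM(lm_path=lm_path)

# Fix 2: at the end of training, save the tokenizer next to the weights,
# then point tokenizer_path at that directory instead of 'distilgpt2' when reloading
mario_lm.tokenizer.save_pretrained(lm_path)  # run once after training
mario_lm = MarioLM(lm_path=lm_path, tokenizer_path=lm_path)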
Appreciate the help!
Hi, very interesting project! I'm trying to reproduce the results by training from the base model and running into a problem. Using the training notebook with default parameters for 20k steps, the model converges to a loss of ~0.05. I'm getting reasonable-looking outputs when sampling from this trained model, but the characters look wrong:
Any ideas on what's going wrong here?
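For context, my generation code is basically the README flow. Roughly (this assumes the mario_gpt API as of the version I'm on; newer versions may return a SampleOutput object from sample, so the viewing step could differ):

from mario_gpt import MarioLM
from mario_gpt.utils import view_level

mario_lm = MarioLM(lm_path="path_to_trained")  # checkpoint from the training notebook
prompts = ["many pipes, many enemies, some blocks, high elevation"]

# generate a level of 1400 tokens conditioned on the prompt
generated_level = mario_lm.sample(
    prompts=prompts,
    num_steps=1400,
    temperature=2.0,
    use_tqdm=True,
)

# render the sampled tokens as rows of level characters
view_level(generated_level, mario_lm.tokenizer)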