karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

My own tokenizer #422

Open spcrobocar opened 10 months ago

spcrobocar commented 10 months ago

I am working on using nanoGPT to solve a geometry problem. I would like to use the GPT-2 network structure but my own tokenizer. My vocabulary size is 1500, and I have my own encode/decode code that converts my data into a uint16 array. I am currently using the config/train_gpt2.py configuration file. When I started training, it printed something like "Defaulting to vocab_size of GPT2 to 50000". I do not need such a large vocabulary size. How can I change the config file to use my own tokenizer and vocabulary?
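That "defaulting" message appears when train.py finds no meta.pkl next to the training data, so a fix along these lines should work: write the uint16 token arrays to train.bin/val.bin and drop a meta.pkl carrying the vocab size alongside them, in the style of data/shakespeare_char/prepare.py. A minimal sketch, assuming your own encode() and an input.txt corpus (both placeholders):

```python
# prepare.py -- sketch; encode() and the file names stand in for your own tokenizer
import pickle
import numpy as np

def encode(text):
    # substitute your geometry tokenizer here; ids must fall in [0, 1500)
    raise NotImplementedError

with open('input.txt', 'r') as f:
    data = f.read()

ids = np.array(encode(data), dtype=np.uint16)  # 1500 < 2**16, so uint16 is safe
n = len(ids)
ids[:int(n * 0.9)].tofile('train.bin')  # simple 90/10 train/val split
ids[int(n * 0.9):].tofile('val.bin')

# train.py reads vocab_size from this file instead of defaulting to GPT-2's
with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': 1500}, f)
```

Place the three files in data/<your_dataset>/ and point the config's dataset at that directory; if you also want sample.py to decode generations, include stoi/itos mappings in meta.pkl the way the shakespeare_char example does.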

VatsaDev commented 10 months ago

I believe nanoGPT supports a meta.pkl (meta pickle) file for custom encodings; you could train one with SentencePiece.
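If you go the SentencePiece route, here is a rough sketch using the sentencepiece pip package (the corpus file name and model prefix are made-up placeholders; the 1500 vocab size matches the question):

```python
# train_tokenizer.py -- sketch; 'geometry_corpus.txt' and 'geom' are placeholders
import pickle
import sentencepiece as spm

# train a 1500-token model on your corpus
spm.SentencePieceTrainer.train(
    input='geometry_corpus.txt',
    model_prefix='geom',
    vocab_size=1500,
)

sp = spm.SentencePieceProcessor(model_file='geom.model')
ids = sp.encode('some geometry text')  # list of ints in [0, 1500)
text = sp.decode(ids)                  # back to a string

# record the vocab size where nanoGPT's train.py will look for it
with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': sp.vocab_size()}, f)
```

SentencePiece defaults to a unigram model; pass model_type='bpe' to SentencePieceTrainer.train if you prefer BPE.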