karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

My own tokenizer #422

Open spcrobocar opened 10 months ago

spcrobocar commented 10 months ago

I am working on using nanoGPT to solve a geometry problem. I would like to use the GPT-2 network structure but my own tokenizer. My vocabulary size is 1500, and I have my own encode/decode code that converts my data into a uint16 array. I am currently using the config/train_gpt2.py configuration file. When I started training, it printed something like "Defaulting to vocab_size of GPT2 to 50000". I do not need such a large vocabulary size. How can I change the config file to use my own tokenizer and vocabulary?
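That "defaulting" message appears when train.py finds no meta.pkl next to the training data, so a fix along these lines should work: write the uint16 token arrays to train.bin/val.bin and drop a meta.pkl carrying the vocab size alongside them, in the style of data/shakespeare_char/prepare.py. A minimal sketch, assuming your own encode() and an input.txt corpus (both placeholders):

```python
# prepare.py -- sketch; encode() and the file names stand in for your own tokenizer
import pickle
import numpy as np

def encode(text):
    # substitute your geometry tokenizer here; ids must fall in [0, 1500)
    raise NotImplementedError

with open('input.txt', 'r') as f:
    data = f.read()

ids = np.array(encode(data), dtype=np.uint16)  # 1500 < 2**16, so uint16 is safe
n = len(ids)
ids[:int(n * 0.9)].tofile('train.bin')  # simple 90/10 train/val split
ids[int(n * 0.9):].tofile('val.bin')

# train.py reads vocab_size from this file instead of defaulting to GPT-2's
with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': 1500}, f)
```

Place the three files in data/<your_dataset>/ and point the config's dataset at that directory; if you also want sample.py to decode generations, include stoi/itos mappings in meta.pkl the way the shakespeare_char example does.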

VatsaDev commented 10 months ago

I believe nanoGPT supports a meta.pkl (meta pickle) file for custom encodings; you could train one with SentencePiece.
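If you go the SentencePiece route, here is a rough sketch using the sentencepiece pip package (the corpus file name and model prefix are made-up placeholders; the 1500 vocab size matches the question):

```python
# train_tokenizer.py -- sketch; 'geometry_corpus.txt' and 'geom' are placeholders
import pickle
import sentencepiece as spm

# train a 1500-token model on your corpus
spm.SentencePieceTrainer.train(
    input='geometry_corpus.txt',
    model_prefix='geom',
    vocab_size=1500,
)

sp = spm.SentencePieceProcessor(model_file='geom.model')
ids = sp.encode('some geometry text')  # list of ints in [0, 1500)
text = sp.decode(ids)                  # back to a string

# record the vocab size where nanoGPT's train.py will look for it
with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': sp.vocab_size()}, f)
```

SentencePiece defaults to a unigram model; pass model_type='bpe' to SentencePieceTrainer.train if you prefer BPE.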