[Open] Meorge opened this issue 2 years ago
The following code appears to be working so far:
```python
from os import environ

from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU


def train():
    # The name of the downloaded Shakespeare text for training
    file_name = "shakespeare.txt"

    # Train a custom BPE tokenizer on the downloaded text.
    # This will save one file, `aitextgen.tokenizer.json`, which contains
    # the information needed to rebuild the tokenizer.
    train_tokenizer(file_name)
    tokenizer_file = "aitextgen.tokenizer.json"

    # Attempt to load a model with the CPU config
    config = GPT2ConfigCPU()
    ai = aitextgen(model="minimaxir/hacker-news", tokenizer_file=tokenizer_file, config=config)

    # You can build datasets for training by creating TokenDatasets,
    # which automatically process the text with the appropriate block size.
    data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)

    # Train the model! It will save pytorch_model.bin periodically and after
    # completion to the `trained_model` folder.
    # On a 2020 8-core iMac, this took ~25 minutes to run.
    ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)


if __name__ == "__main__":
    # Disable threading/parallelism to silence warnings
    environ["TOKENIZERS_PARALLELISM"] = "false"
    environ["OMP_NUM_THREADS"] = "1"
    train()
```
However, it's difficult to tell whether it's actually using the minimaxir/hacker-news model as a base: all of the output so far looks very Shakespeare-y and not at all Hacker News-y. Looking at the aitextgen source code, it doesn't appear that the base model should be used here, although that could just be me misreading it.
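One possible explanation (a toy illustration in plain Python, not aitextgen code, with made-up vocabularies): a tokenizer trained from scratch on the new text assigns different IDs to the same strings, so even if the pretrained weights are loaded, the same word now selects an unrelated embedding row and the base model's knowledge is effectively scrambled.

```python
# Hypothetical vocabularies, invented for illustration only.
base_vocab = {"the": 0, "startup": 1, "founder": 2, "king": 3}  # base model's tokenizer
new_vocab = {"the": 0, "king": 1, "crown": 2, "founder": 3}     # freshly trained tokenizer


def encode(vocab, words):
    """Map words to token IDs, skipping out-of-vocabulary words."""
    return [vocab[w] for w in words if w in vocab]


sentence = ["the", "founder"]
print(encode(base_vocab, sentence))  # [0, 2] -- the IDs the base model was trained on
print(encode(new_vocab, sentence))   # [0, 3] -- same text, different IDs

# Under the new tokenizer, "founder" points at embedding row 3, which the
# base model learned for "king" -- the pretrained associations no longer line up.
assert encode(base_vocab, sentence) != encode(new_vocab, sentence)
```

This would be consistent with the output looking purely Shakespeare-y: the Shakespeare-trained tokenizer plus a fresh `GPT2ConfigCPU` config may leave little or nothing of the base model intact.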
As described in the comments in the code above, my goal is to use an existing model as a base and fine-tune it on my own dataset. When I run this code, I get the following output:
As of right now, the program hasn't returned to the shell, but there's also no indication that anything is happening (i.e., no other output is showing up).
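When a Python process appears hung with no output, one library-agnostic way to see where it's stuck is the standard-library `faulthandler` module (this is a general debugging sketch, not an aitextgen feature; `SIGUSR1` is POSIX-only):

```python
import faulthandler
import signal
import sys

# Register a handler so that `kill -USR1 <pid>` from another shell makes the
# seemingly hung process dump every thread's stack trace to stderr,
# showing exactly which call it is blocked in.
faulthandler.register(signal.SIGUSR1)

# You can also dump immediately to verify the hook works:
faulthandler.dump_traceback(file=sys.stderr)
```

Adding the `register` call near the top of the training script would make it possible to tell whether the process is tokenizing, downloading, or genuinely deadlocked.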