minimaxir / aitextgen

A robust Python tool for text-based AI training and generation using GPT-2.
https://docs.aitextgen.io
MIT License

Is <|startoftext|> broken or deprecated by default? #159

Open MaxGodTier opened 2 years ago

MaxGodTier commented 2 years ago

There are open issues (#88 and #101) suggesting <|startoftext|> is still being used. However, is it used by default? Here it says otherwise. Or is it specifically an aitextgen feature (as it was in gpt-2-simple)? I'm fine-tuning a DistilGPT2 model, and <|startoftext|> appears in the generated output.

from aitextgen import aitextgen

prompt = ''  # empty prompt, so generation starts unconditionally
ai = aitextgen(model_folder='trained_model', to_gpu=True)
ai.generate_to_file(n=1000,
                    batch_size=50,
                    prompt=prompt,
                    min_length=1024,
                    max_length=1024,
                    temperature=0.7,
                    repetition_penalty=1.3,
                    top_p=0.9)

Outputs: <|startoftext|>How much wood would a woodchuck chuck if a woodchuck could chuck wood?
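As a sanity check (this is the stock Hugging Face tokenizer, nothing aitextgen-specific), I can see that <|startoftext|> is not a registered special token in DistilGPT2:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
print(tokenizer.special_tokens_map)  # only <|endoftext|> is registered by default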

It's not supposed to happen, is it? My training data is split into multiple files.

data1.txt:

<|startoftext|>one thousand twenty four characters
line break and one thousand twenty four characters
line break and one thousand twenty four characters
final line break and more characters <|endoftext|>

data2.txt:

<|startoftext|>one thousand twenty four characters
line break and one thousand twenty four characters
line break and one thousand twenty four characters
final line break and more characters <|endoftext|>

data3.txt: ...

There's a line break every 1024 characters to avoid issue #124. All those files are merged into one huge file, traindata.txt.
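For completeness, the merge step is just concatenation; a minimal sketch (assuming the dataN.txt naming above):

from pathlib import Path

# Concatenate every dataN.txt into a single training file
with open("traindata.txt", "w", encoding="utf-8") as out:
    for path in sorted(Path(".").glob("data*.txt")):
        out.write(path.read_text(encoding="utf-8"))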

Then I tokenize it and generate dataset_cache.tar.gz

from aitextgen.TokenDataset import TokenDataset
data = TokenDataset("traindata.txt")
data.save()
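If I'm reading the aitextgen docs right, data.save() writes dataset_cache.tar.gz by default, and the cache can be reloaded later without re-tokenizing:

from aitextgen.TokenDataset import TokenDataset

# Reload the compressed token cache directly
data = TokenDataset("dataset_cache.tar.gz", from_cache=True)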

Then I train the model

import logging
logging.basicConfig(
    format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)

from aitextgen import aitextgen
ai = aitextgen(model="distilgpt2", to_gpu=True)
file_name = "dataset_cache.tar.gz"

ai.train(file_name,
         line_by_line=False,
         from_cache=True, # file_name points to a pre-tokenized cache, not raw text
         num_steps=3000000,
         generate_every=1000,
         save_every=1000,
         save_gdrive=False,
         learning_rate=1e-3,
         fp16=False,
         num_workers=0, # Workaround for training on Windows
         batch_size=8, # 85% VRAM usage with 24GB GPU
         )

Did I miss anything? I've checked every DistilGPT2 fine-tuning tutorial for Hugging Face's Transformers and they're all completely outdated. I've spent at least a week cross-examining contradictory information, and I don't know what to believe anymore.

minimaxir commented 2 years ago

<|startoftext|> was unfortunately a mistake I made when building gpt-2-simple. It's technically redundant: <|endoftext|> at the start of a text serves the same purpose and requires only a single token, which is how Hugging Face's tokenizers implement it.
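A quick illustration with the stock GPT-2 BPE vocabulary (DistilGPT2 uses the same one):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.encode("<|endoftext|>"))    # [50256], a single special token
print(tokenizer.encode("<|startoftext|>"))  # multiple ids, plain BPE pieces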

If you're getting <|startoftext|> on DistilGPT2, that's weird.