mgrankin / ru_transformers


How does one preprocess a dataset? #33

Closed fen0s closed 4 years ago

fen0s commented 4 years ago

I've managed to get finetuning running, but hit another wall: how does one go about preprocessing a dataset? Apparently, the model doesn't grasp the default <|startoftext|> tag, and from what I can see in corpus.ipynb there's only a newline <|n|> tag. How would you mark the start and end of a piece for the model and tokenizer?

Kepler-Br commented 4 years ago

Same question. I'm using this:

import os
import re

def process_function(path_to_file):
    # insert a space between a non-space punctuation character and a following word character
    match = re.compile(r'(?=[^ ])([\W])([\w])')
    # collapse runs of 3 or more identical characters down to exactly 3
    match2 = re.compile(r'(.|\s)\1\1+')
    with open(path_to_file, 'r') as f:
        lines = f.read()
    # make sure the text starts with a single leading space
    if lines and lines[0] != ' ':
        lines = ' ' + lines
    lines = match.sub(r'\g<1> \g<2>', lines)
    lines = match2.sub(r'\1' * 3, lines)
    # SAVE_TO_PATH is a directory I define elsewhere in my script
    path = os.path.join(SAVE_TO_PATH, os.path.split(path_to_file)[1])
    with open(path, 'w') as handle:
        handle.write(lines)

But I highly doubt that this is the correct way of doing it.
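
For what it's worth, here is how I call it over a folder of text files. SOURCE_PATH and SAVE_TO_PATH are just directories I define myself, nothing from the repo:

import glob
import os

SOURCE_PATH = 'raw_texts'     # folder with the original .txt files (my own layout)
SAVE_TO_PATH = 'clean_texts'  # where process_function writes the cleaned copies

os.makedirs(SAVE_TO_PATH, exist_ok=True)
for path in glob.glob(os.path.join(SOURCE_PATH, '*.txt')):
    process_function(path)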

mgrankin commented 4 years ago

Hello @fen0s @Kepler-Br. The dataset I used is a few PDFs of fiction stories; each of them usually contains several stories. I don't know of a reliable way to separate the stories from each other, so I decided to let GPT2 figure it out on its own.

fen0s commented 4 years ago

@Kepler-Br I've managed to make GPT grasp the endoftext/startoftext tags. Just use special tags (like <|startofarticle|>, because <|startoftext|> seems to be common for some reason in the unfreeze_all model and confuses GPT a bit), and raise the learning rate to 0.0001 on small datasets (30-100 MB) or 0.001 on VERY small datasets (<10 MB). At least, that's what helped me.
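
In case it helps, this is roughly what my tagging step looks like. The <|startofarticle|> / <|endofarticle|> tags and the file layout are my own choices, not anything defined by the repo:

import glob

# Wrap every article with custom boundary tags and concatenate everything
# into one training file. The tag names are ones I picked myself; any
# token that doesn't occur in the corpus should work.
with open('train_corpus.txt', 'w', encoding='utf-8') as out:
    for path in glob.glob('articles/*.txt'):
        with open(path, 'r', encoding='utf-8') as f:
            text = f.read().strip()
        out.write('<|startofarticle|>\n' + text + '\n<|endofarticle|>\n')

Then I just pass the higher learning rate to the finetuning script (--learning_rate in run_lm_finetuning.py in my case, but check whatever script you're running).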