huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Language Modeling Task (GPT2 / CLM) Does Not Generate Line Breaks? #10269

Closed ColinConwell closed 3 years ago

ColinConwell commented 3 years ago

The legacy run_language_modeling.py script produced output that respected line breaks in the train_data_file. The updated run_clm.py script does not. I imagine this is likely due to how the dataset is processed in the new script, but if it is, how do I intervene and fix it?
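My guess at what happens, as a minimal sketch (the split-and-concatenate behavior here is an assumption about the text loader and the grouping step in run_clm.py, not verified against the source):

```python
# Sketch of the suspected preprocessing: the "text" loader appears to yield
# one example per line with the trailing "\n" stripped, and run_clm.py then
# concatenates the tokenized examples into fixed-size blocks.
raw_file = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

lines = raw_file.split("\n")     # the newline characters are discarded here
concatenated = "".join(lines)    # roughly what the model trains on afterwards

assert "\n" not in concatenated  # so the model never sees a line-break token
```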

Who can help

Documentation: @sgugger

Information

Model I am using (Bert, XLNet ...): GPT2


To reproduce

Steps to reproduce the behavior:

  1. Download https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
  2. Run python run_clm.py with --train_file set to input.txt
  3. Instantiate the fine-tuned GPT2 model and use model.generate to create a new sequence

Colab notebooks may be found below:

Original (with the legacy run_language_modeling.py): https://colab.research.google.com/drive/1ieS4TuaFNJhuunaAM9wVmyp-n8Yx9_la?usp=sharing

Updated (with the updated run_clm.py): https://colab.research.google.com/drive/1dqIzv7WLk7sDOmFhLdMDhyKCIEcvw3lB?usp=sharing

Expected behavior

When using the legacy run_language_modeling.py script, the output is as expected, with the correct line breaks:

[Screenshot: sample output from the legacy run_language_modeling.py, with line breaks preserved]

When running the updated run_clm.py script, line breaks are conspicuously missing:

[Screenshot: sample output from the updated run_clm.py, with line breaks missing]

Is there a straightforward way to remedy this?

My thanks as always for this wonderful repo, all your hard work, and any assistance you might be able to provide.

jncasey commented 3 years ago

Hi Colin. I ran into this same issue when I switched over to using the datasets library to load my poetry corpus, where line breaks are super important.

I ended up making a slightly modified version of the built-in text loader called text_with_linebreaks, changing line 62 to batch = batch.splitlines(True) to keep the newlines.
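In isolation, the difference between the two split calls looks like this (standalone snippet, not the loader code itself):

```python
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

# Splitting on "\n" discards the newline characters themselves:
without_breaks = text.split("\n")

# splitlines(True) (keepends=True) preserves the "\n" at the end of each line:
with_breaks = text.splitlines(True)

print(without_breaks)  # ['First Citizen:', 'Before we proceed any further, hear me speak.', '']
print(with_breaks)     # ['First Citizen:\n', 'Before we proceed any further, hear me speak.\n']
```

Because the newlines survive in each example, rejoining (or tokenizing and concatenating) the examples reconstructs the original text exactly.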

ColinConwell commented 3 years ago

@jncasey Thanks for the rapid reply! I figured the culprit here might be the switch over to huggingface/datasets. How did you end up incorporating this into your workflow? Did you modify other scripts to reference text_with_linebreaks?

jncasey commented 3 years ago

Yes, my training script is a sloppily modified version of the run_clm.py example. I added a new training arg for whether to keep the line breaks, and check for that arg in the section where the script determines which loader to use based on the file extension of the data files.

sgugger commented 3 years ago

Cc @lhoestq to see how we could surface that functionality more easily.

lhoestq commented 3 years ago

Maybe let's add a keep_linebreaks parameter to the text loader? What do you think? This is already a feature request: https://github.com/huggingface/datasets/issues/870

ColinConwell commented 3 years ago

Thanks for the rapid replies and relevant updates. Would there be interest, then, in surfacing this new functionality an extra level up, in the run_[c]lm.py script? Or should we just modify the relevant load_dataset call in that script?

sgugger commented 3 years ago

We will do that as soon as there is a new release of datasets to pin in the requirements! For now, changing the load_dataset call in the script (if you have a source install of datasets) is the best way.
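Concretely, the change would look something like the following (a sketch only: load_raw_datasets is a hypothetical helper, the variable names approximate run_clm.py, and it assumes the keep_linebreaks parameter from the linked datasets issue has landed):

```python
def load_raw_datasets(extension, data_files, load_dataset):
    """Hypothetical helper mirroring run_clm.py's load_dataset call.

    `load_dataset` is passed in (e.g. datasets.load_dataset) so the sketch
    stays self-contained; keep_linebreaks is the option proposed for the
    text loader in huggingface/datasets#870.
    """
    if extension == "text":
        # Forward the option so the text loader keeps the trailing "\n"
        # on each line instead of discarding it.
        return load_dataset(extension, data_files=data_files, keep_linebreaks=True)
    # csv/json loaders are unaffected by line-break handling.
    return load_dataset(extension, data_files=data_files)
```

With a source install of datasets you would call it as load_raw_datasets(extension, data_files, datasets.load_dataset); in run_clm.py itself the equivalent one-line change is adding keep_linebreaks=True to the existing load_dataset call when the extension is "text".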

ColinConwell commented 3 years ago

That seems a fine enough solution to me. Thanks again for the assistance. I'll close the issue for now.