huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Language Modeling Task (GPT2 / CLM) Does Not Generate Line Breaks? #10269

Closed ColinConwell closed 3 years ago

ColinConwell commented 3 years ago

The legacy run_language_modeling.py script produced output that respected line breaks in the train_data_file. The updated run_clm.py script does not. I imagine this is likely due to how the dataset is processed in the new script, but if it is, how do I intervene and fix it?
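My guess at what happens, as a minimal sketch (the split-and-concatenate behavior here is an assumption about the text loader and the grouping step in run_clm.py, not verified against the source):

```python
# Sketch of the suspected preprocessing: the "text" loader appears to yield
# one example per line with the trailing "\n" stripped, and run_clm.py then
# concatenates the tokenized examples into fixed-size blocks.
raw_file = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

lines = raw_file.split("\n")     # the newline characters are discarded here
concatenated = "".join(lines)    # roughly what the model trains on afterwards

assert "\n" not in concatenated  # so the model never sees a line-break token
```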

Who can help

Documentation: @sgugger

Information

Model I am using (Bert, XLNet ...): GPT2


To reproduce

Steps to reproduce the behavior:

  1. Download https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
  2. Run python run_clm.py with --train_file set to input.txt
  3. Instantiate the fine-tuned GPT2 model and use model.generate to create a new sequence

Colab notebooks may be found below:

Original (with the legacy run_language_modeling.py): https://colab.research.google.com/drive/1ieS4TuaFNJhuunaAM9wVmyp-n8Yx9_la?usp=sharing

Updated (with the updated run_clm.py): https://colab.research.google.com/drive/1dqIzv7WLk7sDOmFhLdMDhyKCIEcvw3lB?usp=sharing

Expected behavior

When using the legacy run_language_modeling.py script, the output is as expected, with the correct line breaks:

[Screenshot: sample output from the legacy run_language_modeling.py, with line breaks preserved]

When running the updated run_clm.py script, line breaks are conspicuously missing:

[Screenshot: sample output from the updated run_clm.py, with line breaks missing]

Is there a straightforward way to remedy this?

My thanks as always for this wonderful repo, all your hard work, and any assistance you might be able to provide.

jncasey commented 3 years ago

Hi Colin. I ran into this same issue when I switched over to using the datasets library to load my poetry corpus, where line breaks are super important.

I ended up making a slightly modified version of the built-in text loader called text_with_linebreaks, changing line 62 to batch = batch.splitlines(True) to keep the newlines.
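In isolation, the difference between the two split calls looks like this (standalone snippet, not the loader code itself):

```python
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

# Splitting on "\n" discards the newline characters themselves:
without_breaks = text.split("\n")

# splitlines(True) (keepends=True) preserves the "\n" at the end of each line:
with_breaks = text.splitlines(True)

print(without_breaks)  # ['First Citizen:', 'Before we proceed any further, hear me speak.', '']
print(with_breaks)     # ['First Citizen:\n', 'Before we proceed any further, hear me speak.\n']
```

Because the newlines survive in each example, rejoining (or tokenizing and concatenating) the examples reconstructs the original text exactly.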

ColinConwell commented 3 years ago

@jncasey Thanks for the rapid reply! I figured the culprit here might be the switch over to huggingface/datasets. How did you end up incorporating this into your workflow? Did you modify other scripts to reference text_with_linebreaks?

jncasey commented 3 years ago

Yes, my training script is a sloppily modified version of the run_clm.py example. I added a new training arg for whether to keep the line breaks, and check for that arg in the section where the script determines which loader to use based on the file extension of the data files.

sgugger commented 3 years ago

Cc @lhoestq to see how we could surface that functionality more easily.

lhoestq commented 3 years ago

Maybe let's add a keep_linebreaks parameter to the text loader? What do you think? This is already a feature request: https://github.com/huggingface/datasets/issues/870

ColinConwell commented 3 years ago

Thanks for the rapid replies and relevant updates. Would there be interest, then, in surfacing this new functionality an extra level up, in the run_[c]lm.py script? Or should we just modify the relevant load_dataset call in that script?

sgugger commented 3 years ago

We will do that as soon as there is a new release of datasets to pin in the requirements! For now, changing the load_dataset call in the script (if you have a source install of datasets) is the best way.
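Concretely, the change would look something like the following (a sketch only: load_raw_datasets is a hypothetical helper, the variable names approximate run_clm.py, and it assumes the keep_linebreaks parameter from the linked datasets issue has landed):

```python
def load_raw_datasets(extension, data_files, load_dataset):
    """Hypothetical helper mirroring run_clm.py's load_dataset call.

    `load_dataset` is passed in (e.g. datasets.load_dataset) so the sketch
    stays self-contained; keep_linebreaks is the option proposed for the
    text loader in huggingface/datasets#870.
    """
    if extension == "text":
        # Forward the option so the text loader keeps the trailing "\n"
        # on each line instead of discarding it.
        return load_dataset(extension, data_files=data_files, keep_linebreaks=True)
    # csv/json loaders are unaffected by line-break handling.
    return load_dataset(extension, data_files=data_files)
```

With a source install of datasets you would call it as load_raw_datasets(extension, data_files, datasets.load_dataset); in run_clm.py itself the equivalent one-line change is adding keep_linebreaks=True to the existing load_dataset call when the extension is "text".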

ColinConwell commented 3 years ago

That seems a fine enough solution to me. Thanks again for the assistance. I'll close the issue for now.