Closed ColinConwell closed 3 years ago
Hi Colin. I ran into this same issue when I switched over to using the datasets library to load my poetry corpus, where line breaks are super important.
I ended up making a slightly modified version of the built-in text loader, called text_with_linebreaks, changing line 62 to `batch = batch.splitlines(True)` to keep the newlines.
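The effect of that one-line change can be shown with plain Python (a minimal illustration — the loader's batch at that point is just a string of concatenated lines):

```python
batch = "Roses are red\nViolets are blue\n"

# Original loader behavior: splitlines() strips the newline characters,
# so every example loses its trailing "\n".
print(batch.splitlines())      # ['Roses are red', 'Violets are blue']

# Modified behavior: splitlines(True) (i.e. keepends=True) preserves the
# trailing "\n" on each line, so line breaks survive into the examples.
print(batch.splitlines(True))  # ['Roses are red\n', 'Violets are blue\n']
```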
@jncasey Thanks for the rapid reply! I figured the culprit here might be the switch over to huggingface/datasets. How did you end up incorporating this into your workflow? Did you modify other scripts to reference text_with_linebreaks?
Yes, my training script is a sloppily modified version of the run_clm.py example. I added a new training arg for whether to keep the line breaks, and check for that arg in the section where the script determines which loader to use based on the file extension of the data files.
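A minimal sketch of that approach (the names here are illustrative, not the actual run_clm.py code): add a boolean data argument, then use it together with the file extension to decide which loader to hand to load_dataset.

```python
from dataclasses import dataclass

# Hypothetical arguments, mirroring the dataclass pattern run_clm.py uses.
@dataclass
class DataTrainingArguments:
    train_file: str = "train.txt"
    keep_linebreaks: bool = False  # new flag: preserve "\n" in text datasets

def loader_config(args: DataTrainingArguments):
    """Pick the dataset builder name and kwargs from the file extension."""
    extension = args.train_file.rsplit(".", 1)[-1]
    if extension == "txt":
        # Plain text normally goes through the "text" loader; the custom
        # text_with_linebreaks variant keeps newlines intact.
        builder = "text_with_linebreaks" if args.keep_linebreaks else "text"
    else:
        builder = extension  # e.g. "csv" or "json"
    return builder, {"data_files": {"train": args.train_file}}

builder, kwargs = loader_config(DataTrainingArguments(keep_linebreaks=True))
print(builder)  # text_with_linebreaks
```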
Cc @lhoestq to see how we could surface that functionality more easily.
Maybe let's add a `keep_linebreaks` parameter to the text loader? What do you think?
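One way such a parameter could work inside the loader (a sketch under assumptions — `TextConfig` and `split_batch` here are illustrative stand-ins, not the actual datasets internals): the builder's config gains a flag that is forwarded to splitlines().

```python
from dataclasses import dataclass

@dataclass
class TextConfig:
    # Proposed option: keep the trailing newline on each example.
    keep_linebreaks: bool = False

def split_batch(batch: str, config: TextConfig) -> list:
    # keepends=config.keep_linebreaks preserves "\n" only when requested,
    # so the default behavior stays backward-compatible.
    return batch.splitlines(keepends=config.keep_linebreaks)

print(split_batch("stanza one\nstanza two\n", TextConfig(keep_linebreaks=True)))
# ['stanza one\n', 'stanza two\n']
```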
This is already a feature request: https://github.com/huggingface/datasets/issues/870
Thanks for the rapid replies and relevant updates. Would there be interest, then, in surfacing this new functionality an extra level up, in the run_clm.py script? Or should we just modify the relevant load_dataset call in that script?
We will do that as soon as there is a new release of datasets to pin in the requirements! For now, changing the `load_dataset` call in the script (if you have a source install) is the best way.
That seems a fine enough solution to me. Thanks again for the assistance. I'll close the issue for now.
The legacy run_language_modeling.py script produced output that respected line breaks in the train_data_file. The updated run_clm.py script does not. I imagine this is likely due to how the dataset is processed in the new script, but if so, how do I intervene and fix it?
Environment info
transformers version: 4.4.0.dev0

Who can help
Documentation: @sgugger
Information
Model I am using (Bert, XLNet...): GPT2
To reproduce
Steps to reproduce the behavior:
Colab notebooks may be found below:
Original (with legacy run_language_modeling.py): https://colab.research.google.com/drive/1ieS4TuaFNJhuunaAM9wVmyp-n8Yx9_la?usp=sharing
Updated (with updated run_clm.py): https://colab.research.google.com/drive/1dqIzv7WLk7sDOmFhLdMDhyKCIEcvw3lB?usp=sharing
Expected behavior
When using the legacy run_language_modeling.py script, the output is as expected, with the correct line breaks. When running the updated run_clm.py script, line breaks are conspicuously missing.
Is there a straightforward way to remedy this?
My thanks as always for this wonderful repo, all your hard work, and any assistance you might be able to provide.