makcedward / nlpaug

Data augmentation for NLP
https://makcedward.github.io/
MIT License
4.46k stars 463 forks

Truncate to model max length in back translation #304

Closed JohnGiorgi closed 2 years ago

JohnGiorgi commented 2 years ago

Currently, there is no truncation of the input text in the back translation augmenter. This leads to hard-to-parse errors when input text longer than the model's max input length is provided (and the model is running on the GPU). This PR fixes that by passing the argument truncation=True to the HF tokenizer, which truncates any text longer than the model's max input size.

Closes #297
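
Conceptually, truncation=True tells the tokenizer to clip the token sequence to the model's maximum input length instead of emitting an over-long input that fails downstream. A minimal sketch of that behavior, using a toy whitespace tokenizer rather than the real HF tokenizer (the function name and the 512-token limit are illustrative assumptions, not nlpaug or transformers APIs):

```python
MODEL_MAX_LENGTH = 512  # illustrative limit; real models define their own max


def toy_tokenize(text, truncation=False, max_length=MODEL_MAX_LENGTH):
    """Toy whitespace tokenizer; with truncation=True, clip to max_length tokens."""
    tokens = text.split()
    if truncation and len(tokens) > max_length:
        tokens = tokens[:max_length]
    return tokens


long_text = " ".join(["word"] * 600)

# Without truncation, the sequence exceeds the model limit (the error case).
assert len(toy_tokenize(long_text)) == 600

# With truncation=True, the input is clipped to the model's max input size.
assert len(toy_tokenize(long_text, truncation=True)) == 512
```

With the real Hugging Face tokenizer, the same flag is passed at encode time, e.g. tokenizer(text, truncation=True), and the limit is taken from the tokenizer's model_max_length.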