ikergarcia1996 / Easy-Translate

Easy-Translate is a script for translating large text files with a SINGLE COMMAND. Easy-Translate is designed to be as easy as possible for beginners and as seamless and customizable as possible for advanced users.
Apache License 2.0
189 stars 306 forks

feat: Add --keep_special_tokens argument to control special token decoding #4

Closed ruanchaves closed 1 year ago

ruanchaves commented 1 year ago

Description

This PR adds the command line argument --keep_special_tokens, thus removing the hardcoded True value for skip_special_tokens in tokenizer.batch_decode.
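A minimal sketch of how such a flag could be wired through to decoding. The helper names build_parser and decode_batch are hypothetical, and treating the flag as a store_true switch is an assumption; the PR text only states that the flag replaces the hardcoded skip_special_tokens=True passed to tokenizer.batch_decode.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real script defines many more arguments.
    parser = argparse.ArgumentParser(description="Sketch of the new flag")
    parser.add_argument(
        "--keep_special_tokens",
        action="store_true",  # assumption: boolean switch, off by default
        help="Keep special tokens (e.g. <unk>) in the decoded output.",
    )
    return parser


def decode_batch(tokenizer, generated_tokens, keep_special_tokens: bool):
    # Previously skip_special_tokens was hardcoded to True; here it is
    # simply the negation of the user-supplied flag.
    return tokenizer.batch_decode(
        generated_tokens, skip_special_tokens=not keep_special_tokens
    )
```

With the flag omitted, decoding behaves as before (special tokens are skipped), so existing command lines would be unaffected under this assumption.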

The reasoning behind this PR is that users should be allowed to use the <unk> token as a separator between sentences before translation. This allows users to translate sentence pairs together, instead of separately, thus avoiding a decrease in the lexical overlap between the two sentences of a pair. For more information, refer to the paper "Translation Artifacts in Cross-lingual Transfer Learning" by Mikel Artetxe, Gorka Labaka, and Eneko Agirre.
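The separator workflow described above might look roughly like this on the user's side. The helpers join_pair and split_pair are hypothetical and not part of Easy-Translate; the sketch also assumes that the translation model carries the <unk> token through to its output, which is not guaranteed.

```python
SEPARATOR = "<unk>"  # the separator token suggested in the PR description


def join_pair(first: str, second: str, sep: str = SEPARATOR) -> str:
    # Build one input line so both sentences are translated together.
    return f"{first} {sep} {second}"


def split_pair(translated: str, sep: str = SEPARATOR) -> tuple[str, str]:
    # Recover the two sentences from a translated line. This only works
    # if the output was decoded with special tokens kept.
    first, found, second = translated.partition(sep)
    if not found:
        raise ValueError(f"separator {sep!r} not found in translation")
    return first.strip(), second.strip()
```

Each joined line would go into the input file before translation, and split_pair would be applied to every line of the output file, which has to be produced with --keep_special_tokens so the separator survives decoding.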

Changes Made

Related Issue

N/A

Additional Information

N/A

ikergarcia1996 commented 1 year ago

Interesting use case, thank you!