Easy-Translate is a script for translating large text files with a SINGLE COMMAND. Easy-Translate is designed to be as easy as possible for beginners and as seamless and customizable as possible for advanced users.
Apache License 2.0
189 stars, 306 forks
feat: Add --keep_special_tokens argument to control special token decoding #4
This PR adds the command line argument --keep_special_tokens, thus removing the hardcoded True value for skip_special_tokens in tokenizer.batch_decode.
The reasoning behind this PR is that users should be allowed to use the <unk> token as a separator between sentences before translation. This allows users to translate sentence pairs together, instead of separately, thus avoiding a decrease in the lexical overlap between sentence pairs. For more information, refer to the paper "Translation Artifacts in Cross-lingual Transfer Learning" by Mikel Artetxe, Gorka Labaka, Eneko Agirre.
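A minimal sketch of the sentence-pair workflow described above. The helper names `join_pair` and `split_pair` are hypothetical and not part of Easy-Translate, and the sketch assumes the model's unknown token is the literal string `<unk>` and that it survives translation unchanged, which only holds when special tokens are kept during decoding.

```python
from typing import Tuple

# Assumption: the tokenizer's unknown token is the literal string "<unk>".
SEPARATOR = "<unk>"


def join_pair(first: str, second: str, sep: str = SEPARATOR) -> str:
    """Join two sentences into one input line so they are translated together."""
    return f"{first} {sep} {second}"


def split_pair(translated: str, sep: str = SEPARATOR) -> Tuple[str, str]:
    """Split a translated line back into its two sentences at the separator."""
    first, found, second = translated.partition(sep)
    if not found:
        # The separator was dropped, e.g. because special tokens were skipped.
        raise ValueError(f"separator {sep!r} not found in translated text")
    return first.strip(), second.strip()
```

With this approach each line of the input file would hold one joined pair, and `split_pair` would be applied to each line of the translated output.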
Changes Made
Added --keep_special_tokens command line argument.
Removed hardcoded True value for skip_special_tokens in tokenizer.batch_decode.
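The two changes above could be wired together roughly as follows. This is a sketch, not the actual diff: it assumes the flag is a boolean `store_true` switch, and `build_parser` and `decode_outputs` are hypothetical names. The only library call relied on is `tokenizer.batch_decode(..., skip_special_tokens=...)` from Hugging Face tokenizers, so any object exposing that method can be passed in.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the new flag (the real script has more arguments)."""
    parser = argparse.ArgumentParser(description="Sketch of the new CLI flag.")
    parser.add_argument(
        "--keep_special_tokens",
        action="store_true",  # assumption: boolean switch, off by default
        help="Keep special tokens such as <unk> in the decoded output.",
    )
    return parser


def decode_outputs(tokenizer, generated_ids, keep_special_tokens: bool):
    """Decode generated ids, skipping special tokens unless the flag is set."""
    return tokenizer.batch_decode(
        generated_ids,
        skip_special_tokens=not keep_special_tokens,
    )
```

Without the flag, decoding behaves as before (`skip_special_tokens=True`); passing `--keep_special_tokens` inverts it so that `<unk>` separators remain in the output.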
Related Issue
N/A
Additional Information
N/A