google-research / t5x


Getting repetitions after pre-training #1573

Open tonyv opened 2 weeks ago

tonyv commented 2 weeks ago

Hello, I am pre-training T5X on a large corpus of text to translate into Japanese. When I translate a simple "Hello", the output repeats the Japanese "Hello" several times as escaped Unicode sequences. The number of repetitions matches the task feature length I have defined.

  1. Is there a setting I can tweak to reduce the number of repetitions, similar to the repetition controls in CTranslate2?
  2. In my preprocessor for the training task, I add the EOS tokens automatically as follows:

```python
preprocessors=[
    seqio.preprocessors.tokenize,
    seqio.preprocessors.append_eos_after_trim,
],
```
  3. Any tips on how to reduce repetitions?
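For question 1, one knob worth experimenting with is the decoding function and its sampling parameters. A rough gin override sketch, assuming the standard `t5x` infer setup (the specific values here are illustrative, not recommendations):

```gin
from __gin__ import dynamic_registration
from t5x import decoding
from t5x import models

# Swap greedy/beam decoding for temperature sampling, which often
# breaks repetition loops at the cost of some determinism.
models.EncoderDecoderModel.decode_fn = @decoding.temperature_sample
decoding.temperature_sample:
  temperature = 0.7
  topk = 40
```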
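T5X does not expose CTranslate2's `no_repeat_ngram_size` out of the box, but the idea behind that option is simple to sketch. A minimal, framework-free illustration of the constraint (the function name is hypothetical; in practice this check would run inside the decoding loop, masking the logits of banned tokens):

```python
def banned_next_tokens(tokens, n=3):
    """Return the set of next tokens that would complete an n-gram
    already present in `tokens` (the "no repeated n-gram" constraint)."""
    if len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[-(n - 1):])  # last n-1 generated tokens
    banned = set()
    # Scan every historical n-gram; if its first n-1 tokens match the
    # current prefix, its final token would create a repeat, so ban it.
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned
```

For example, with generated ids `[1, 2, 3, 1, 2]` and `n=3`, emitting `3` next would repeat the trigram `(1, 2, 3)`, so `3` is banned.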