Implement corpus specific fixes for CJK

mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models

https://mozilla.github.io/firefox-translations-training/

Mozilla Public License 2.0


Implement corpus specific fixes for CJK #744

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

Nikolay: The UN corpus doesn't contain full stops (for example), and we use something like this to fix it:

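import sys

for line in sys.stdin:
    line = line.rstrip('\n')  # drop the end-of-line character
    if line.endswith(','):
        # replace a trailing comma with a full stop
        line = line[:-1] + '.'
    if line.endswith(' '):
        line = line[:-1]  # drop one trailing space
    print(line)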

(This script is already integrated into the script that was copied and pasted previously.)

ZJaume commented 3 days ago

There are also corpora (maybe it was UN) that do not use the ideographic full stop character (。) and have the ASCII full stop (.) instead. That should be fixed.
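
A minimal sketch of what such a fix could look like, not an existing pipeline script. It assumes the fix only needs to touch a line-final ASCII full stop that directly follows a CJK ideograph. The function name and the character range (CJK Unified Ideographs only, so no kana and no closing quotes or digits before the stop) are hypothetical choices for illustration.

```python
import re

IDEOGRAPHIC_FULL_STOP = "\u3002"  # 。

# Assumed heuristic: an ASCII "." that directly follows a CJK Unified
# Ideograph and is the last non-whitespace character on the line.
_FINAL_ASCII_STOP = re.compile(r"(?<=[\u4e00-\u9fff])\.(?=\s*$)")


def normalize_final_full_stop(line: str) -> str:
    """Replace a line-final ASCII full stop after an ideograph with 。."""
    return _FINAL_ASCII_STOP.sub(IDEOGRAPHIC_FULL_STOP, line)
```

Anchoring on the end of the line leaves decimals such as 3.5 and lines that end in Latin text or digits untouched, and trailing whitespace is preserved.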

gregtatum commented 1 day ago

Is it appropriate to do data augmentation at training time to swap between . and 。?

ZJaume commented 1 day ago

If the model has to be robust, it is probably a good thing to do. But when Chinese is the target language, everything should be normalized to the ideographic punctuation.
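
A rough sketch of the augmentation idea discussed above, under the assumption that it is applied only to Chinese source lines and that target lines are normalized instead. The function names and the default probability are hypothetical and not part of the training pipeline.

```python
import random

ASCII_STOP = "."
IDEOGRAPHIC_STOP = "\u3002"  # 。


def swap_final_stop(line: str) -> str:
    """Swap a line-final '.' for '。' or the reverse; other lines pass through."""
    if line.endswith(ASCII_STOP):
        return line[:-1] + IDEOGRAPHIC_STOP
    if line.endswith(IDEOGRAPHIC_STOP):
        return line[:-1] + ASCII_STOP
    return line


def augment_source(line: str, rng: random.Random, prob: float = 0.1) -> str:
    """With probability `prob` (an assumed default), swap the final full stop."""
    if rng.random() < prob:
        return swap_final_stop(line)
    return line
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible between runs.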