mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0
154 stars 33 forks

Support dataset desegmentation for CJK #743

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

Chinese text is typically written unsegmented; however, some of the datasets available online contain segmented text. We should use a de-segmentation script like this one (the script also tries to fix Chinese datasets whose lines end in a comma instead of a full stop, but that part can be factored out of the script):

#!/usr/bin/env python

import re
import sys

# Remove a space only when neither neighbor is an ASCII letter.
re_space = re.compile(r"(?<![a-zA-Z])\s(?![a-zA-Z])", flags=re.UNICODE)
# Match a sentence-final ASCII full stop.
re_final_period = re.compile(r"\.$")

for line in sys.stdin:
    line = line.strip()
    if not line:
        # Pass empty lines through unchanged (indexing line[-1] would fail).
        print(line)
        continue
    # Replace a dangling final comma with an ideographic full stop.
    if line[-1] == ',':
        line = line[:-1] + "\u3002"
    # Drop spaces that are not between ASCII letters.
    line = re_space.sub("", line)
    # Convert remaining ASCII commas to their full-width form.
    line = line.replace(",", "\uFF0C")
    # Convert a final ASCII full stop to an ideographic one.
    line = re_final_period.sub("\u3002", line)
    print(line)
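To illustrate what the lookaround regex in the script does, here is a small sketch (the sample sentences are made up for demonstration): a space is deleted only when neither of its neighbors is an ASCII letter, so spaces between CJK characters go away while spaces touching Latin words survive.

```python
import re

# Same pattern as re_space in the script above: drop a space only when
# neither the character before nor the character after is an ASCII letter.
re_space = re.compile(r"(?<![a-zA-Z])\s(?![a-zA-Z])")

# Space between two CJK characters: both lookarounds pass, space removed.
print(re_space.sub("", "这 是 一个 例子"))    # -> 这是一个例子

# "hello world" keeps its space (both neighbors are letters);
# the space between the two CJK characters is removed.
print(re_space.sub("", "hello world 你 好"))  # -> hello world 你好
```

Note that a space between digits (e.g. "3 000") would also be removed by this pattern, since digits are not ASCII letters.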

This script essentially tries to identify Chinese characters and remove the spaces between them. It can probably be improved: spacing inside English text can still be lost, so we should write something more sophisticated that detects a contiguous run of English words and leaves it alone.
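One possible refinement along these lines is to invert the condition: instead of deleting spaces that are *not* between ASCII letters, delete only spaces that *are* between two CJK characters, so any non-CJK run (English words, numbers, punctuation) keeps its spacing untouched. This is a sketch, not tested against the real datasets; the Unicode ranges below are an assumption covering the common CJK ideograph and punctuation blocks and may need extending (e.g. for kana).

```python
import re

# Hypothetical CJK character class: CJK punctuation, unified ideographs,
# compatibility ideographs, and halfwidth/fullwidth forms.
CJK = r"[\u3000-\u303F\u4E00-\u9FFF\uF900-\uFAFF\uFF00-\uFFEF]"

# Delete whitespace only when it is surrounded by CJK characters on both sides.
re_cjk_space = re.compile(rf"(?<={CJK})\s+(?={CJK})")

def desegment(line: str) -> str:
    return re_cjk_space.sub("", line)

# The English run in the middle is left completely alone.
print(desegment("机器 翻译 is machine translation 的 英文"))
# -> 机器翻译 is machine translation 的英文
```

A space at a CJK/Latin boundary (e.g. "translation 的") is kept here; whether to drop those too is a policy decision that could be added as a second, narrower pattern.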