mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0
154 stars 33 forks

Support dataset desegmentation for CJK #743

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

Chinese text is typically written unsegmented; however, some of the datasets available online contain segmented text. We should use a de-segmentation script like this one (the script also tries to fix Chinese datasets whose lines end in a comma instead of a full stop, but that part can be factored out of the script):

#!/usr/bin/env python

import re
import sys

# Remove a space only when neither neighbor is an ASCII letter.
re_space = re.compile(r"(?<![a-zA-Z])\s(?![a-zA-Z])", flags=re.UNICODE)
# Match a sentence-final ASCII full stop.
re_final_period = re.compile(r"\.$")

for line in sys.stdin:
    line = line.strip()
    if not line:
        # Pass empty lines through unchanged (indexing line[-1] would fail).
        print(line)
        continue
    # Replace a dangling final comma with an ideographic full stop.
    if line[-1] == ',':
        line = line[:-1] + "\u3002"
    # Drop spaces that are not between ASCII letters.
    line = re_space.sub("", line)
    # Convert remaining ASCII commas to their full-width form.
    line = line.replace(",", "\uFF0C")
    # Convert a final ASCII full stop to an ideographic one.
    line = re_final_period.sub("\u3002", line)
    print(line)
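To illustrate what the lookaround regex in the script does, here is a small sketch (the sample sentences are made up for demonstration): a space is deleted only when neither of its neighbors is an ASCII letter, so spaces between CJK characters go away while spaces touching Latin words survive.

```python
import re

# Same pattern as re_space in the script above: drop a space only when
# neither the character before nor the character after is an ASCII letter.
re_space = re.compile(r"(?<![a-zA-Z])\s(?![a-zA-Z])")

# Space between two CJK characters: both lookarounds pass, space removed.
print(re_space.sub("", "这 是 一个 例子"))    # -> 这是一个例子

# "hello world" keeps its space (both neighbors are letters);
# the space between the two CJK characters is removed.
print(re_space.sub("", "hello world 你 好"))  # -> hello world 你好
```

Note that a space between digits (e.g. "3 000") would also be removed by this pattern, since digits are not ASCII letters.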

This script essentially tries to identify Chinese characters and remove the spaces between them. It can probably be improved: spacing inside English text can still be lost, so we should write something more sophisticated that detects a contiguous run of English words and leaves it alone.
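One possible refinement along these lines is to invert the condition: instead of deleting spaces that are *not* between ASCII letters, delete only spaces that *are* between two CJK characters, so any non-CJK run (English words, numbers, punctuation) keeps its spacing untouched. This is a sketch, not tested against the real datasets; the Unicode ranges below are an assumption covering the common CJK ideograph and punctuation blocks and may need extending (e.g. for kana).

```python
import re

# Hypothetical CJK character class: CJK punctuation, unified ideographs,
# compatibility ideographs, and halfwidth/fullwidth forms.
CJK = r"[\u3000-\u303F\u4E00-\u9FFF\uF900-\uFAFF\uFF00-\uFFEF]"

# Delete whitespace only when it is surrounded by CJK characters on both sides.
re_cjk_space = re.compile(rf"(?<={CJK})\s+(?={CJK})")

def desegment(line: str) -> str:
    return re_cjk_space.sub("", line)

# The English run in the middle is left completely alone.
print(desegment("机器 翻译 is machine translation 的 英文"))
# -> 机器翻译 is machine translation 的英文
```

A space at a CJK/Latin boundary (e.g. "translation 的") is kept here; whether to drop those too is a policy decision that could be added as a second, narrower pattern.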