Chinese text is typically input unsegmented; however, some of the datasets available online are segmented. We can de-segment them with a script like the one below (the script also tries to fix lines in some Chinese datasets that end in a comma rather than a full stop, but that part can be factored out):
#!/usr/bin/env python
import re
import sys

# Remove a whitespace character unless it sits between two ASCII letters,
# so spaces inside runs of English words are preserved.
re_space = re.compile(r"(?<![a-zA-Z])\s(?![a-zA-Z])", flags=re.UNICODE)
# Matches a sentence-final ASCII full stop (produced below from a final comma).
re_final_comma = re.compile(r"\.$")

for line in sys.stdin:
    line = line.strip()
    if not line:
        print(line)
        continue
    # Normalise sentence-final punctuation: a trailing comma becomes a full stop.
    if line[-1] == u"\uFF0C":          # fullwidth comma
        line = line[:-1] + u"\u3002"   # fullwidth full stop
    if line[-1] == ',':
        line = line[:-1] + '.'
    # De-segment: drop spaces that are not between two ASCII letters.
    line = re_space.sub("", line)
    # Use fullwidth punctuation for any remaining commas and a final full stop.
    line = line.replace(",", u"\uFF0C")
    line = re_final_comma.sub(u"\u3002", line)
    print(line)
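Assuming the script is saved as, say, desegment.py (the filename is just for illustration), it reads from standard input and writes to standard output, so it can be dropped into a preprocessing pipeline:

python desegment.py < corpus.zh > corpus.deseg.zh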
The script essentially removes the spaces between Chinese characters by deleting any space that is not flanked by ASCII letters. It can still be improved: the handling of embedded Latin-script text is crude (for example, spaces between numbers or around punctuation inside an English fragment are also removed), so a better version would detect a contiguous substring of Latin-script tokens and leave it entirely untouched.
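One way to do that is sketched below as a minimal illustration; the function name deseg_keep_latin and the exact definition of a "Latin run" are assumptions for the sketch, not part of the script above. The idea is to match contiguous runs of ASCII letters and digits, copy them through unchanged, and strip whitespace only from the text between them:

import re

# A run of ASCII letters/digits, possibly with internal spaces, e.g. "New York" or "Windows 10".
LATIN_RUN = re.compile(r"[A-Za-z0-9][A-Za-z0-9 ]*[A-Za-z0-9]|[A-Za-z0-9]")

def deseg_keep_latin(line):
    out = []
    last = 0
    for m in LATIN_RUN.finditer(line):
        # Outside a Latin run (i.e. the Chinese part of the line): drop all spaces.
        out.append(line[last:m.start()].replace(" ", ""))
        # Inside a Latin run: keep the text as-is, including its internal spaces.
        out.append(m.group(0))
        last = m.end()
    out.append(line[last:].replace(" ", ""))
    return "".join(out)

# Spaces between Chinese characters disappear, the English phrase keeps its space:
# deseg_keep_latin(u"我们 使用 New York 的 数据") -> u"我们使用New York的数据"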