Hi!
I've been working with a student on a Traditional Chinese model (not high quality, only for alignment purposes), and maybe some of my experience can be useful for you.
Per-corpus sentence counts by script variety:

| Corpus | SIMP | TRAD | MIXED | BOTH | UNK |
| --- | --- | --- | --- | --- | --- |
| OPUS (neulab_tedtalksv1_train, news_commentary_v14, OPUS_UN_v20090831, OPUS_UNPC_v1_0, OPUS_MultiUN_v1, OPUS_QED_v2_0a) | 14711511 | 9403 | 2267 | 88777 | |
| WikiMatrix | 1141562 | 1047046 | 221854 | 28071 | 1 |
| CCAligned | 9686412 | 5627 | 109136 | 77473 | |
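The comment doesn't say how these labels were produced; a minimal sketch that yields the same SIMP/TRAD/MIXED/BOTH/UNK buckets, assuming the `hanzidentifier` package (my assumption, not a tool named in this issue):

```python
# Hedged sketch: count script varieties per line, one corpus on stdin.
import sys
from collections import Counter
import hanzidentifier

LABELS = {
    hanzidentifier.SIMPLIFIED: "SIMP",
    hanzidentifier.TRADITIONAL: "TRAD",
    hanzidentifier.MIXED: "MIXED",   # mixes incompatible variants
    hanzidentifier.BOTH: "BOTH",     # valid as either variant
    hanzidentifier.UNKNOWN: "UNK",   # no Chinese characters found
}

counts = Counter(LABELS[hanzidentifier.identify(line)] for line in sys.stdin)
for label, n in counts.most_common():
    print(n, label)
```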
For traditional-simplified conversion I found OpenCC, which seems to have better support and more active development than hanziconv. Punctuation also needs to be converted separately.
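A minimal sketch of the conversion using OpenCC's Python bindings (packaging and config naming vary between the OpenCC distributions, so treat the details as an assumption):

```python
# 't2s.json' converts Traditional -> Simplified; 's2t.json' is the reverse.
import opencc

converter = opencc.OpenCC('t2s.json')
print(converter.convert('漢語'))  # -> 汉语

# Note: OpenCC converts characters only; ASCII punctuation such as
# ',' and '.' still has to be mapped to '，' and '。' in a separate pass.
```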
I then translated text from Traditional Chinese to English, from a domain that was not present in the training corpora and that had a considerable number of characters unknown to the SentencePiece vocabulary. These characters were logographs but also punctuation (I realised that the conversion to traditional didn't convert ASCII punctuation to Chinese punctuation, so those marks were unknown to the vocab). The result was a lot of random behaviour that wasn't detected with the WMT test sets. These are examples from the output:
> Some of the monkeys can be fixed cursed monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey
>
> Only the emperor and his sin's puppets need to do so.
>
> First of all, we focus on the system rather than any kind of skill; only how to interact with the strength of the puppets of the puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet
>
> Some of the monkeys can be fixed to "the monkeys to accept" monkeys; the monkeys are too small, and there are shadows.
Some of the unknown characters that were part of a noun were being translated as 'monkey' or 'puppet', and the absence of an ASCII period at the end caused the repetition.
This model (zh_Hant->English) scored 20.3 BLEU on WMT19 converted to traditional with OpenCC.
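Characters outside a SentencePiece vocabulary can be listed up front rather than discovered through monkeys. A minimal sketch, assuming a hypothetical model file `vocab.spm` and a vocabulary trained without byte fallback (otherwise nothing maps to `<unk>`):

```python
# Print every character on stdin that the model can only encode as <unk>.
import sys
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="vocab.spm")  # hypothetical path
unk = sp.unk_id()

unknown = set()
for line in sys.stdin:
    for ch in set(line.rstrip("\n")):
        if ch not in unknown and unk in sp.encode(ch, out_type=int):
            unknown.add(ch)

for ch in sorted(unknown):
    print(f"U+{ord(ch):04X}", ch)
```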
I solved the character-coverage issue by training with all the traditional text converted to pinyin, using this script:

```python
from unicodedata import category as cat
from unidecode import unidecode as uni
from pypinyin import pinyin
import sys

# Tell if a string contains punctuation (any Unicode 'P*' category)
def is_punc(string):
    return any(cat(i).startswith('P') for i in string)

for line in sys.stdin:
    pyin = pinyin(line.rstrip('\n'))
    # Flatten the list and unidecode strings containing punctuation
    pyin = [uni(i[0]) if is_punc(i[0]) else i[0] for i in pyin]
    print(' '.join(pyin))
```
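If the script is saved as, say, `zh2pinyin.py` (a filename of my choosing, not from the issue), the corpus conversion is just `python zh2pinyin.py < corpus.zh > corpus.pinyin` on the traditional side of the training data.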
The model lost 1 BLEU point with this approach (and the student's model a couple more), but the monkeys disappeared.
Closing in favour of https://github.com/mozilla/firefox-translations-training/issues/425. I've split all the suggestions into different issues and attached them to the meta issue. Let me know if you know of something else that should be done.
Chinese poses several unique challenges not present in other language pairs. I will start this mega-issue and keep updating the individual points that need to happen for these languages to be fully supported.
`find_corpus.py` checks for those when checking for `zh`, using `u'[\u4e00-\u9fff]'`, but this may be improved. The following script removes extraneous spaces and converts ASCII punctuation to Chinese punctuation:

```python
import re
import sys

# Drop whitespace unless it separates two Latin letters
re_space = re.compile(r"(?<![a-zA-Z])\s(?![a-zA-Z])", flags=re.UNICODE)
# Match a final ASCII period
re_final_period = re.compile(r"\.$")

for line in sys.stdin:
    line = line.rstrip('\n').strip()
    if not line:
        print(line)
        continue
    # A trailing ASCII comma becomes an ideographic full stop
    if line[-1] == ',':
        line = line[:-1] + u"\u3002"
    line = re_space.sub("", line)
    # Remaining ASCII commas become fullwidth commas
    line = line.replace(",", u"\uFF0C")
    # A final ASCII period becomes an ideographic full stop
    line = re_final_period.sub(u"\u3002", line)
    print(line)
```
(This script is integrated into the script copy/pasted earlier.)
All of these steps except 2) apply to Japanese as well. A Japanese tokenizer should be used in place of jieba for Japanese; a sketch of that swap follows.
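A minimal comparison of jieba with one possible Japanese tokenizer, fugashi (a MeCab wrapper; the issue doesn't prescribe a specific tool, so this choice is mine):

```python
# Chinese vs. Japanese word segmentation; fugashi is only one option.
import jieba
from fugashi import Tagger

zh = "今天天气很好"        # "The weather is nice today" (Chinese)
ja = "今日はいい天気です"  # "The weather is nice today" (Japanese)

print(" ".join(jieba.cut(zh)))
tagger = Tagger()
print(" ".join(word.surface for word in tagger(ja)))
```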