mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Support Chinese discussion #45

Closed: XapaJIaMnu closed this issue 1 month ago

XapaJIaMnu commented 2 years ago

Chinese poses several unique challenges not present in other language pairs. I will start this mega-issue and keep updating the individual points that need to happen for these languages to be fully supported.

  1. Language detection: some Chinese corpora are not tagged with zh but with zh_{tw,zh,hk...} etc.; it would be helpful if find_corpus.py checked for those variants when looking for zh.
  2. Chinese script comes in traditional and simplified varieties, and most big translation vendors support both. Converting traditional to simplified (and vice versa) can easily be done with hanziconv (https://pypi.org/project/hanziconv/0.3/). There may be a very small information loss when converting simplified to traditional, but it should be fine in 99.9% of cases. Some datasets, such as the TED talks, come in traditional, so they should be converted before use.
  3. Preprocessing filters: https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/pipeline/clean/tools/clean_parallel.py#L51 A Chinese character range should be added. In general we can use Unicode ranges to do this, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters In the past I have used something like u'[\u4e00-\u9fff]', but this could be improved (a minimal sketch covering this and the conversion from point 2 follows the script in point 4).
  4. Segmentation. Chinese text is typically written unsegmented, but some of the datasets online contain segmentation. We should use a de-segmentation script like the one below (this script also tries to fix some Chinese datasets that end in a comma instead of a full stop, but that part could be factored out):
    
```python
#!/usr/bin/env python
import re
import sys

# A space that is not adjacent to a Latin letter on either side,
# i.e. a segmentation space in Chinese text
re_space = re.compile(r"(?<![a-zA-Z])\s(?![a-zA-Z])", flags=re.UNICODE)
# A final ASCII full stop, to be replaced with an ideographic full stop
re_final_stop = re.compile(r"\.$")

for line in sys.stdin:
    line = line.rstrip("\n").strip()
    if not line:
        print(line)
        continue
    # Remove all spaces (note: this currently also removes the spaces inside English text)
    line = line.replace(" ", "")
    # Some datasets end sentences in a comma instead of a full stop; fix that
    if line[-1] == "，":
        line = line[:-1] + "。"
    if line[-1] == ",":
        line = line[:-1] + "."
    # Drop any remaining spaces that are not adjacent to Latin letters
    line = re_space.sub("", line)
    # Normalise ASCII punctuation to fullwidth equivalents
    line = line.replace(",", "，")
    line = re_final_stop.sub("。", line)
    print(line)
```

This script essentially tries to identify Chinese characters and remove the spaces between them. It can probably be improved: currently the spaces between English words are lost as well, whereas we should write something more elaborate that detects a continuous substring of English words and leaves it alone.
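
For points 2 and 3, here is a minimal sketch of what such a step could look like, assuming hanziconv's HanziConv.toSimplified for the conversion and only the basic CJK Unified Ideographs range for detection (the extension blocks would need to be added for full coverage; the helper name contains_chinese is just illustrative):

```python
#!/usr/bin/env python
import re
import sys

from hanziconv import HanziConv

# Basic CJK Unified Ideographs block only; extension blocks are not covered here
HAN_RE = re.compile(r"[\u4e00-\u9fff]")

def contains_chinese(text):
    """Return True if the string contains at least one Han character."""
    return HAN_RE.search(text) is not None

for line in sys.stdin:
    line = line.rstrip("\n")
    # Skip lines on the Chinese side that contain no Han characters at all
    if not contains_chinese(line):
        continue
    # Normalise traditional characters to simplified
    print(HanziConv.toSimplified(line))
```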

5. Length filtering. As Chinese sentences normally come as one continuous string of characters, traditional length filtering doesn't work. Furthermore, as one word can be made of 1-4 Chinese characters, we can't have a hard-and-fast conversion rule. What people normally do is use a Chinese tokenizer (like jieba, https://github.com/fxsjy/jieba#jieba-1) to split the Chinese text into words. We can then safely apply the filtering here: https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/pipeline/clean/tools/clean_parallel.py#L93
Most papers recommend discarding lines where the ratio of English to Chinese or Chinese to English words is more than 1.3 (see the sketch after this step).

Afterwards the text should be de-segmented again and prepared for training.
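
A rough sketch of such a ratio filter, assuming jieba for the Chinese side and a simple whitespace split for English (keep_pair and MAX_RATIO are illustrative names, not part of the pipeline):

```python
import jieba

MAX_RATIO = 1.3  # threshold suggested above

def keep_pair(en, zh):
    """Return True if the English/Chinese word-count ratio is within MAX_RATIO."""
    en_words = en.split()
    zh_words = jieba.lcut(zh)  # segment the Chinese side into words
    if not en_words or not zh_words:
        return False
    ratio = max(len(en_words) / len(zh_words),
                len(zh_words) / len(en_words))
    return ratio <= MAX_RATIO
```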

6. Corpus-specific fixes. The UN corpus, for example, doesn't contain full stops, and we use something like this to fix it:
```python
import sys

for line in sys.stdin:
    line = line.rstrip("\n")  # strip end-of-line
    if not line:
        print(line)
        continue
    if line[-1] == ',':
        line = line[:-1] + '.'
    if line[-1] == ' ':
        line = line[:-1]
    print(line)
```

(This logic is already integrated into the de-segmentation script above.)

All of these steps except 2) apply to Japanese as well; a Japanese tokenizer should be used in place of jieba for Japanese.
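
No specific Japanese tokenizer is prescribed here; as one illustrative option, a MeCab wrapper such as fugashi could take jieba's place in the ratio filter above:

```python
from fugashi import Tagger  # MeCab-based tokenizer; one possible choice

# Requires a MeCab dictionary (e.g. unidic-lite) to be installed
tagger = Tagger()

def tokenize_ja(text):
    """Split Japanese text into surface-form words for length filtering."""
    return [word.surface for word in tagger(text)]
```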

ZJaume commented 2 years ago

Hi!

I've been working on a student model for traditional Chinese (not high quality, only for alignment purposes), and maybe some of my experience can be useful for you. These are the line counts per detected script for the corpora I used:

OPUS (neulab_tedtalksv1_train, news_commentary_v14, OPUS_UN_v20090831, OPUS_UNPC_v1_0, OPUS_MultiUN_v1, OPUS_QED_v2_0a):

```
14711511 SIMP
    2267 MIXED
   88777 BOTH
    9403 TRAD
```

WikiMatrix:

```
1141562 SIMP
1047046 TRAD
 221854 MIXED
  28071 BOTH
      1 UNK
```

CCAligned:

```
9686412 SIMP
 109136 MIXED
  77473 BOTH
   5627 TRAD
```

This model (zh_Hant->English) scored 20.3 BLEU on WMT19 converted to traditional with OpenCC.
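
For reference, a minimal sketch of that simplified-to-traditional conversion with OpenCC's Python bindings (depending on the package, the config may need to be passed as 's2t' or 's2t.json'):

```python
import sys

from opencc import OpenCC

converter = OpenCC('s2t')  # simplified -> traditional

for line in sys.stdin:
    print(converter.convert(line.rstrip('\n')))
```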

ZJaume commented 2 years ago

I solved the character coverage issue by training with all the traditional text converted to pinyin, using this script:

```python
from unicodedata import category as cat
from unidecode import unidecode as uni
from pypinyin import pinyin
import sys

# Tell if a string contains punctuation
def is_punc(string):
    return any(cat(i).startswith('P') for i in string)

for line in sys.stdin:
    pyin = pinyin(line.rstrip('\n'))
    # Flatten the list and unidecode strings that contain punctuation
    pyin = [uni(i[0]) if is_punc(i[0]) else i[0] for i in pyin]
    print(' '.join(pyin))
```

The model lost 1 BLEU point and the student a couple more with this approach, but the monkeys (garbage characters in the output) disappeared.

eu9ene commented 1 month ago

Closing in favour of https://github.com/mozilla/firefox-translations-training/issues/425. I've split all the suggestions into different issues and attached them to the meta issue. Let me know if you know of something else that should be done.