hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
https://pypi.org/project/opuscleaner/

Chinese Traditional <-> Simplified #34

Open ZJaume opened 1 year ago

ZJaume commented 1 year ago

The last time I worked on this, I used OpenCC. It is much more up to date than hanziconv and seems to have an active community. The last release of hanziconv is from 2016.
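For reference, a minimal sketch of what an OpenCC-based filter could look like. The helper name `convert_lines` is made up for illustration, and the config name `"t2s"` is an assumption: depending on which OpenCC Python package is installed, it may need to be `"t2s.json"` instead.

```python
import sys
from typing import Callable, Iterable, Iterator


def convert_lines(lines: Iterable[str], convert: Callable[[str], str]) -> Iterator[str]:
    """Apply `convert` to each line, with the trailing newline stripped."""
    for line in lines:
        yield convert(line.rstrip("\n"))


def main() -> None:
    # Imported here so the helper above stays usable without OpenCC installed.
    from opencc import OpenCC  # third-party package

    # Assumption: "t2s" selects Traditional -> Simplified; "s2t" is the reverse.
    converter = OpenCC("t2s")
    for converted in convert_lines(sys.stdin, converter.convert):
        print(converted)
```

Calling `main()` from a script would turn this into a stdin-to-stdout filter, one sentence per line.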

ZJaume commented 1 year ago

I'll also leave this writing system detector here, which may be useful in the future: https://pypi.org/project/hanzidentifier/

And this simple script to convert all Chinese characters to Pinyin:

import sys
from unicodedata import category

from pypinyin import pinyin
from unidecode import unidecode


def has_punctuation(text):
    # True if any character belongs to a Unicode punctuation category (P*)
    return any(category(ch).startswith('P') for ch in text)


for line in sys.stdin:
    # pinyin() returns a list of lists; keep the first reading of each item
    readings = [item[0] for item in pinyin(line.rstrip('\n'))]
    # Transliterate items that contain punctuation to ASCII with unidecode
    tokens = [unidecode(r) if has_punctuation(r) else r for r in readings]
    print(' '.join(tokens))

Doing pinyin->eng worked really well to avoid unknown characters in the input messing up the output in a zht->eng model.

XapaJIaMnu commented 1 year ago

> Doing pinyin->eng worked really well to avoid unknown characters in the input messing up the output in a zht->eng model.

How did it deal with them though? I'd imagine unknown characters to be names, in which case transliteration would be the target. This bit of code should be a special case of the placeholder.

ZJaume commented 1 year ago

Yes, I think they were mostly names, but maybe terminology too. We were translating domain text quite different from the corpora available on OPUS. The idea behind using Pinyin was to test whether the subwords inside each phoneme could let the model approximate the meaning of an unknown character or word (group of characters). Also, the poorly trained unk token sometimes caused the zht->eng model to produce output like this:

Some of the monkeys can be fixed to "the monkeys to accept" monkeys; the monkeys are too small, and there are shadows.
Is it possible to reform some of the puppets and puppets to make the puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet.

After applying Pinyin, we noticed that the model translated everything else fine, and the names were translated approximately or correctly.

Byte fallback can probably solve this issue without any Pinyin conversion, but I'll leave this here just in case. We didn't know about the byte fallback option at the time.
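To illustrate what byte fallback does, here is a toy sketch. It assumes a character-level vocabulary, which is a simplification (real tokenizers work on subwords), and the function names are made up for illustration. The `<0xNN>` token format follows the convention SentencePiece uses for its byte pieces. A character missing from the vocabulary is split into its UTF-8 bytes instead of collapsing into a single unk token, so no information is lost:

```python
def encode_with_byte_fallback(text, vocab):
    """Keep in-vocabulary characters; split the rest into UTF-8 byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # One token per byte, e.g. "<0xE7>", instead of a single unk
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


def decode_with_byte_fallback(tokens):
    """Reassemble text, merging consecutive byte tokens back into characters."""
    pieces = []
    pending = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            pending.append(int(tok[3:5], 16))
            continue
        if pending:
            pieces.append(pending.decode("utf-8", errors="replace"))
            pending.clear()
        pieces.append(tok)
    if pending:
        pieces.append(pending.decode("utf-8", errors="replace"))
    return "".join(pieces)


vocab = set("我是")  # hypothetical tiny vocabulary
tokens = encode_with_byte_fallback("我是猴", vocab)
print(tokens)
print(decode_with_byte_fallback(tokens))
```

Because the unknown character round-trips through its bytes, the model never sees an unk token in the input, which is the failure mode that produced the repeated "puppet" output above.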