mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0
143 stars 31 forks source link

[meta] Train harder to segment languages, like CJK languages #425

Open gregtatum opened 7 months ago

gregtatum commented 7 months ago

For harder to segment languages we have Chinese, Japanese, and Korean. We'll need to implement better tokenization support and segmentation support for these languages in order to train them. This work should happen after training a subset of the easier to segment language in #524.

### Perform basic training
- [ ] https://github.com/mozilla/firefox-translations-training/issues/740
- [ ] https://github.com/mozilla/firefox-translations-training/issues/76
- [ ] https://github.com/mozilla/firefox-translations-training/issues/752
- [ ] #424
- [ ] https://github.com/mozilla/firefox-translations-training/issues/745
- [ ] https://github.com/mozilla/firefox-translations-training/issues/747
- [ ] https://github.com/mozilla/firefox-translations-training/issues/746
- [ ] #45
- [ ] Train a basic teacher model
### Implement advanced features
- [ ] https://github.com/mozilla/firefox-translations-training/issues/741
- [ ] https://github.com/mozilla/firefox-translations-training/issues/743
- [ ] https://github.com/mozilla/firefox-translations-training/issues/744
- [ ] https://github.com/mozilla/firefox-translations-training/issues/742
- [ ] https://github.com/mozilla/firefox-translations-training/issues/749
- [ ] https://github.com/mozilla/firefox-translations-training/issues/750
- [ ] https://github.com/mozilla/firefox-translations-training/issues/751
- [ ] https://github.com/mozilla/firefox-translations-training/issues/753
- [ ] https://github.com/mozilla/firefox-translations-training/issues/748
- [ ] Train a good quantized model
### Run production training
- [ ] Train Chinese
- [ ] Train Japanese
- [ ] Train Korean

Native Speakers

If you are a native speaker (L1 language) in any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on matrix. We can always use help with qualitative model evaluation, and questions regarding language.

gregtatum commented 2 weeks ago

In Gecko we're using the Intl.Segmenter JavaScript API. For the CJK work in Gecko, we're exploring using this for all of our sentence segmentation needs as well, rather than ssplit. This API is backed by ICU which employs localizers and translators for supporting the languages of the world.

I did a brief exploration of what it would take to use these APIs with Python. There are some direct bindings to the C library ICU4C. I wrote some quick wrappers around them to create python iterators. In order to incorporate something like this we would need to rely on the Docker layer to provide the built ICU4C library.

Here is an example implementation: https://gist.github.com/gregtatum/8742d88da1ce78137c2d5d6c5f6ad46b

Summarized with these examples:

text = "Hello, world! This is a test of ICU BreakIterator."
for word in get_word_iterator("en", text):
    # [Hello] [world] [This] [is] [a] [test] [of] [ICU] [BreakIterator]
    print(f"[{word}] ", end="")

print("")

text = "こんにちは世界これは単語分割のテストです"
for word in get_word_iterator("ja", text):
    # [こんにちは] [世界] [これ] [は] [単語] [分割] [の] [テスト] [です]
    print(f"[{word}] ", end="")

print("")

text = "The cat (Felis catus), also referred to as domestic cat or house cat, is a small domesticated carnivorous mammal. It is the only domesticated species of the family Felidae. Advances in archaeology and genetics have shown that the domestication of the cat occurred in the Near East around 7500 BC. It is commonly kept as a house pet and farm cat, but also ranges freely as a feral cat avoiding human contact. Valued by humans for companionship and its ability to kill vermin, the cat's retractable claws are adapted to killing small prey like mice and rats. It has a strong, flexible body, quick reflexes, and sharp teeth, and its night vision and sense of smell are well developed. It is a social species, but a solitary hunter and a crepuscular predator. Cat communication includes vocalizations like meowing, purring, trilling, hissing, growling, and grunting as well as cat body language. It can hear sounds too faint or too high in frequency for human ears, such as those made by small mammals. It secretes and perceives pheromones."
for sentence in get_sentence_iterator("en", text):
    print(f"[{sentence}]")
    # [The cat (Felis catus), also referred to as domestic cat or house cat, is a small domesticated carnivorous mammal. ]
    # [It is the only domesticated species of the family Felidae. ]
    # [Advances in archaeology and genetics have shown that the domestication of the cat occurred in the Near East around 7500 BC. ]
    # [It is commonly kept as a house pet and farm cat, but also ranges freely as a feral cat avoiding human contact. ]
    # [Valued by humans for companionship and its ability to kill vermin, the cat's retractable claws are adapted to killing small prey like mice and rats. ]
    # [It has a strong, flexible body, quick reflexes, and sharp teeth, and its night vision and sense of smell are well developed. ]
    # [It is a social species, but a solitary hunter and a crepuscular predator. ]
    # [Cat communication includes vocalizations like meowing, purring, trilling, hissing, growling, and grunting as well as cat body language. ]
    # [It can hear sounds too faint or too high in frequency for human ears, such as those made by small mammals. ]
    # [It secretes and perceives pheromones.]

text = "猫(英语:Cat)通常指家猫(学名:Felis catus,英语:domestic/house cat),是一种家养小型食肉哺乳动物,属小型猫科动物。根据遗传学及考古学分析,人类养猫的纪录可追溯至10,000年前的新月沃土地区。5300年前在目前中国陕西省华县泉护村所在地,就有猫帮助人类捕捉偷食粮食的老鼠的证据,尽管这一研究还无法证实当时这些猫已被驯养成家猫。古埃及人饲养猫的纪录可追溯至公元前1000年前,以防止老鼠吃掉谷物。现在,猫成为世界上最为广泛的宠物之一,饲养率仅次于狗,可以治疗低落的心情与提供情绪价值等。ASPCA研究行为后发现,每只猫领域范围只要不低于1.7平方米,在垂直环境足够丰富的空间就表现出满意,长期室内饲养的猫平均健康则较长,寿命为12年以上,历史上最长寿的猫则达38岁,也因此在部分城市中选择养猫的家户已经超过了狗。但同时猫是一种掠食动物,大量的猫咪因为弃养接触外部环境,除了在恶劣的卫生引发疫情散布,也会造成城市管理问题。其次,因为部分人类不负责任的放养出门的行为,这些猫威胁了鸟类、啮齿类的生存。"
for sentence in get_sentence_iterator("ja", text):
    print(f"[{sentence}]")
    # [猫(英语:Cat)通常指家猫(学名:Felis catus,英语:domestic/house cat),是一种家养小型食肉哺乳动物,属小型猫科动物。]
    # [根据遗传学及考古学分析,人类养猫的纪录可追溯至10,000年前的新月沃土地区。]
    # [5300年前在目前中国陕西省华县泉护村所在地,就有猫帮助人类捕捉偷食粮食的老鼠的证据,尽管这一研究还无法证实当时这些猫已被驯养成家猫。]
    # [古埃及人饲养猫的纪录可追溯至公元前1000年前,以防止老鼠吃掉谷物。]
    # [现在,猫成为世界上最为广泛的宠物之一,饲养率仅次于狗,可以治疗低落的心情与提供情绪价值等。]
    # [ASPCA研究行为后发现,每只猫领域范围只要不低于1.7平方米,在垂直环境足够丰富的空间就表现出满意,长期室内饲养的猫平均健康则较长,寿命为12年以上,历史上最长寿的猫则达38岁,也因此在部分城市中选择养猫的家户已经超过了狗。]
    # [但同时猫是一种掠食动物,大量的猫咪因为弃养接触外部环境,除了在恶劣的卫生引发疫情散布,也会造成城市管理问题。]
    # [其次,因为部分人类不负责任的放养出门的行为,这些猫威胁了鸟类、啮齿类的生存。]