janetzki / GUIDE

Create semantic domain dictionaries for low-resource languages
MIT License

Acquire parallel data from Bloom Library #15

Closed by janetzki 1 year ago

janetzki commented 1 year ago

image (https://huggingface.co/datasets/sil-ai/bloom-lm/viewer/tpi/train)

Goal

As a developer, I want to use books from the Bloom Library as supplementary training data to improve the quality of the word alignment. This would, in turn, increase the dictionary creator's (DC's) precision.

Motivation: more data beats cleverer algorithms. More parallel data -> better alignment -> fewer false positives (FPs) -> higher DC precision.

Example

The Story of Jonah
eng: In those days there was a very large town where many people lived. The town's name was Nineveh.
tpi: Long dispela taim i gat wanpela bikpela taun i gat planti manmeri. Nem bilong dispela taun em Nineveh.
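A minimal sketch of how a book passage like the one above could be turned into sentence pairs for a word aligner. This is not the project's actual pipeline: the helper names `split_sentences` and `pair_sentences` are hypothetical, the sentence splitter is deliberately naive, and pairing by position assumes both translations contain the same number of sentences in the same order, which will not hold for every book.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pair_sentences(src_text: str, tgt_text: str) -> list[tuple[str, str]]:
    # Pair sentences by position. If the counts differ, the passage is skipped
    # rather than risking misaligned pairs that would add noise to training.
    src = split_sentences(src_text)
    tgt = split_sentences(tgt_text)
    if len(src) != len(tgt):
        return []
    return list(zip(src, tgt))


eng = ("In those days there was a very large town where many people lived. "
       "The town's name was Nineveh.")
tpi = ("Long dispela taim i gat wanpela bikpela taun i gat planti manmeri. "
       "Nem bilong dispela taun em Nineveh.")

pairs = pair_sentences(eng, tpi)
for eng_sentence, tpi_sentence in pairs:
    # One pair per line; the " ||| " separator is just an illustrative choice.
    print(f"{eng_sentence} ||| {tpi_sentence}")
```

Skipping passages with mismatched sentence counts trades recall for cleaner pairs, which seems in line with the goal of reducing false positives; a real import would likely need a more robust sentence aligner.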

Tasks