vistec-AI / mt-opus

English-Thai Machine Translation with OPUS data




We used 9 datasets from OPUS to train and validate our models within and across domains (total 5.4M sentence pairs; 68.8M English tokens and 53.1M Thai tokens).

| datasets | nb_sent | en_tok | th_tok | description | reference |
| --- | --- | --- | --- | --- | --- |
| OpenSubtitles v2018 | 3.5M | 28.4M | 7.8M | crowdsourced subtitles | [1] |
| JW300 v1 en th | 0.8M | 14.9M | 34.6M | Jehovah's Witness site | [2], [3] |
| GNOME v1 | 0.5M | 2.3M | 3.5M | GNOME documentation | [2] |
| QED v2.0a | 0.3M | 4.7M | 1.2M | crowdsourced educational subtitles | [2] |
| bible-uedin v1 | 0.1M | 3.6M | 2.1M | the Bible | [2], [4] |
| Tanzil v1 | 93.5k | 2.8M | 3.4M | the Quran | [2] |
| KDE4 v2 | 92.0k | 0.5M | 0.2M | KDE4 documentation | [2] |
| Ubuntu v14.10 | 46.6k | 0.4M | 0.2M | Ubuntu documentation | [2] |
| Tatoeba v20190709 | 1.1k | 6k | 1.7k | crowdsourced translations | [2] |
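
For readers who want to compute similar statistics on their own data, here is a minimal sketch of how the `nb_sent`, `en_tok`, and `th_tok` columns could be derived from line-aligned sentence pairs. It is not the script used to produce the table above. The function names `corpus_stats` and `whitespace_tokenize` are hypothetical. Because Thai is written without spaces between words, a word segmenter has to be supplied by the caller; which tokenizer was used for the reported counts is not stated here, so the demo uses a toy segmenter on pre-segmented text.

```python
from typing import Callable, Iterable, List, Tuple


def whitespace_tokenize(text: str) -> List[str]:
    """Split on whitespace; a reasonable default for English only."""
    return text.split()


def corpus_stats(
    en_lines: Iterable[str],
    th_lines: Iterable[str],
    th_tokenize: Callable[[str], List[str]],
    en_tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> Tuple[int, int, int]:
    """Return (nb_sent, en_tok, th_tok) for line-aligned sentence pairs.

    Pairs in which either side is empty after stripping are skipped.
    This filtering rule is an assumption, not something the table specifies.
    """
    nb_sent = en_tok = th_tok = 0
    for en, th in zip(en_lines, th_lines):
        en, th = en.strip(), th.strip()
        if not en or not th:
            continue
        nb_sent += 1
        en_tok += len(en_tokenize(en))
        th_tok += len(th_tokenize(th))
    return nb_sent, en_tok, th_tok


# Toy demo: the Thai side is pre-segmented with "|" so that no external
# segmenter is needed. Real data would need a proper Thai word tokenizer.
demo_en = ["I like cats .", "Hello world"]
demo_th = ["ฉัน|ชอบ|แมว", "สวัสดี|โลก"]
print(corpus_stats(demo_en, demo_th, th_tokenize=lambda s: s.split("|")))
# prints (2, 6, 5)
```

Token counts depend heavily on the tokenizer, so numbers computed this way are only comparable across datasets when the same tokenizers are used for all of them.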


