mt-opus
English-Thai Machine Translation with OPUS data
Data
We used 9 datasets from OPUS to train and validate our models within and across domains (total 5.4M sentence pairs; 68.8M English tokens and 53.1M Thai tokens).
datasets |
nb_sent |
en_tok |
th_tok |
description |
reference |
OpenSubtitles v2018 |
3.5M |
28.4M |
7.8M |
crowdsourced subtitles |
[1] |
JW300 v1 en th |
0.8M |
14.9M |
34.6M |
Jehovah's Witness site |
[2], [3] |
GNOME v1 |
0.5M |
2.3M |
3.5M |
GNOME documentation |
[2] |
QED v2.0a |
0.3M |
4.7M |
1.2M |
crowdsourced educational subtitles |
[2] |
bible-uedin v1 |
0.1M |
3.6M |
2.1M |
the Bible |
[2], [4] |
Tanzil v1 |
93.5k |
2.8M |
3.4M |
the Quran |
[2] |
KDE4 v2 |
92.0k |
0.5M |
0.2M |
KDE4 documentation |
[2] |
Ubuntu v14.10 |
46.6k |
0.4M |
0.2M |
Ubuntu documentation |
[2] |
Tatoeba v20190709 |
1.1k |
6k |
1.7k |
crowdsourced translations |
[2] |
Models
Results
References
- [1] P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
- [2] J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
- [3] Željko Agić, Ivan Vulić: "JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages", In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019. Acknowledge also OPUS by citing the following article: J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
- [4] A massively parallel corpus: the Bible in 100 languages, Christos Christodoulopoulos and Mark Steedman, Language Resources and Evaluation, 49