MASAKHANE is a research effort for NLP for African languages that is OPEN SOURCE, CONTINENT-WIDE, DISTRIBUTED and ONLINE. This GitHub repository houses the data, code, results and research for building open baseline NLP results for African languages.
Website: masakhane.io
For Africa: To build and facilitate a community of NLP researchers, connect and grow it, spurring and sharing further research, build helpful tools for applications in government, medicine, science and education, to enable language preservation and increase its global visibility and relevance.
For NLP Research: To build data sets and tools to facilitate NLP research on African languages, and to pose new research problems to enrich the NLP research landscape.
For the global researchers community: To discover best practices for distributed research, to be applied by other emerging research communities.
There are many ways to contribute to MASAKHANE.
Want more details? Check out our current initiatives
Join our Slack
Request to join our Google Group
This is so we can feature you on our webpage masakhane.io. Please email the following to masakhanetranslation@gmail.com:
Please be patient while waiting for a response from our email address; we're very behind on our administration in the time of COVID-19.
Typically, if you have some programming experience, we encourage you to start your journey with Masakhane by building a baseline for your language. Feeling nervous about submitting or not sure where to start? Please join our weekly meeting and we will pair you with a mentor!
We have an example colab notebook which trains a model for English-to-Zulu translation. You can select it by going to the GitHub section when opening a new project.
This is a huge challenge, but luckily we have a place to start! At ACL 2019, this paper was published. The short story? It turns out the Jehovah's Witness community has been translating a great many documents, and not all of them are religious. And their language representation is DIVERSE.
Check out this spreadsheet HERE to see if your language is featured, then go to Opus to find the links to the data: http://opus.nlpl.eu/JW300.php
We also provide a script for easy downloading and BPE-preprocessing of JW300 data from OPUS: jw300_utils/get_jw300.py. It requires the installation of the opustools-pkg Python package. Example: For downloading and pre-processing the Acholi (ach) and the Nyaneka (nyk) portions of JW300, call the script like this:
python get_jw300.py ach nyk --output_dir jw300
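If you prefer to launch the download from Python (for example from a notebook cell), the same call can be wrapped as in the sketch below. The helper names `build_jw300_command` and `download_jw300` are hypothetical and not part of the repository; the only things taken from the example above are the script name, the positional language codes and the `--output_dir` option. It assumes you run it from the directory that contains get_jw300.py.

```python
import subprocess
import sys


def build_jw300_command(languages, output_dir="jw300", script="get_jw300.py"):
    """Build the argument list for the download script (hypothetical helper).

    Mirrors the documented call: python get_jw300.py ach nyk --output_dir jw300
    """
    if not languages:
        raise ValueError("Pass at least one language code, e.g. 'ach'")
    # sys.executable is the Python interpreter that is currently running.
    return [sys.executable, script, *languages, "--output_dir", output_dir]


def download_jw300(languages, output_dir="jw300", script="get_jw300.py"):
    """Run the script and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_jw300_command(languages, output_dir, script), check=True)


if __name__ == "__main__":
    # Only print the command here; call download_jw300(["ach", "nyk"]) to run it.
    print(" ".join(build_jw300_command(["ach", "nyk"])))
```

Keeping the command construction separate from the call to `subprocess.run` makes it easy to inspect what will be executed before starting a long download.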
Then we still have some options! Our community has been searching far and wide! Join our Slack and Google Group to discuss a way forward!
Your next step is to use the JW300 dataset in the colab notebook and run it. Most pieces of advice are within the notebook itself. We are constantly improving that notebook and are open to any recommendations. Struggled to get going? Then let's work together to build a notebook that's easier to use! Create a github issue or email us!
Amazing! You've created your first baseline. Now we need to get the code, data and results into this GitHub repository.
In order for us to consider your result submission official, we need a couple of things:
The notebook that will run the code. The notebook MUST run on someone else's account, and the data that it uses should be publicly accessible (i.e. if I download the notebook and run it, it must work, so it shouldn't be using any private files). If you're wondering how to do this, don't fear! Drop us a line and we will work together to make sure the submission is all good! :)
The test sets - in order to replicate this and test against your results, we need saved test sets uploaded separately.
A README.md that describes (a) the data used, which is especially important if it's a combination of sources, (b) any interesting changes to the model, and (c) maybe some analysis of some sentences of the final model.
The model itself. This can be in the form of a Google Drive or Dropbox link. We will be finding a home for our trained models soon. For models to be used for transfer learning, further trained, or deployed, you need to provide the model checkpoint (.ckpt file), the vocabularies (src_vocab.txt, trg_vocab.txt) and the configuration (config.yaml).
The results - the train, dev, and test set BLEU score
We will be further expanding our analysis techniques, so it's super important that we have a copy of the model and test sets now, so we don't need to rerun the training just to do the analysis.
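If your data does not come with a fixed test set, one way to produce the separately saved test files is a seeded random hold-out, sketched below. This is an illustration under stated assumptions, not necessarily how the example notebook splits its data: the function names, the default seed and the choice of a random split are all hypothetical. The output file names follow the test.src / test.tgt convention used in the contribution structure.

```python
import random
from pathlib import Path


def split_off_test_set(src_lines, tgt_lines, test_size, seed=42):
    """Hold out `test_size` aligned sentence pairs (hypothetical helper).

    A fixed seed makes the split reproducible, so others get the same test set.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("Source and target must have the same number of lines")
    if not 0 < test_size < len(src_lines):
        raise ValueError("test_size must be between 1 and the corpus size - 1")
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # shuffle pairs together to keep alignment
    return pairs[test_size:], pairs[:test_size]  # (train, test)


def write_parallel(pairs, src_path, tgt_path):
    """Write aligned pairs to two files, one sentence per line."""
    Path(src_path).write_text(
        "".join(src + "\n" for src, _ in pairs), encoding="utf-8")
    Path(tgt_path).write_text(
        "".join(tgt + "\n" for _, tgt in pairs), encoding="utf-8")
```

For example, `train, test = split_off_test_set(src, tgt, test_size=1000)` followed by `write_parallel(test, "test.src", "test.tgt")` gives two files that can be uploaded alongside the notebook.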
Once you have all of the above, please create a pull request into the repository. See guidelines here.
Also see this as an example for the structure of your contribution
Structure:
/benchmarks
    /<src-lang>-<tgt-lang>
        /<technique> -- this could be "jw300-baseline" or "fine-tuned-baseline" or "nig-newspaper-dataset"
            - notebook.ipynb
            - README.md
            - test.src
            - test.tgt
            - results.txt
            - src_vocab.txt
            - trg_vocab.txt
            - src.bpe
            - [trg.bpe if the bpe model is not joint with src]
            - config.yaml
            - any other files, if you have any
Example:
/benchmarks
    /en-xh
        /xhnavy-data-baseline
            - notebook.ipynb
            - README.md
            - test.xh
            - test.en
            - results.txt
            - src_vocab.txt
            - trg_vocab.txt
            - en-xh.4000.bpe
            - config.yaml
            - preprocessing.py
Here is a link to a pull request that has the relevant things.
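Before opening a pull request, you may want to check your contribution folder against the structure above. The sketch below is one possible check, not an official tool from this repository: the function name `missing_files` is hypothetical, and matching the test sets as `test.*` and the BPE model as `*.bpe` is an assumption based on the file names shown in the structure and the example.

```python
from pathlib import Path

# Fixed file names taken from the structure listed above.
REQUIRED_FILES = [
    "notebook.ipynb",
    "README.md",
    "results.txt",
    "src_vocab.txt",
    "trg_vocab.txt",
    "config.yaml",
]


def missing_files(contribution_dir):
    """Return a list describing what is missing from a contribution folder."""
    root = Path(contribution_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Test sets are named test.src/test.tgt or by language code (test.en, test.xh).
    if len(list(root.glob("test.*"))) < 2:
        missing.append("test.<src> and test.<tgt>")
    # At least one BPE file is expected; a separate target BPE file is optional.
    if not any(root.glob("*.bpe")):
        missing.append("*.bpe")
    return missing


if __name__ == "__main__":
    import sys

    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    problems = missing_files(folder)
    print("All expected files found." if not problems else f"Missing: {problems}")
```

An empty list means every expected file was found; extra files such as preprocessing.py are ignored.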
Feeling nervous about contributing your first pull request or unsure how to proceed? Please don't feel discouraged! Drop us an email or a Slack message and we will work together to get your contribution shipshape!
Cool! So there are many ways to improve results. We've highlighted a few of these in this document. Got other ideas? Drop us a line or submit a PR!
We'd like to highlight that NONE of the trained models are suitable for production usage. In our paper here we explore the performance effects of training such a model on the JW300 dataset: the models are still unable to generalize to non-religious domains. As a rule, one should never deploy an NLP model in a domain that it has not been trained for. And even if it IS trained on the relevant domain, a model should be analysed in detail to understand its biases and potential harms. These models aim to serve as WORK IN PROGRESS to spur more research, and to better understand the failures of such systems.
See Code of Conduct
Bibtex
@article{nekoto2020participatory,
title={Participatory research for low-resourced machine translation: A case study in african languages},
author={{$\forall$}, { } and Nekoto, Wilhelmina and Marivate, Vukosi and Matsila, Tshinondiwa and Fasubaa, Timi and Kolawole, Tajudeen and Fagbohungbe, Taiwo and Akinola, Solomon Oluwole and Muhammad, Shamsuddee Hassan and Kabongo, Salomon and Osei, Salomey and others},
journal={Findings of EMNLP},
year={2020}
}