masakhane-io / masakhane-mt

Machine Translation for Africa
MIT License
278 stars 206 forks source link
african-languages machine-translation nlp

Masakhane - A living collection of NLP projects for Africans, by Africans

PRs Welcome Slack Status

MASAKHANE is an research effort for NLP for African languages that is OPEN SOURCE, CONTINENT-WIDE, DISTRIBUTED and ONLINE. This GitHub repository houses the data, code, results and research for building open baseline NLP results for African languages.

Website: masakhane.io

Goals

Hall of Fame for our Contributors

Progress

How can I contribute?

There are many ways to contribute to MASAKHANE.

  1. TRAIN A MODEL - Contribute a trained model and related code for your language
  2. ANALYSIS - Contribute analysis of data/models for any African languages. You do not need any technical experience for this! If you're a linguist, we can pair you up with a machine translation practitioner and you can help contribute analysis
  3. DATA - Help build or find datasets for your language
  4. DOCUMENTATION - Help document our discussions, progress. This is VERY much needed. Or contribute to documentation of the base "notebook" that will improve the experience of others
  5. MENTORSHIP - Provide advice or help tune models for their languages and datasets, or help people get started
  6. ADMIN - Working with so many researchers can be quite a challenge! Help out with administrative tasks
  7. COMPUTE - Help with infrastructure and compute! Do you have spare compute to donate? Let us know! We're always looking for more!
  8. BRAINSTORM Join our weekly meetings, provide advice or ideas
  9. STORY-TELLING - Tell our stories to the world by doing talks about the community, contributing to our Medium publication, or engaging with media outlets
  10. MLOps & ML Engineering - Do you enjoy delving into the MLOps side of machine learning? Are you a software developer looking to hone-in on your ML engineer abilities? Join us to help build tools to support out reproducability, data gathering, and model sharing!

Want more details? Check out our current initiatives

How do I join?

  1. Join our Slack

  2. Request to join our Google Group

  3. This is so we can feature you on our webpage masakhane.io. Please email the following to masakhanetranslation@gmail.com:

    • Your Full Name
    • A preferred social media link
    • The language(s) you'll be working on (or your general relevant specialty - if you're an expert at machine translation and - would like to boost the community through that)
    • A picture
    • Your affiliation and role.

Please be patient with a response via our email address, we're very behind on our administration, in the time of COVID-19.

Building your first machine translation model

Typically, if you have some programming experience, we encourage you to start on your journey with Masakhane, by building a baseline for your language. Feeling nervous to submit or not sure where to start? Please join our weekly meeting and we will pair you with a mentor!

1. Have a look at the example code

Open In Colab
We have an example colab notebook which trains a model for English-to-Zulu translation. You can select it by going to the GitHub section when opening a new project.

2. Finding data for my language?!

This is a huge challenge, but luckily we have a place to start! At ACL 2019, this paper was published. The short story? Turns out the Jehovah's Witness community has been translating many many documents and not all of them are religious. And their language representation is DIVERSE.

Check out this spreadsheet HERE to see if your language is featured, then go to Opus to find the links to the data: http://opus.nlpl.eu/JW300.php

We also provide a script for easy downloading and BPE-preprocessing of JW300 data from OPUS: jw300_utils/get_jw300.py. It requires the installation of the opustools-pkg Python package. Example: For dowloading and pre-processing the Acholi (ach) and the Nyaneka (nyk) portions of JW300, call the script like this: python get_jw300.py ach nyk --output_dir jw300

Can't find your language in the JW300 dataset?

Then we still have some options! Our community has been searching wide and far! Join our slack and google group to discuss a way forward!

3. Run the notebook!

Your next step is to use the JW300 dataset in the colab notebook and run it. Most pieces of advice are within the notebook itself. We are constantly improving that notebook and are open to any recommendations. Struggled to get going? Then let's work together to build a notebook that's easier to use! Create a github issue or email us!

4. It's done! I have results! Now what?

Amazing! You're created your first baseline. Now we need to get the code and data and results into this github repository

In order for us to consider your result submission official, we need a couple of things:

  1. The notebook that will run the code. The notebook MUST run on on someone else account and the data that it uses should be publically accessible (i.e. if I download the notebook and run it, it must work - so shouldn't be using any private files). If you're wondering how to do this, don't fear! Drop us a line and we will work together to make sure the submission is all good! :)

  2. The test sets - in order to replicate this and test against your results, we need saved test sets uploaded separately.

  3. A README.md that describes the (a) the data used - esp important if it's a combination of sources (b) any interesting changes to the model (c) maybe some analysis of some sentences of the final model

  4. The model itself. This can be in the form of a google drive or dropbox link. We will be finding a home for our trained models soon. For models to be used for transfer learning, further trained, or deployed, you need to provide:

    1. a checkpoint with the parameters (.ckpt file),
    2. the source and target vocabulary (src_vocab.txt, trg_vocab.txt),
    3. the configuration file (config.yaml),
    4. and if applicable: the BPE codes or scripts for your pre-processing pipeline. Joey NMT saves the first three in the model directory.
  5. The results - the train, dev, and test set BLEU score

We will be further expanding our analysis techniques so it's super important we have a copy of the model and test sets now so we don't need to rerun the training just to do the analysis

Once you have all of the above, please create a pull request into the repository. See guidelines here.

Structure of my PR:

Also see this as an example for the structure of your contribution

Structure:

/benchmarks
  /<src-lang>-<tgt-lang>
    /<technique> -- this could be "jw300-baseline" or "fine-tuned-baseline" or "nig-newspaper-dataset"
      - notebook.ipynb
      - README.md
      - test.src
      - test.tgt
      - results.txt
      - src_vocab.txt
      - trg_vocab.txt
      - src.bpe
      - [trg.bpe if the bpe model is not joint with src]
      - config.yaml
      - any other files, if you have any

Example:

/benchmarks
  /en-xh
    /xhnavy-data-baseline
      - notebook.ipynb
      - README.md
      - test.xh
      - test.en
      - results.txt
      - src_vocab.txt
      - trg_vocab.txt
      - en-xh.4000.bpe
      - config.yaml
      - preprocessing.py

Here is a link to a pull request that has the relevant things.

Feeling nervous about contributing your first pull request or unsure how to proceed? Please don't feel discouraged! Drop us an email or a slack message and we will work together to get your contribution in ship shape!

5. I've got a baseline. What do I do to improve it?

Cool! So there are many ways to improve results. We've highlighed a few of these in this document. Got other ideas? Drop us a line or submit a PR!

Notes about Model Deployment

We'd like to highlight how NONE of the trained models are suitable for production usage. In our paper here we explore the performance effects of training such a model on the JW300 datasets - the models are still unable to generalize to non-religious domains. As a rule, one should never deploy an NLP model in a domain that it has not been trained for. And even if it IS trained on the relevant domain, a model should be analysed in detail to understand the biases and potential harms. These models aim to serve as WORK IN PROGRESS to spur more research, and to better understand the failure of such systems.

Code of Conduct

See Code of Conduct

Reference

Bibtex

@article{nekoto2020participatory,
  title={Participatory research for low-resourced machine translation: A case study in african languages},
  author={{$\forall$}, { } and Nekoto, Wilhelmina and Marivate, Vukosi and Matsila, Tshinondiwa and Fasubaa, Timi and Kolawole, Tajudeen and Fagbohungbe, Taiwo and Akinola, Solomon Oluwole and Muhammad, Shamsuddee Hassan and Kabongo, Salomon and Osei, Salomey and others},
  journal={Findings of EMNLP},
  year={2020}
}