jerinphilip / ilmulti

Tooling to play around with multilingual machine translation for Indian Languages.
http://preon.iiit.ac.in/~jerin/bhasha
MIT License
21 stars 4 forks source link
indian-languages machine-translation machine-translation-models multilingual-translation multilingual-translations pytorch tokenizer wrappers

ilmulti

This repository houses tooling used to create the models on the leaderboard of WAT-Tasks. We provide wrappers to models which are trained via pytorch/fairseq to translate. Installation and usage intructions are provided below.

Installation

The code is tested to work with the fairseq-fork which is branched from v0.8.0 and torch version 1.0.0.

# --user is optional

# Check requirements.txt, packages for translation:
# fairseq-ilmt@lrec-2020 and torch  are not enabled by default.
python3 -m pip install -r requirements.txt --user  

# Once requirements are installed, you can install ilmulti into library.

python3 setup.py install --user 

Downloading Models: The script scripts/download-and-setup-models.sh downloads the model and dictionary files required for running examples/mm_all.py. Which models to download can be configured in the script.

A working example using the wrappers in this code can be found in this colab notebook. Thanks @Nimishasri.

Usage

from ilmulti.translator import from_pretrained

translator = from_pretrained(tag='mm-all')
sample = translator("The quick brown fox jumps over the lazy dog", tgt_lang='hi')

The code works with three main components:

1. Segmenter

Also sentence-tokenizer. To handle segmenting a block of text into sentences, accounting for some Indian Language delimiters.

  1. PatternSegmenter: There is a bit crude and rule based implementation contributed by Binu Jasim.
  2. PunktSegmenter: changed this to an unsupervised learnt PunktTokenizer

2. Tokenization

We use SentencePiece to as an unsupervised tokenizer for Indian languages, which works surprisingly well in our experiments. There are trained models on whatever corpora we could find for the specific languages in sentencepiece/models of 4000 vocabulary units and 8000 vocabulary units.

Training a joint SentencePiece over all languages lead to character level tokenization for under-represented languages and since there isn't much to gain due to the difference in scripts, we use individual tokenizers for each language. Combined however, this will have less than 4000 x |#languages| as some common English code mixes come in. This however, makes the MT system robust in some sense to code-mixed inputs.

3. Translator

Translator is a wrapper around a fairseq which we have reused for some web-interfaces and demos.