swapniljadhav1921 / asamiasami

State-Of-The-Art & ready to use mini NLP models for Indian Languages
MIT License

NLP Models For Indian Languages

Indian-language content makes up less than 10% of the data Google's Multilingual BERT was trained on. Similarly, GPT-3, the latest of the bunch, has less than 7% non-English content. Over years of experiments we observed that the more data there is, and the more accurate it is, the better the model, irrespective of how big the model is. The original attention model by Vaswani et al., given more data and hyper-parameter tuning, held up very well against state-of-the-art models like BERT and GPT-2. minIndicBERT is the result of that experimentation and is trained only on Indian languages.

Machine Instances Used

Data

Installation

Requirements

Install Fairseq

This particular commit of fairseq is the one most compatible with this project. Later commits produce errors.

gdown https://drive.google.com/uc?id=19Dw1WMRKyDOBxzmvbU_Gc9WgdZuMVt_h
tar -xzvf fairseq.tar.gz
cd fairseq
pip install --editable ./
cd ..
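
After the editable install, a quick check that the interpreter can actually find fairseq can save a failed run later. The snippet below is a minimal sketch; the helper name `is_importable` is hypothetical and not part of this project or of fairseq.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("fairseq"):
    print("fairseq is importable")
else:
    # Assumption: the install step above was run from inside the fairseq folder.
    print("fairseq not found; re-run `pip install --editable ./` inside the fairseq folder")
```

Run it with the same Python environment that will later run train.py, since an editable install is only visible to the environment it was installed into.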

Install git LFS

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install

Issues with LFS

Some files were initially added to LFS and later removed. This left unstable file versions, which are currently present in the repo.

The files are large, and the free tier of GitHub has size limits.

I suggest using the files from this location instead: https://drive.google.com/drive/folders/18x_vGGa5v3jT-Zx73u0eKFfDGyw9M_aB?usp=sharing

The folder structure is the same. Replace the files from git with these files, and LFS is then not required.

If you find any issues, please report them here: https://github.com/swapniljadhav1921/asamiasami/issues/2

This is a very inefficient approach, but it will be made more usable later.
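
Because the Drive folder mirrors the repo layout, the replacement can be scripted once the repo is cloned and the Drive folder is downloaded. This is a rough sketch only: the function name `overlay_files` and both directory names in the usage comment are hypothetical, and it assumes the download is an unpacked local folder.

```python
import shutil
from pathlib import Path


def overlay_files(downloaded_dir, repo_dir):
    """Copy every file under downloaded_dir into repo_dir at the same relative path.

    Existing files in repo_dir are overwritten; files that exist only in
    repo_dir are left untouched. Returns the relative paths that were copied.
    """
    src = Path(downloaded_dir)
    dst = Path(repo_dir)
    copied = []
    for path in sorted(src.rglob("*")):
        if path.is_file():
            relative = path.relative_to(src)
            target = dst / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(str(relative))
    return copied


# Hypothetical usage, with placeholder folder names:
# overlay_files("drive_download", "asamiasami")
```

Copying file by file, rather than deleting the clone first, keeps everything in the repo that the Drive folder does not provide, such as the Python sources.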

Install AsamiAsami

git clone https://github.com/swapniljadhav1921/asamiasami.git
cd asamiasami

For more details, please check asamiasami.py, which has a simple code interface. You can choose GPU or CPU through the run_option argument of the class constructor.

indicTranslation

minIndicBERT

Process to Finetune

cd fairseq_installation_path

CUDA_VISIBLE_DEVICES=0 python train.py /path/bin_data/ \
    --restore-file $ROBERTA_PATH \
    --max-positions 512 --max-sentences $MAX_SENTENCES --max-tokens 32768 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 --init-token 0 --separator-token 2 \
    --arch roberta_base --encoder-layers 4 --encoder-embed-dim 512 --encoder-ffn-embed-dim 1024 --encoder-attention-heads 8 \
    --criterion sentence_prediction --classification-head-name $HEAD_NAME --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.1 \
    --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --max-epoch 16 --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --find-unused-parameters --update-freq 8 --skip-invalid-size-inputs-valid-test
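
The command leaves `$TOTAL_NUM_UPDATES` and `$WARMUP_UPDATES` to the user. One way to estimate them from the dataset size is sketched below. The helper `schedule_updates` is hypothetical, and the 6% warmup fraction is an assumption borrowed from the convention in fairseq's RoBERTa fine-tuning examples, not a value this project specifies. The sketch also assumes `--max-sentences`, not `--max-tokens`, is what limits the batch size.

```python
import math


def schedule_updates(num_examples, max_sentences, update_freq=8,
                     max_epoch=16, num_gpus=1, warmup_fraction=0.06):
    """Estimate (total_num_updates, warmup_updates) for a fine-tuning run.

    One optimizer update consumes max_sentences * update_freq * num_gpus
    examples, because gradients are accumulated over update_freq batches.
    """
    effective_batch = max_sentences * update_freq * num_gpus
    updates_per_epoch = math.ceil(num_examples / effective_batch)
    total = updates_per_epoch * max_epoch
    warmup = int(total * warmup_fraction)  # assumed 6% warmup
    return total, warmup


# Example with made-up numbers: 100,000 training examples, 16 sentences per batch.
total, warmup = schedule_updates(num_examples=100_000, max_sentences=16)
print(f"TOTAL_NUM_UPDATES={total} WARMUP_UPDATES={warmup}")
```

The defaults for `update_freq` and `max_epoch` mirror `--update-freq 8` and `--max-epoch 16` from the command above. Everything else is a placeholder to adjust to your data.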


## minIndicLanguageDetector
* A RoBERTa model fine-tuned from the minIndicBERT base model to detect the language of a given text
* Input length: 512 tokens. The sentence tokenizer has a dictionary of ~66k tokens covering 12+ languages and transliterated text.
* Languages Supported : 'english', 'gujarati', 'nepali', 'malayalam', 'kannada', 'marathi', 'hindi', 'bangla', 'tamil', 'telugu', 'punjabi', 'urdu', 'oriya'
* Code Sample
```python
from asamiasami import minIndicLanguageDetector
model = minIndicLanguageDetector(run_option="gpu")
model.getLanguage("Sample Text For Which Language To Be Detected")
```
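
To run the detector over several texts, a small wrapper around `getLanguage` may be convenient. This is a sketch under assumptions: `detect_languages` and `_FakeDetector` are hypothetical names, the return type of `getLanguage` is not documented here so results are stored as returned, and the fake detector exists only so the example runs without loading the model.

```python
from collections import Counter


def detect_languages(detector, texts):
    """Call detector.getLanguage on each non-blank text.

    Returns a list of (text, label) pairs and a Counter of str(label).
    """
    results = []
    for text in texts:
        cleaned = text.strip()
        if not cleaned:
            continue  # skip blank inputs instead of sending them to the model
        results.append((cleaned, detector.getLanguage(cleaned)))
    counts = Counter(str(label) for _, label in results)
    return results, counts


class _FakeDetector:
    """Stand-in for minIndicLanguageDetector; not the real model."""

    def getLanguage(self, text):
        return "english" if text.isascii() else "other"


results, counts = detect_languages(_FakeDetector(), ["hello there", "   ", "नमस्ते"])
print(counts)
```

With the real model, pass `minIndicLanguageDetector(run_option="cpu")` (or `"gpu"`) in place of `_FakeDetector()`.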

minIndicNSFWDetector