Bridging India's language barrier with AI-driven multilingual NMT models
Anuvaad is an AI-based, open-source document translation platform for translating documents in Indic languages at scale. It provides easy editing capabilities on top of plug-and-play NMT models.
Separate instances of Anuvaad are deployed for the Supreme Court of India (SUVAS), the Supreme Court of Bangladesh (Amar Vasha), and Diksha (NCERT).
Service | Build Status |
---|---|
Zuul | |
NMT | |
Workflow Manager | |
Aligner | |
User Management | |
Tokeniser | |
Translator | |
Read in depth about the architecture and codebase of Anuvaad here: https://anuvaad.sunbird.org
Component | Details |
---|---|
Workflow Manager (WM) | Centralized orchestrator that coordinates the services according to the user request. |
Auditor | Python package/library used for formatting and exception handling. |
File Uploader | Microservice to upload and maintain user documents. |
File Converter | Microservice that converts files from one format to another, e.g. .doc to .pdf. |
Aligner | Microservice that accepts source and target sentences and aligns them to form a parallel corpus. |
Tokenizer | Microservice that tokenises paragraphs into independently translatable sentences. |
Layout Detector | Microservice interface for Layout detection model. |
Block Segmenter | Handles layout detection misclassifications and region unification. |
Word Detector | Word detection. |
Block Merger | An OCR system that extracts text, images, tables, blocks, etc. from the input file and makes them available in a format that downstream services can use for translation. It can also be used as an independent product to perform OCR on files, images, PPTs, etc. |
Translator | Pushes sentences to IndicTrans during the document translation flow and receives the translated sentences back. |
Content Handler | Repository microservice that maintains and manages all translated documents. |
Translation Memory X (TMX) | System translation memory that lets user-preferred translations override NMT translations. TMX provides three levels of caching: Global, User, and Organisation. |
User Translation Memory (UTM) | Tracks and remembers individual users' translations and corrections, and applies them automatically when the same sentences are encountered again. |
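
The TMX override described in the table can be pictured with a minimal Python sketch. It is illustrative only and not taken from the Anuvaad codebase: the in-memory `store`, the function names, and the `fake_nmt` placeholder are hypothetical, the lookup is a simple exact match on the whole input, and the order in which the three levels are consulted (User, then Organisation, then Global) is an assumption, since the table names the levels but not their precedence.

```python
from typing import Callable, Dict, Optional

# Hypothetical in-memory stand-in for the TMX store: one dict per level,
# each mapping a source phrase to the preferred translation.
TmxStore = Dict[str, Dict[str, str]]

# Assumed precedence: most specific level first. The three level names come
# from the component table; the lookup order is an assumption.
LEVELS = ("user", "organisation", "global")


def tmx_lookup(store: TmxStore, phrase: str) -> Optional[str]:
    """Return the first preferred translation found, or None."""
    for level in LEVELS:
        hit = store.get(level, {}).get(phrase)
        if hit is not None:
            return hit
    return None


def translate(phrase: str, store: TmxStore, nmt: Callable[[str], str]) -> str:
    """Use a TMX entry when one exists; otherwise fall back to the NMT model."""
    preferred = tmx_lookup(store, phrase)
    return preferred if preferred is not None else nmt(phrase)


def fake_nmt(phrase: str) -> str:
    # Placeholder for the real NMT call (IndicTrans in Anuvaad).
    return f"<nmt:{phrase}>"


store: TmxStore = {
    "global": {"court": "global-term"},
    "organisation": {"court": "org-term"},
    "user": {},
}
print(translate("court", store, fake_nmt))  # org-term (organisation entry wins over global)
print(translate("bench", store, fake_nmt))  # <nmt:bench> (no TMX entry, NMT fallback)
```

Passing the NMT call in as a function keeps the override logic independent of the model, so the same lookup could sit in front of any translation backend.
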
Component | Details |
---|---|
PRIMA | Layout detection model. |
CRAFT | Used for line detection. |
Tesseract | Custom-trained Tesseract used for OCR. |
IndicTrans | Custom-trained Indic NMT model used for translation. |
Component | Details |
---|---|
Apache Kafka | Translator and IndicTrans are integrated through Kafka messaging. |
MongoDB | Primary data storage. |
Redis | Secondary in-memory storage. |
Cloud Storage | Samba storage is used to store user input files. |
NGINX | Serves as a redirection server and also handles system-level configuration. NGINX acts as the gateway. |
Zuul | API gateway that applies filters on client requests to authenticate, authorize, and throttle them. |
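
The Kafka-based round trip between the Translator and IndicTrans listed above can be sketched as follows. This is a rough illustration under stated assumptions, not Anuvaad's implementation: two `queue.Queue` objects stand in for Kafka topics so the example runs without a broker, and the topic names, message fields, and `fake_indictrans` function are all hypothetical.

```python
import queue
import threading

# In-memory stand-ins for two Kafka topics (hypothetical names).
translate_input = queue.Queue()
translate_output = queue.Queue()
STOP = object()  # sentinel that tells the worker to exit


def fake_indictrans(text: str) -> str:
    # Placeholder for the NMT model call; it only upper-cases the text.
    return text.upper()


def nmt_worker() -> None:
    """Consume sentences from the input topic and publish translations."""
    while True:
        msg = translate_input.get()
        if msg is STOP:
            break
        translate_output.put({"id": msg["id"], "tgt": fake_indictrans(msg["src"])})


def translate_document(sentences):
    """Push every sentence, then collect results and restore document order."""
    for i, src in enumerate(sentences):
        translate_input.put({"id": i, "src": src})
    results = [translate_output.get(timeout=5) for _ in sentences]
    return [r["tgt"] for r in sorted(results, key=lambda r: r["id"])]


worker = threading.Thread(target=nmt_worker, daemon=True)
worker.start()
translated = translate_document(["first sentence", "second sentence"])
translate_input.put(STOP)
worker.join()
print(translated)  # ['FIRST SENTENCE', 'SECOND SENTENCE']
```

Each message carries an `id` so the caller can reassemble the document in order even when results arrive out of sequence, which matters once several workers consume from the same topic.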