libindic / indic-trans

The project aims at adding a state-of-the-art transliteration module for cross-transliteration among all Indian languages, including English.
GNU Affero General Public License v3.0

Dataset for training #31

Closed (loretoparisi closed this 5 years ago)

loretoparisi commented 6 years ago

Hello, in the paper "IIIT-H System Submission for FIRE2014 Shared Task on Transliterated Search" I read that the datasets used for training were:

• Monolingual corpora of English, Hindi and Gujarati in their native scripts.
• Word lists with corpus frequencies for English, Hindi, Bengali and Gujarati.
• Word transliteration pairs for Hindi-English, Bengali-English and Gujarati-English.

plus additional crawled Romanized data.

Would it be possible to provide these datasets so that the system can be trained from scratch?

Thank you.

loretoparisi commented 5 years ago

Closing this.