Vaaku2Vec claims to deliver state-of-the-art language modeling and text classification for the Malayalam language.
ℹ️ We trained a Malayalam language model on the Wikipedia article dump from October 2018, which contained 55k+ articles. The main difficulty in training a Malayalam language model is text tokenization, since Malayalam is a highly inflectional and agglutinative language. The current model uses the NLTK tokenizer (a better alternative may be tried in the future) with a vocabulary size of 30k. The language model was then used to train a classifier that sorts news articles into five categories (India, Kerala, Sports, Business, Entertainment). The classifier achieved 92% accuracy on this task.
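For concreteness, here is a minimal sketch of the tokenization and vocabulary step described above, assuming NLTK's `word_tokenize` over a plain-text dump. The file name `wiki_ml.txt`, the helper names, and the special tokens are illustrative assumptions, not taken from the repo.

```python
# Minimal sketch: NLTK tokenization + a frequency-capped 30k vocabulary.
# Assumptions (not from the repo): the corpus file 'wiki_ml.txt' and the
# helper/function names below are hypothetical illustrations.
import collections
import nltk

# word_tokenize needs the punkt models (newer NLTK may also ask for 'punkt_tab')
nltk.download("punkt", quiet=True)

VOCAB_SIZE = 30_000  # vocab size reported for the current model


def tokenize_corpus(path):
    """Yield NLTK word tokens for each line of a Malayalam text corpus."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield nltk.word_tokenize(line)


def build_vocab(token_stream, size=VOCAB_SIZE):
    """Keep the `size` most frequent tokens; everything else becomes <unk>."""
    counts = collections.Counter()
    for tokens in token_stream:
        counts.update(tokens)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(size - 2)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


itos, stoi = build_vocab(tokenize_corpus("wiki_ml.txt"))
print(len(itos))  # capped at 30,000
```

Because Malayalam is agglutinative, a whitespace/punctuation tokenizer like this leaves many inflected forms as distinct tokens, so a 30k vocabulary covers less of the corpus than it would for English. Subword tokenizers (e.g. SentencePiece) are the usual alternative the authors allude to when they mention trying something better.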
Note
Since almost three years have passed since this work, a few things have likely changed, such as the release of fastai version 2, which makes reproducing the results a bit more difficult. The dataset has also not been fully made public by the authors of the work.