smc / malayalam-text-classifier

Malayalam Text Classifier
MIT License

♻️ Reproduce the work of ULMFiT in Vaaku2Vec #3

Open kurianbenoy opened 2 years ago

kurianbenoy commented 2 years ago

Vaaku2Vec claims to provide state-of-the-art language modeling and text classification for the Malayalam language.

ℹ️ We trained a Malayalam language model on the Wikipedia article dump from October 2018. The Wikipedia dump had 55k+ articles. The difficulty in training a Malayalam language model is text tokenization, since Malayalam is a highly inflectional and agglutinative language. In the current model, we are using the nltk tokenizer (we will try a better alternative in the future) and the vocab size is 30k. The language model was used to train a classifier which classifies news articles into 5 categories (India, Kerala, Sports, Business, Entertainment). Our classifier came out with a whopping 92% accuracy in the classification task.
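To make the vocabulary step in the quoted description concrete, here is a minimal sketch of a frequency-capped vocabulary. It is not the Vaaku2Vec authors' code: whitespace splitting stands in for the nltk tokenizer they mention, and the names `build_vocab`, `numericalize`, `<unk>` and `<pad>` are hypothetical and chosen only for illustration.

```python
from collections import Counter

def build_vocab(texts, max_size=30000, specials=("<unk>", "<pad>")):
    """Map the most frequent tokens to integer ids, capped at max_size entries.

    Assumption: plain whitespace splitting is a stand-in for the nltk
    tokenizer mentioned in the quoted README.
    """
    counts = Counter(tok for text in texts for tok in text.split())
    vocab = list(specials)
    for tok, _freq in counts.most_common():
        if len(vocab) >= max_size:
            break
        vocab.append(tok)
    return {tok: idx for idx, tok in enumerate(vocab)}

def numericalize(text, stoi):
    """Convert text to ids; tokens outside the vocab map to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in text.split()]

# Tiny demonstration with a cap of 4 entries (2 specials + 2 tokens).
stoi = build_vocab(["a b a", "a c"], max_size=4)
ids = numericalize("a c b", stoi)
```

With a word-level cap like this, every inflected or compounded form counts as a separate token, so an agglutinative language pushes many rare forms out of the vocabulary and onto `<unk>`. That is presumably the tokenization difficulty the quoted text refers to.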

Note

Almost three years have passed since that work, so I assume a few things have changed. One example is the release of fastai version 2, which will make reproducing the results somewhat harder. Also, the authors have not made the full dataset public.