goru001 / inltk

Natural Language Toolkit for Indic Languages aims to provide out of the box support for various NLP tasks that an application developer might need
https://inltk.readthedocs.io
MIT License
818 stars 164 forks

POS tagging #13

Open TviNet opened 5 years ago

TviNet commented 5 years ago

https://universaldependencies.org/ has labelled data for parts of speech, dependencies, and morphology for Hindi, Sanskrit, Marathi, Tamil and Telugu. I plan on using an LM-LSTM-CRF architecture for sequence tagging. However, the language models in iNLTK use sentencepiece tokens. Could anyone guide me on using the existing LMs with word tokens, or do I need to retrain the word embeddings for word tokens?

goru001 commented 5 years ago

@TviNet Thanks for reaching out! I glanced over the LM-LSTM-CRF repo and saw that it treats every space-separated word as a token. I think you can do that for Indic languages as well, but in that case you might not be able to use transfer learning (that is, the pretrained LMs). I might be wrong here; I'd need to dig deeper into the repo, but that's my impression from a quick glance.

The way I was thinking of tackling POS tagging is to use transfer learning after some pre-processing of the dataset: break every word down into its tokens (using the tokenizer we already have in iNLTK) and expand its tag into a matching sequence, <sometag1, sometag2, sometag3>, depending on the number of tokens the word is broken into. I think this will yield a better model and better results, but we should experiment.
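A minimal sketch of that pre-processing step, under some assumptions: `toy_tokenizer` is a hypothetical stand-in for the real subword tokenizer (it is not the iNLTK API), and repeating the word's tag on every piece is just one possible scheme for filling in `<sometag1, sometag2, sometag3>`.

```python
from typing import Callable, List, Tuple


def toy_tokenizer(word: str) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer.

    Splits a word into chunks of up to 3 characters and marks the first
    chunk with a sentencepiece-style "▁" prefix. A real setup would call
    the actual subword tokenizer here instead.
    """
    if not word:
        return []
    pieces = [word[i:i + 3] for i in range(0, len(word), 3)]
    pieces[0] = "▁" + pieces[0]
    return pieces


def expand_tags(
    words: List[str],
    tags: List[str],
    tokenize_word: Callable[[str], List[str]],
) -> Tuple[List[str], List[str]]:
    """Align word-level POS tags with subword pieces.

    Assumption: every piece of a word inherits that word's tag.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    sub_tokens: List[str] = []
    sub_tags: List[str] = []
    for word, tag in zip(words, tags):
        pieces = tokenize_word(word)
        sub_tokens.extend(pieces)
        sub_tags.extend([tag] * len(pieces))
    return sub_tokens, sub_tags


# Example with made-up words and tags
example_tokens, example_tags = expand_tags(
    ["abcdefg", "hi"], ["NOUN", "VERB"], toy_tokenizer
)
print(list(zip(example_tokens, example_tags)))
```

A different scheme (for example, tagging only the first piece and masking the rest) would only change the line that builds `sub_tags`; which one works better is something to experiment with.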

Let me know what your thoughts are. Thanks!

TviNet commented 5 years ago

I tried averaging the subtoken representations of each word and feeding the result to an LSTM+CRF, which gave decent results for Hindi (13k training sentences, 96.3% accuracy) but not for Tamil (400 training sentences, 87% accuracy). The other languages similarly have very few training samples.
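For reference, the averaging step could look roughly like the sketch below. This is not the code behind the numbers above; the function name and the plain-list vector format are assumptions made for illustration, and the LSTM+CRF that would consume the word vectors is left out.

```python
from typing import List, Sequence


def average_subtoken_vectors(
    vectors: Sequence[Sequence[float]],
    pieces_per_word: Sequence[int],
) -> List[List[float]]:
    """Collapse subtoken vectors into one vector per word by averaging.

    vectors: one embedding per subtoken, in sentence order.
    pieces_per_word: how many subtokens each word was split into.
    """
    if sum(pieces_per_word) != len(vectors):
        raise ValueError("piece counts do not match the number of vectors")
    word_vectors: List[List[float]] = []
    start = 0
    for count in pieces_per_word:
        if count <= 0:
            raise ValueError("every word needs at least one subtoken")
        group = vectors[start:start + count]
        dim = len(group[0])
        # Element-wise mean over the subtokens belonging to this word
        word_vectors.append(
            [sum(vec[d] for vec in group) / count for d in range(dim)]
        )
        start += count
    return word_vectors


# Example: three subtoken vectors, first word has two pieces, second has one
example = average_subtoken_vectors(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [2, 1]
)
print(example)
```

The output has one vector per word, so it lines up directly with the word-level tags in the Universal Dependencies data.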

goru001 commented 5 years ago

Yes, that's why I think using transfer learning is important here, especially for low-resource languages.

sarves commented 3 years ago

Hi,

In case you are interested in a BiLSTM-based Tamil POS tagger (developed using the Stanza framework), the relevant models and tagged data are available at https://github.com/sarves/thamizhi-pos

Sarves