Open · TviNet opened this issue 5 years ago
@TviNet Thanks for reaching out! I glanced over the LM-LSTM-CRF repo and saw that it treats every space-separated word as a token. I think you could do the same for Indic languages, but then you might not be able to use transfer learning (i.e., the pretrained LMs), since those operate over sentencepiece subword tokens. (I might be wrong here and need to dig deeper into the repo, but that's my impression from a quick glance.)
The way I was thinking of tackling POS is to use transfer learning after some pre-processing of the dataset: break every word down into its subword tokens (using the sentencepiece tokenizer we have in iNLTK) and map each word's tag onto all of its subtokens, roughly as in the sketch below.
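Something along these lines (a rough sketch, assuming iNLTK's `tokenize(text, language_code)` is available and returns the sentencepiece pieces for the input; `expand_to_subtokens` is just an illustrative helper name):

```python
# Rough sketch of the pre-processing idea: split each word into its
# sentencepiece subtokens and repeat the word-level tag on every piece.
# Assumes iNLTK is set up for the language (setup('hi') run once) and that
# tokenize(text, language_code) returns sentencepiece pieces for the input.
from inltk.inltk import tokenize

def expand_to_subtokens(words, tags, lang='hi'):
    sub_tokens, sub_tags = [], []
    for word, tag in zip(words, tags):
        pieces = tokenize(word, lang)          # subword pieces for this word
        sub_tokens.extend(pieces)
        sub_tags.extend([tag] * len(pieces))   # same POS tag on every piece
    return sub_tokens, sub_tags

# e.g. words = ['भारत', 'महान', 'है'], tags = ['PROPN', 'ADJ', 'AUX']
# becomes one (subtoken, tag) pair per sentencepiece piece, keeping alignment.
```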
Let me know what your thoughts are. Thanks!
I tried averaging the subtoken embeddings into word embeddings and then an LSTM+CRF, which gave decent results for Hindi (13k training sentences, 96.3% accuracy) but not for Tamil (400 training sentences, 87% accuracy). The other languages similarly have very few training samples.
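For reference, a minimal sketch of that averaging step and the tagger on top (PyTorch; the names, shapes, and the `word_spans` bookkeeping are illustrative, and a CRF such as `torchcrf.CRF` from the pytorch-crf package would decode the emissions):

```python
# Average the subtoken embeddings of each word into one word vector,
# then run a BiLSTM over the word vectors to get per-word tag scores.
import torch
import torch.nn as nn

def average_subtokens(subtoken_embs, word_spans):
    """subtoken_embs: (num_subtokens, emb_dim) tensor for one sentence.
    word_spans: list of (start, end) subtoken indices, one per word.
    Returns a (num_words, emb_dim) tensor of word embeddings."""
    return torch.stack([subtoken_embs[s:e].mean(dim=0) for s, e in word_spans])

class WordTagger(nn.Module):
    # BiLSTM over the averaged word vectors; a CRF would sit on the emissions.
    def __init__(self, emb_dim, hidden_dim, num_tags):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_embs):              # (batch, num_words, emb_dim)
        out, _ = self.lstm(word_embs)
        return self.emit(out)                  # (batch, num_words, num_tags)
```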
Yes, that's why I think using transfer learning is important here, especially for low-resource languages.
Hi,
In case you are interested in a BiLSTM-based Tamil POS tagger (developed using the Stanza framework): https://github.com/sarves/thamizhi-pos. You can find the relevant models and tagged data there.
Sarves
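For anyone who wants to try this quickly, a minimal sketch of Tamil POS tagging with Stanza, assuming the stock `ta` model; the thamizhi-pos models from the repo above would instead be loaded per that repo's own instructions:

```python
# Run Stanza's tokenizer and POS tagger on a Tamil sentence and print
# each word with its universal POS tag.
import stanza

stanza.download('ta')                                   # one-time model download
nlp = stanza.Pipeline(lang='ta', processors='tokenize,pos')

doc = nlp('அவள் பள்ளிக்குச் சென்றாள்.')
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.upos)
```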
https://universaldependencies.org/ has labelled data for parts of speech, dependencies, and morphology for Hindi, Sanskrit, Marathi, Tamil and Telugu. I plan on using an LM-LSTM-CRF architecture for sequence tagging. However, the language models in iNLTK use sentencepiece tokens. Could anyone guide me through using the existing LMs with word tokens, or do I need to retrain the word embeddings for word tokens?
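For anyone starting from the UD data, a small sketch for reading (word, UPOS) pairs out of a `.conllu` file; the filename in the usage comment is just the UD Hindi HDTB training split and is only illustrative:

```python
# Parse a Universal Dependencies .conllu file into sentences of
# (word form, UPOS tag) pairs, skipping comments and multiword/empty tokens.
def read_conllu(path):
    sentences, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:                       # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
            elif not line.startswith('#'):     # skip comment lines
                cols = line.split('\t')
                if cols[0].isdigit():          # skip multiword/empty-node rows
                    current.append((cols[1], cols[3]))   # FORM, UPOS
        if current:
            sentences.append(current)
    return sentences

# sentences = read_conllu('hi_hdtb-ud-train.conllu')
```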