goru001 / inltk

Natural Language Toolkit for Indic Languages aims to provide out of the box support for various NLP tasks that an application developer might need
https://inltk.readthedocs.io
MIT License
824 stars 163 forks

Hindi NER Support for Inltk #43

Open avinsit123 opened 4 years ago

avinsit123 commented 4 years ago

We are currently working on a research project for NER in Hindi. We would like to extend our code and work to add support for Hindi NER in iNLTK. Our current model (Embeddings -> LSTM -> CRF) is trained on this dataset http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=2 with 14 tags and has an accuracy of around 70%. We are currently trying to increase the accuracy of the model. Do you have any contribution guidelines for the project, or any specifics you would like in the NER model? Either way, we are really interested in contributing to the project.
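For readers unfamiliar with the last stage of an Embeddings -> LSTM -> CRF pipeline, here is a minimal sketch of how a linear-chain CRF layer picks the final tag sequence (Viterbi decoding). It is not the code from this project: the function name `viterbi_decode`, the three-tag set, and all scores are made up for illustration, and the per-token emission scores are assumed to come from the LSTM.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence under a linear-chain CRF.

    emissions[t][tag] is the per-token score (assumed to come from the LSTM).
    transitions[(prev, cur)] is the tag-to-tag score; missing pairs score 0.0.
    """
    if not emissions:
        return []

    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag] = previous tag on that best path

    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example: illustrative scores only, not taken from any trained model.
TAGS = ["O", "B-PER", "I-PER"]
TRANSITIONS = {("O", "I-PER"): -10.0}  # an I- tag should not follow O
EMISSIONS = [
    {"O": 2.0, "B-PER": 1.5, "I-PER": 0.0},
    {"O": 0.5, "B-PER": 0.2, "I-PER": 2.0},
]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
```

Picking the best tag per token independently would give `O` followed by `I-PER` here; the transition penalty is what lets the CRF prefer a well-formed sequence instead.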

goru001 commented 4 years ago

@avinsit123 Thanks for reaching out. It would be great to integrate your work into the iNLTK library.

In order to add support for Hindi NER, it would be great if you can:

  1. Open source your work with links to the train/test data, approach, trained model and scripts to reproduce the results. Once you do this, I would like to take a look at it and then we'll take it from there.
  2. Do you also want to support training of the model through iNLTK on custom data in addition to exposing the static model trained on IJCNLP dataset? If we want to do this, we'll have to think through this a bit more - happy to hear what your thoughts are.

Let me know what you think.

avinsit123 commented 4 years ago

@goru001 We will mail you the required material mentioned above once we have finished refining the model. So far we have trained our model with several embeddings, e.g. fastText, RoBERTa, etc., using the flair NLP library. It would also be great to add support in iNLTK so that users can train their own custom NER models.

goru001 commented 4 years ago

@avinsit123 Sure, will wait for your mail. Thanks!

octalpixel commented 4 years ago

@avinsit123 Do you have any resources where I can get a similar NER dataset for Tamil?

anuragshas commented 4 years ago

@avinsit123 How about using word-level iNLTK embeddings and then XGBoost to classify the tokens?
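A rough sketch of what the feature side of that idea could look like: each token gets one fixed-length row built from its own embedding plus those of its neighbours, and those rows would then be fed to a gradient-boosted classifier. Everything here is an assumption for illustration: `window_features` is a hypothetical helper, the 2-dimensional vectors are toy values rather than real iNLTK embeddings, and the classifier step is left out.

```python
from typing import Dict, List


def window_features(
    tokens: List[str],
    embeddings: Dict[str, List[float]],
    dim: int,
    window: int = 1,
) -> List[List[float]]:
    """Build one feature row per token from a context window of embeddings.

    Positions outside the sentence and tokens without an embedding are
    filled with a zero vector, so every row has length (2 * window + 1) * dim.
    """
    zero = [0.0] * dim
    rows = []
    for i in range(len(tokens)):
        row: List[float] = []
        for j in range(i - window, i + window + 1):
            if 0 <= j < len(tokens):
                row.extend(embeddings.get(tokens[j], zero))
            else:
                row.extend(zero)  # padding before/after the sentence
        rows.append(row)
    return rows


# Toy 2-dimensional vectors (hypothetical; real embeddings are much larger).
toy_embeddings = {
    "राम": [1.0, 2.0],
    "दिल्ली": [3.0, 4.0],
}
features = window_features(["राम", "दिल्ली", "गए"], toy_embeddings, dim=2)
```

Each row in `features` lines up with one token and its tag, which is the shape a per-token classifier expects. Unlike a CRF, such a classifier labels every token independently, so the context window is the only way neighbouring words can influence a prediction.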