goru001 / inltk

Natural Language Toolkit for Indic Languages aims to provide out of the box support for various NLP tasks that an application developer might need
https://inltk.readthedocs.io
MIT License
812 stars 164 forks

Integrating with HuggingFace Transformer #41

Open octalpixel opened 4 years ago

octalpixel commented 4 years ago

Hi, could you give me some insight into whether it is possible to plug inltk into the Hugging Face transformers library?

parmarsuraj99 commented 4 years ago

I was looking for the same thing. Maybe we can use multilingual transformers. But the question is how to tokenize Indian languages, which have a different structure. Is there any way to break them up for BPE? I am eager to work on this and contribute.
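On the BPE question above: byte-pair encoding learns its merges from raw symbol sequences, so it does not need script-specific rules to split Devanagari text. The sketch below is a simplified, pure-Python illustration of that idea, not the inltk API and not the Hugging Face tokenizers implementation. The function names (`learn_bpe`, `merge_pair`, `encode`) and the tiny Hindi word list are hypothetical and exist only for this example; real tokenizers add pre-tokenization, normalization, and special tokens on top of this.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy BPE trainer)."""
    # Each word starts as a tuple of Unicode code points, so consonants,
    # vowel signs (matras), and the virama are separate starting symbols.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; a real tokenizer would be trained on a large text dump.
corpus = ["भारत", "भारती", "भारतीय", "भाषा", "भाषाएँ"]
merges = learn_bpe(corpus, num_merges=3)

print(merges)
print(encode("भारत", merges))
print(encode("भारतवर्ष", merges))  # unseen word falls back to smaller pieces
```

Because the starting symbols are individual code points, an unseen word is never out of vocabulary: it simply decomposes into smaller learned pieces or single characters.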

goru001 commented 4 years ago

@octalpixel, @parmarsuraj99 Thanks for reaching out. Currently, it isn't straightforward/possible to integrate it with the transformers library. I'll be happy to have contributions from the community to help with it.

parmarsuraj99 commented 4 years ago

So we just need a tokenizer trained separately on Indian languages, and then we plug it directly into an LM? Maybe a SentencePiece tokenizer trained on Hindi, attached to Hugging Face BERT. Should I go this way?

goru001 commented 4 years ago

@parmarsuraj99 Yes, you can use SentencePiece or Hugging Face's tokenizers library (https://github.com/huggingface/tokenizers). I've been working on training a Hindi BERT model using the tokenizers and transformers libraries from Hugging Face.

parmarsuraj99 commented 4 years ago

@goru001 I am really excited to work on that. I believe a trained Hindi model would be really efficient at grasping other regional languages as well, as most are similar. Really looking forward to it.