goru001 / inltk

Natural Language Toolkit for Indic Languages aims to provide out-of-the-box support for various NLP tasks that an application developer might need.
MIT License

Integrating with HuggingFace Transformer #41

Open octalpixel opened 4 years ago

octalpixel commented 4 years ago

Hi, could you give me some insight into whether it is possible to plug inltk into the Hugging Face Transformers library?

parmarsuraj99 commented 4 years ago

I was looking for the same thing. Maybe we can use multilingual transformers. But the question is how to tokenize Indian languages, which have a different structure. Is there any way to break them into subword units for BPE? I am eager to work on this and contribute.

goru001 commented 4 years ago

@octalpixel, @parmarsuraj99 Thanks for reaching out. Currently, it isn't straightforward/possible to integrate it with the transformers library. I'd be happy to have contributions from the community to help with it.

parmarsuraj99 commented 4 years ago

So we just need a tokenizer trained separately on Indian languages, which we then plug directly into an LM? Maybe a SentencePiece tokenizer trained on Hindi, attached to Hugging Face BERT. Should I go this way?

goru001 commented 4 years ago

@parmarsuraj99 Yes, you can use sentencepiece or Hugging Face's tokenizers library (https://github.com/huggingface/tokenizers). I've been working on training a Hindi BERT model using the tokenizers and transformers libraries from Hugging Face.
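
For anyone wondering how BPE copes with Devanagari at all, here is a rough, library-free sketch of the merge-learning idea. It is only a toy illustration, not the sentencepiece or tokenizers API, and the names `learn_bpe`, `merge_pair`, and `segment` are made up for this example. It assumes whitespace-separated words and treats each Unicode code point (including vowel signs and the virama) as a starting symbol.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of sentences."""
    # Each word starts as a tuple of single code points, weighted by its frequency.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word before the next round.
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def segment(word, merges):
    """Split `word` into subword units by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Tiny made-up corpus, purely for illustration.
    toy_corpus = ["नमस्ते नमस्ते दुनिया", "नमस्ते भारत"]
    rules = learn_bpe(toy_corpus, num_merges=50)
    print(segment("नमस्ते", rules))
    print(segment("हिंदी", rules))  # unseen word falls back to smaller pieces
```

Because the merges work on Unicode code points, nothing here is specific to Latin script. A real setup would train on a large corpus with sentencepiece or tokenizers rather than a hand-rolled loop like this.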

parmarsuraj99 commented 4 years ago

@goru001 I am really excited to work on that. I believe a trained Hindi model would also pick up other regional languages efficiently, since most of them are similar. Really looking forward to it.