@the-ethan-hunt let's avoid ancient languages and instead focus on benchmarking existing methods on standardized corpora?
E.g., for tokenization/chunking in Hindi you could try the MosesTokenizer, spaCy-multilingual, Google's BERT tokenizer (WordPiece under the hood), and SentencePiece.
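Not something this thread prescribes, just a minimal comparison sketch, assuming `sacremoses`, `spacy`, and `transformers` are installed; the checkpoint name is the usual public multilingual BERT, and the sample sentence is made up for illustration:

```python
# Sketch: compare a few off-the-shelf tokenizers on one Hindi sentence.
from sacremoses import MosesTokenizer
import spacy
from transformers import AutoTokenizer

sentence = "मुझे हिंदी में टोकनाइज़ेशन का परीक्षण करना है।"

# Rule-based Moses tokenization; 'hi' falls back to generic rules where
# Hindi-specific ones are missing.
moses = MosesTokenizer(lang="hi")
print("Moses:", moses.tokenize(sentence))

# spaCy's blank Hindi pipeline: rule-based tokenization only, no trained models.
nlp = spacy.blank("hi")
print("spaCy:", [t.text for t in nlp(sentence)])

# WordPiece subwords via the public multilingual BERT checkpoint.
bert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print("WordPiece:", bert.tokenize(sentence))
```

(SentencePiece is omitted here only because it first needs a model trained on a Hindi corpus.)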
I'd love it if we could raise a PR in Cython with Hindi models for part-of-speech tagging, tokenization, and NER, using the datasets from CLTK and elsewhere (e.g. the IIT Bombay English-Hindi Parallel Corpus). The Cython PR can then be integrated with spaCy.
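As a rough illustration of the spaCy side (using the modern spaCy v3 training API rather than a hand-written Cython component, and with a single toy sentence standing in for real CLTK/IITB data):

```python
# Sketch: bootstrap a trainable Hindi NER pipe in spaCy; the training
# example below is a toy stand-in for real annotated data.
import spacy
from spacy.training import Example

nlp = spacy.blank("hi")
ner = nlp.add_pipe("ner")
ner.add_label("PER")

# (text, annotations) pairs; entity offsets are character indices.
TRAIN_DATA = [
    ("महात्मा गांधी भारत में जन्मे थे।", {"entities": [(0, 13, "PER")]}),
]

optimizer = nlp.initialize()
for _ in range(20):  # a few passes over the toy data
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("महात्मा गांधी")
print([(ent.text, ent.label_) for ent in doc.ents])
```

The same pattern extends to a "tagger" pipe once POS-annotated Hindi data is available.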
I'm happy to sponsor up to $100 personally, specifically for making new datasets in Indian languages. I can possibly arrange corporate sponsorship of up to $1000 for this if we make enough progress in the next 4 weeks.
Please, please don't implement this particular paper. It's a bad, unreliable paper.
I'd recommend emailing the author for datasets, trying more modern approaches from, say, spaCy and AllenNLP, and sharing those results.
@NirantK, I saw this paper on NLPProgress and wondered about reproducing it in Sanskrit. Anyway, I will change it to Hindi as you have suggested.
I have tried to research this, @NirantK, and have found some solutions; would it be better if we discussed this in private?
@the-ethan-hunt Write to [awesomenlp] [at] [nirantk.com]?
Develop models for Hindi from the IIT Bombay English-Hindi Parallel Corpus using Cython/spaCy-multilingual.
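A short loading sketch, assuming the training split has been downloaded and unpacked as two line-aligned plain-text files; the file names below match the commonly distributed release, so adjust them if your copy differs:

```python
# Sketch: pair up the English/Hindi sides of the IIT Bombay parallel corpus.
from itertools import islice

EN_FILE = "IITB.en-hi.en"  # assumed name: one English sentence per line
HI_FILE = "IITB.en-hi.hi"  # assumed name: the aligned Hindi sentence per line

with open(EN_FILE, encoding="utf-8") as en, open(HI_FILE, encoding="utf-8") as hi:
    for en_line, hi_line in islice(zip(en, hi), 5):
        print(en_line.strip(), "|||", hi_line.strip())
```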