@the-ethan-hunt let's avoid ancient languages and instead focus on benchmarking existing methods on standardized corpora?
E.g., for tokenization/chunking in Hindi you could try the MosesTokenizer, spaCy-multilingual, Google's BERT tokenizer (WordPiece under the hood), and SentencePiece.
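Not something this thread prescribes, just a minimal comparison sketch, assuming `sacremoses`, `spacy`, and `transformers` are installed; the checkpoint name is the usual public multilingual BERT, and the sample sentence is made up for illustration:

```python
# Sketch: compare a few off-the-shelf tokenizers on one Hindi sentence.
from sacremoses import MosesTokenizer
import spacy
from transformers import AutoTokenizer

sentence = "मुझे हिंदी में टोकनाइज़ेशन का परीक्षण करना है।"

# Rule-based Moses tokenization; 'hi' falls back to generic rules where
# Hindi-specific ones are missing.
moses = MosesTokenizer(lang="hi")
print("Moses:", moses.tokenize(sentence))

# spaCy's blank Hindi pipeline: rule-based tokenization only, no trained models.
nlp = spacy.blank("hi")
print("spaCy:", [t.text for t in nlp(sentence)])

# WordPiece subwords via the public multilingual BERT checkpoint.
bert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print("WordPiece:", bert.tokenize(sentence))
```

(SentencePiece is omitted here only because it first needs a model trained on a Hindi corpus.)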
I'd love it if we could raise a PR in Cython with Hindi models for part-of-speech tagging, tokenization, and NER, using the datasets from CLTK and elsewhere (e.g. the IIT Bombay English-Hindi Parallel Corpus). The Cython PR can then be integrated with spaCy.
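As a rough illustration of the spaCy side (using the modern spaCy v3 training API rather than a hand-written Cython component, and with a single toy sentence standing in for real CLTK/IITB data):

```python
# Sketch: bootstrap a trainable Hindi NER pipe in spaCy; the training
# example below is a toy stand-in for real annotated data.
import spacy
from spacy.training import Example

nlp = spacy.blank("hi")
ner = nlp.add_pipe("ner")
ner.add_label("PER")

# (text, annotations) pairs; entity offsets are character indices.
TRAIN_DATA = [
    ("महात्मा गांधी भारत में जन्मे थे।", {"entities": [(0, 13, "PER")]}),
]

optimizer = nlp.initialize()
for _ in range(20):  # a few passes over the toy data
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("महात्मा गांधी")
print([(ent.text, ent.label_) for ent in doc.ents])
```

The same pattern extends to a "tagger" pipe once POS-annotated Hindi data is available.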
I'm happy to sponsor up to $100 personally, specifically for making new datasets in Indian languages. I can possibly arrange corporate sponsorship of up to $1000 for this if we make enough progress in the next 4 weeks.
Please, please don't implement this particular paper. It's a bad, unreliable paper.
I'd recommend emailing the author for datasets, trying more modern approaches from, say, spaCy and AllenNLP, and sharing those results.
@NirantK, I saw this paper on NLPProgress and wondered about reproducing it in Sanskrit. Anyway, I will change it to Hindi as you have suggested.
I have tried to research this, @NirantK, and have found some solutions; would it be better if we discussed this in private?
@the-ethan-hunt Write to [awesomenlp] [at] [nirantk.com]?
Develop models for Hindi from the IIT Bombay English-Hindi Parallel Corpus using Cython/spaCy-multilingual.
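A short loading sketch, assuming the training split has been downloaded and unpacked as two line-aligned plain-text files; the file names below match the commonly distributed release, so adjust them if your copy differs:

```python
# Sketch: pair up the English/Hindi sides of the IIT Bombay parallel corpus.
from itertools import islice

EN_FILE = "IITB.en-hi.en"  # assumed name: one English sentence per line
HI_FILE = "IITB.en-hi.hi"  # assumed name: the aligned Hindi sentence per line

with open(EN_FILE, encoding="utf-8") as en, open(HI_FILE, encoding="utf-8") as hi:
    for en_line, hi_line in islice(zip(en, hi), 5):
        print(en_line.strip(), "|||", hi_line.strip())
```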