Add Russian language for PunktSentenceTokenizer()

nltk / nltk_data


Add Russian language for PunktSentenceTokenizer() #118

Closed: Mottl closed this 5 years ago

Mottl commented 6 years ago

This commit adds Russian language support to PunktSentenceTokenizer(). The data was taken from three sources:

- Articles from Russian Wikipedia (about 1 million sentences);
- Common Russian abbreviations from the Russian orthographic dictionary edited by V. V. Lopatin;
- Generated name initials.

Testing showed that using params.abbrev_types alone performs better than combining it with params.collocations and params.ortho_context, so the latter two were removed from the trained tokenizer.
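For readers unfamiliar with Punkt parameters, here is a minimal sketch of a tokenizer that relies on an abbreviation list only. It is not the trained model from this PR: the two abbreviations and the sample sentence are made up for illustration, and the other parameter sets are simply left at their empty defaults.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Hypothetical two-entry abbreviation list; the real list in this PR is far larger.
# Punkt expects abbreviations lowercased and without the trailing period.
params = PunktParameters()
params.abbrev_types = {"ул", "проф"}
# collocations, sent_starters and ortho_context stay at their empty defaults.

tokenizer = PunktSentenceTokenizer(params)  # uses the abbreviation list
baseline = PunktSentenceTokenizer()         # empty parameters, for comparison

text = "Он живёт на ул. Ленина. Дом старый."
with_abbrevs = tokenizer.tokenize(text)
without_abbrevs = baseline.tokenize(text)

print(with_abbrevs)
print(without_abbrevs)
```

With the abbreviation list, "ул." should no longer be treated as a sentence end, so the street name stays in the same sentence; the baseline tokenizer has no such knowledge and is expected to split after it.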

Mottl commented 6 years ago

@stevenbird, please review

Mottl commented 5 years ago

@stevenbird, please review

Hiyorimi commented 5 years ago

@stevenbird, please review this PR, because many great projects (sumy, for instance) rely on the NLTK toolkit for NLP tasks.

I had to use sumy, and in order to use it with Russian, I monkey-patched the nltk_data container with the contents of Mottl's repo, and everything started working brilliantly.

karelin commented 5 years ago

@stevenbird Russian language support will be quite an important improvement. Thanks!!!

buriy commented 5 years ago

@stevenbird do you need a test case, maybe? I assume the difficulty is checking whether the commit works for you.

stevenbird commented 5 years ago

@Mottl, cc @Hiyorimi, @karelin, @buriy: thanks for your contribution, and sorry for the long delay.

Hiyorimi commented 5 years ago

Thank you!