Closed Mottl closed 5 years ago
@stevenbird, please review
@stevenbird, please review this PR; many great projects (sumy, for instance) rely on NLTK for NLP tasks.
I had to use sumy, and to use it with Russian I monkey-patched the nltk_data container with the contents of Mottl's repo. Everything started working brilliantly.
@stevenbird, Russian language support will be quite an important improvement. Thanks!
@stevenbird, do you need a test case, maybe? I assume the difficulty is checking whether the commit works for you.
@Mottl, cc @Hiyorimi, @karelin, @buriy: thanks for your contribution and sorry for the long delay
Thank you!
This commit adds Russian language support to `PunktSentenceTokenizer()`. Data was taken from three sources:

- Articles from Russian Wikipedia (about 1 million sentences);
- Common Russian abbreviations from the Russian orthographic dictionary edited by V. V. Lopatin;
- Generated name initials.

After some research, it was found that `params.abbrev_types` alone performs better than in combination with `params.collocations` and `params.ortho_context`, so the latter two were removed from the trained tokenizer.
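
For reviewers who want to see what an abbreviation-only Punkt model looks like in practice, here is a minimal sketch. It does not load the trained Russian model from this PR. Instead it builds a `PunktParameters` object by hand with a tiny, hypothetical abbreviation set (`г`, `ул`, `им`), leaving collocations and orthographic context at their empty defaults, and passes it to `PunktSentenceTokenizer`. The sample sentence is also made up for illustration.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Hypothetical, hand-picked abbreviations (stored lowercase, without the
# trailing period). The real model in this PR was trained on far more data.
params = PunktParameters()
params.abbrev_types = {"г", "ул", "им"}

# collocations and ortho_context are left at their empty defaults, which
# mirrors the abbreviation-only setup described above.
tokenizer = PunktSentenceTokenizer(params)

# Illustrative text: "г." (short for "year") must not end the sentence.
text = "Он родился в 1990 г. в Москве. Потом переехал."
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```

With these assumed parameters, the period after `г` is treated as part of an abbreviation rather than a sentence boundary, so the text is split only after "Москве." A handful of sentences like this one, with known boundaries, could also serve as a small regression test for the trained model.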