nltk / nltk_data

NLTK Data

Adding some resources for the Romanian language #79

Closed ionuthulub closed 5 years ago

ionuthulub commented 7 years ago

Hello,

I'd like to add a list of stop words and also a pickle file for the punkt sentence tokenizer for Romanian. I have a few questions:

1) How large should the corpus on which I train the tokenizer be? I see that the average corpus for other languages is about 400,000 tokens. What is a token in this context, a word or a sentence?
2) Should I just add the pickle file to the archive and make a PR?
3) How do I contribute the stop words file?

Thanks, IH

ionuthulub commented 7 years ago

I made a PR for the stop words.
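For context, the stopwords corpus in nltk_data is a set of plain-text files, one word per line, with one file per language under `corpora/stopwords/`; a contributed Romanian list follows the same format. A minimal sketch (the word list here is only an illustrative subset, and the filename mirrors the per-language naming convention):

```python
# Sketch: write a Romanian stop-word list in the plain-text,
# one-word-per-line format used by the nltk_data stopwords corpus.
# The words below are an illustrative subset, not the full list.
words = ["și", "sau", "dar", "pe", "la"]

with open("romanian", "w", encoding="utf-8") as f:
    f.write("\n".join(words) + "\n")

# Once a list like this is merged and fetched via nltk.download('stopwords'),
# it is read with:
#   from nltk.corpus import stopwords
#   stopwords.words('romanian')
# Locally, the file parses back the same way the corpus reader would:
with open("romanian", encoding="utf-8") as f:
    loaded = [line.strip() for line in f if line.strip()]

print(loaded)
```

The flat one-word-per-line format is what keeps such a contribution easy to review in a PR: it diffs cleanly and needs no serialization step.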

vitaly-zdanevich commented 6 years ago

Where can I get pickle for the Romanian language?

Alegzandra commented 5 years ago

Do you guys have any answers?

stevenbird commented 5 years ago

Sorry, you'll need to work this out from the tokenizer documentation, and then submit a PR, referencing a pickled tokenizer file you want to contribute.
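For anyone landing here with the same question: a minimal sketch of training and pickling a Punkt sentence tokenizer with NLTK's `PunktTrainer`. The tiny inline text is a stand-in; a real submission would train on a large plain-text Romanian corpus (on the order of the ~400,000 tokens mentioned above), and the output filename here just mirrors the naming of the pickles already shipped in `tokenizers/punkt/`.

```python
# Minimal sketch: train a Punkt sentence tokenizer on raw text and pickle it.
# The inline training text is a toy stand-in for a large Romanian corpus.
import pickle
from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

training_text = (
    "Acesta este un exemplu. Punkt învață din text neadnotat. "
    "Antrenarea reală necesită un corpus mult mai mare."
)

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # also learn collocations and abbreviations
trainer.train(training_text)

# Build a tokenizer from the learned parameters and serialize it the same
# way the tokenizers shipped in nltk_data are stored.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
with open("romanian.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

# Sanity check: the pickle round-trips and splits sentences.
with open("romanian.pickle", "rb") as f:
    restored = pickle.load(f)

sentences = restored.tokenize("Prima propoziție. A doua propoziție.")
print(sentences)
```

Punkt is unsupervised, so raw unannotated text is enough; the quality of the learned abbreviation and collocation statistics is what the corpus size question above is really about.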