ionuthulub closed this issue 5 years ago
I made a PR for the stop words.
Where can I get a pickle file for the Romanian language?
Do you guys have any answers?
Sorry, you'll need to work this out from the tokenizer documentation, and then submit a PR, referencing a pickled tokenizer file you want to contribute.
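For anyone landing here later, the training step could look roughly like the sketch below. It is not an official recipe. It assumes NLTK is installed and that you have a plain-text Romanian corpus as a single string. The helper names (`train_punkt_tokenizer`, `save_tokenizer`, `load_tokenizer`) and the output file name are made up for illustration; only `PunktTrainer`, `PunktSentenceTokenizer` and the standard `pickle` module are real APIs.

```python
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def train_punkt_tokenizer(text):
    """Train an unsupervised Punkt model on raw text and wrap it in a tokenizer."""
    trainer = PunktTrainer()
    # train() learns abbreviations, collocations and sentence starters
    # from the raw text; no sentence annotations are needed.
    trainer.train(text)
    return PunktSentenceTokenizer(trainer.get_params())


def save_tokenizer(tokenizer, path):
    """Pickle the trained tokenizer to `path` (hypothetical name, e.g. romanian.pickle)."""
    with open(path, "wb") as handle:
        pickle.dump(tokenizer, handle)


def load_tokenizer(path):
    """Load a pickled tokenizer back, e.g. to sanity-check it before a PR."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

To try it out, read your corpus into a string, call `train_punkt_tokenizer(corpus_text)`, save the result with `save_tokenizer`, then load it back and run `tokenize()` on a few held-out paragraphs. How well it splits depends heavily on corpus size and on how well the corpus covers Romanian abbreviations, so check the output by hand before submitting.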
Hello,
I'd like to add a list of stop words and a pickle file for the Punkt sentence tokenizer for Romanian. I have a few questions:
1) How large should the corpus I train the tokenizer on be? I see that the corpora for other languages average about 400,000 tokens. What is a token in this context, a word or a sentence?
2) Should I just add the pickle file to the archive and make a PR?
3) How do I contribute the stop words file?
Thanks, IH