thomjur / PyCollocation

Python module to do simple collocation analysis of a corpus.
GNU General Public License v3.0

Add tokenizer #3

Closed trutzig89182 closed 2 years ago

trutzig89182 commented 2 years ago

While writing the unit tests, I noticed that punctuation marks are treated as part of the words in our first, very rough word separation with word_list = document.split(" ").

So implementing a tokenizer is an important next step. You already mentioned the NLTK tokenizer in the script.
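
A quick illustration of the problem (the example text is hypothetical, not from the module's test data):

```python
from nltk.tokenize import word_tokenize

document = "Collocation analysis is fun, isn't it?"

# Splitting on spaces keeps punctuation attached to the tokens.
print(document.split(" "))
# ['Collocation', 'analysis', 'is', 'fun,', "isn't", 'it?']

# NLTK's tokenizer separates punctuation into its own tokens.
print(word_tokenize(document))
# ['Collocation', 'analysis', 'is', 'fun', ',', 'is', "n't", 'it', '?']
```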

thomjur commented 2 years ago

I agree! I will do that later today. Idea: add a parameter to indicate whether we are dealing with tweets or normal texts, in order to switch between Tokenizer() and TweetTokenizer().
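
A minimal sketch of what that switch could look like, assuming a boolean tweet_mode flag (the function and parameter names here are illustrative, not the module's actual API):

```python
from nltk.tokenize import TweetTokenizer, word_tokenize


def tokenize(document: str, tweet_mode: bool = False) -> list:
    """Tokenize a document.

    Uses TweetTokenizer for tweets (keeps hashtags, @mentions and
    emoticons as single tokens) and word_tokenize for normal texts.
    """
    if tweet_mode:
        return TweetTokenizer().tokenize(document)
    return word_tokenize(document)


# Example: TweetTokenizer keeps "@thomjur" and "#DH" intact.
print(tokenize("Great talk @thomjur! #DH", tweet_mode=True))
```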

thomjur commented 2 years ago

I have just implemented a simple tokenizer that can also tokenize tweets if needed. The default is the standard nltk.tokenize.word_tokenize. It is important to install the required NLTK packages first (see README.md).
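
For reference, the NLTK data that word_tokenize depends on can be fetched like this (the exact resource names may vary by NLTK version; the README.md instructions are authoritative):

```python
import nltk

# word_tokenize relies on the Punkt sentence tokenizer models.
nltk.download("punkt")
```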

thomjur commented 2 years ago

Seems to be okay for the moment.