thomjur / PyCollocation

Python module to do simple collocation analysis of a corpus.
GNU General Public License v3.0

implementing stop words #17

Open trutzig89182 opened 2 years ago

trutzig89182 commented 2 years ago

Yes, this sounds more reasonable than checking the whole stop word list for every single word. Regarding the jsonl file: since this is our program, you could also implement a dedicated "jsonl" option for doc_type if you want. That would be very specific (it would presumably cover Twitter JSON only), but why not. Otherwise, it should work by passing an iterator class with doc_type="iterable" that iterates over the jsonl files (a sketch follows below the quote). An example can be found in the gensim documentation (section "Training your own model"): https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py

/EDIT: Ah, I am unsure how the counting with/without stop words works. Are stop words also excluded from the total word count? This is important to know: if they are, deleting the corresponding rows in the final results table would come too late.

Originally posted by @thomjur in https://github.com/thomjur/PyCollocation/issues/15#issuecomment-1031199158
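
A minimal sketch of such an iterator, assuming each line of the jsonl file is one tweet object with a "text" field (the field name is an assumption; adjust it for other layouts) and that doc_type="iterable" accepts any iterable of document strings:

```python
import json

class JsonlCorpus:
    """Yield one tweet text per line of a Twitter JSONL file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue          # skip blank lines
                tweet = json.loads(line)
                yield tweet["text"]   # assumed key; depends on the export format

# corpus = JsonlCorpus("tweets.jsonl")  # then pass with doc_type="iterable"
```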

trutzig89182 commented 2 years ago

The total word count is handled via full_counter, right? Then it would contain the stop words. One way of excluding them from a final file (if that is wanted) would be to record the count for each stop word before deleting its entry and to sum those counts during this process. That would also allow us to report how many words were excluded via the stop list.
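
A sketch of that bookkeeping, assuming full_counter is a collections.Counter over all tokens (the variable name comes from the thread; everything else here is an assumption):

```python
from collections import Counter

def remove_stop_words(full_counter: Counter, stop_words) -> int:
    """Delete stop words from the counter and return how many
    tokens were removed, so the exclusion can be reported."""
    removed = 0
    for word in stop_words:
        if word in full_counter:
            removed += full_counter[word]  # record the count ...
            del full_counter[word]         # ... before deleting the entry
    return removed

# excluded = remove_stop_words(full_counter, {"the", "and", "of"})
# print(f"{excluded} tokens excluded via the stop list")
```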

But perhaps it would also be better to have two different kinds of stop lists? If we want to exclude punctuation and links from being counted at all, it would make sense to apply that filter within the counting function. If it is about excluding actual words without any expected keyness, we would still want them to count as words when defining the 3-token window, wouldn't we? So that would be a reason to exclude them only after gathering the collocations, as sketched below.
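
To illustrate the distinction, a sketch with two lists: a pre-filter that drops tokens (links, punctuation) before the window is built, and a post-filter that keeps tokens in the window but hides them from the results afterwards. All names here are hypothetical:

```python
import re

# Pre-filter: links and pure punctuation never occupy a window slot.
PRE_FILTER = re.compile(r"^(?:https?://\S+|\W+)$")

# Post-filter: counted inside the 3-token window, dropped from the results.
POST_FILTER = {"the", "a", "of"}

def preprocess(tokens):
    """Remove links/punctuation before collocation counting."""
    return [t for t in tokens if not PRE_FILTER.match(t)]

def filter_results(results):
    """Drop stop words from the final table only; they still counted
    as tokens when the window was applied."""
    return {word: counts for word, counts in results.items()
            if word not in POST_FILTER}
```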

Probably the most difficult part is working out what removing stop words means for the statistical measures you have started to include.
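
One concrete way the choice matters, assuming pointwise mutual information (PMI) is among the measures (the thread does not say which ones are implemented): PMI depends on the total token count, so removing stop words before versus after counting shifts every score, even for pairs that contain no stop word.

```python
import math

def pmi(cooc, freq_node, freq_coll, n_tokens):
    """PMI = log2( P(node, collocate) / (P(node) * P(collocate)) ).
    Shrinking n_tokens (e.g. by excluding stop words from the total)
    lowers every PMI score, including for stop-word-free pairs."""
    return math.log2((cooc * n_tokens) / (freq_node * freq_coll))

# Same pair, different totals:
# pmi(10, 50, 40, 100_000)  ->  log2(500) ~= 8.97
# pmi(10, 50, 40,  80_000)  ->  log2(400) ~= 8.64
```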

Originally posted by @trutzig89182 in https://github.com/thomjur/PyCollocation/issues/15#issuecomment-1031432205

thomjur commented 2 years ago

The problem is that I am not a trained (computer) linguist either, and I am not sure which procedure is most common. I think it is reasonable to leave as many words "in" as possible; otherwise it might be strange if the word counts differ significantly from the number of words in the actual corpus. Your initial idea to simply ignore the stop words in the results table sounds best to me, but we can check that. Also, I often take care of deleting the stop words before I feed the documents into a program. Punctuation: I think our current procedure already ignores punctuation. I thought this makes sense, but maybe I am wrong.
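
For the record, the pre-cleaning mentioned above can be as simple as the following sketch. The stop list and the whitespace tokenization are placeholders, and how the cleaned documents are then passed to PyCollocation depends on its API:

```python
STOP_WORDS = {"the", "and", "of", "to"}  # placeholder list

def clean(document: str) -> str:
    """Drop stop words before the document ever reaches the
    collocation pipeline, so all internal counts agree."""
    tokens = document.lower().split()  # naive whitespace tokenization
    return " ".join(t for t in tokens if t not in STOP_WORDS)

# docs = [clean(d) for d in raw_documents]
```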