thomjur / PyCollocation

Python module to do simple collocation analysis of a corpus.
GNU General Public License v3.0

Program Structure #15

Open thomjur opened 2 years ago

thomjur commented 2 years ago

@trutzig89182

I have started thinking about the general structure based on the comments you made. Here is a first draft:

[Diagram: PyCollocation.drawio — first draft of the program structure]

So, there are basically two ways one could interact with this program:

  1. Via CLI with the help of arguments.
  2. By importing the module into Python and using the start_collocation_analysis() function.
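Here is a minimal usage sketch of option 2. It assumes the function lives in analysis.py (since the CLI below calls `python analysis.py`); the keyword arguments are only illustrative assumptions, not the final signature:

```python
# Sketch only: importing the module and running an analysis programmatically.
# The parameter names (keyword, left_window, right_window) are assumptions.
from analysis import start_collocation_analysis

docs = [
    "this is the first document about collocation analysis",
    "a second document mentioning collocation again",
]

results = start_collocation_analysis(
    docs,
    keyword="collocation",
    left_window=3,
    right_window=3,
)
print(results)
```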

What do you think?

Here is the draw.io file, which you can also change. We can also use a different program; this is only a first draft.

https://1drv.ms/u/s!AjzmoTNnf_mknqAUBlIqx-OB5jdLAg?e=5WtRyu

thomjur commented 2 years ago

I have started restructuring the program accordingly. I also needed to change the tests, they seem to work.

thomjur commented 2 years ago

Okay, last change for today: one can now choose between different doc_type options. The default is "iterable" (for instance, a list with docs, as you created in test.py). If you want to pass a single document only, you need to set doc_type="single" (I changed test.py where you passed a single doc, and it seems to be working). Particularly important for the CLI is doc_type="folder": here, you can just pass the relative folder path with the docs and it should work. I have tested it via CLI with:

```
python analysis.py corpora test 3 3 mu "csv"
```

It worked and created a csv file with the results in my folder. Note that I have also uploaded the corpora folder that includes your examples in single files.
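For illustration, here is a rough sketch of how the three doc_type modes could be normalized into a single list of documents before the analysis starts. The helper name and details are not taken from the actual code, just an assumption of how it might look:

```python
import os
from typing import List


def load_documents(source, doc_type: str = "iterable") -> List[str]:
    """Hypothetical helper: normalize the three doc_type modes to a list of docs."""
    if doc_type == "iterable":
        # e.g. a list of document strings, as used in test.py
        return list(source)
    if doc_type == "single":
        # a single document passed as one string
        return [source]
    if doc_type == "folder":
        # read every file in a (relative) folder path, one doc per file
        docs = []
        for name in sorted(os.listdir(source)):
            path = os.path.join(source, name)
            if os.path.isfile(path):
                with open(path, encoding="utf-8") as f:
                    docs.append(f.read())
        return docs
    raise ValueError(f"unknown doc_type: {doc_type}")
```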

trutzig89182 commented 2 years ago

Looks great, and the core structure seems right to me. I am currently still trying to find out how best to use it with Twitter jsonl files, but that is basically one special case of importing PyCollocation and does not change anything in the core structure. Once we work on it, stop word lists could be another thing to hand over to the package. I think there would be two ways of doing it: either excluding stop words when collecting the collocations, or excluding them when returning the results. My first impulse is that the second option would be better, as it keeps the core function less complex.
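If we go with the second option, the filtering could be as simple as dropping stop-word rows from the collocate counts just before the results are written. A sketch, assuming the collocates are kept in a Counter (the names are placeholders, not the actual ones in the code):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of"}  # example list only


def filter_results(collocate_counts: Counter, stop_words=STOP_WORDS) -> Counter:
    """Remove stop words from the final results table, not from the counting itself."""
    return Counter({word: count
                    for word, count in collocate_counts.items()
                    if word not in stop_words})
```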

thomjur commented 2 years ago

Yes, this sounds more reasonable than checking the whole stop word list for every single word. Regarding the jsonl file: since this is our program, you can also implement a special "jsonl" option for doc_type, if you want. This would be very specific (since it would be for Twitter JSON only, I guess)... but why not. Otherwise, it should theoretically work by passing an iterator (class) with doc_type="iterable" that iterates over the jsonl files. An example can be found here in the gensim documentation (section "Training your own model"): https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
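A sketch of such an iterator class, assuming one JSON object per line with the tweet text in a "text" field (the field name, file layout, and class name are assumptions for illustration only):

```python
import json
import os


class TweetCorpus:
    """Iterate over all .jsonl files in a folder and yield one tweet text per line."""

    def __init__(self, folder: str):
        self.folder = folder

    def __iter__(self):
        for name in sorted(os.listdir(self.folder)):
            if not name.endswith(".jsonl"):
                continue
            with open(os.path.join(self.folder, name), encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        yield json.loads(line)["text"]


# could then be passed as a normal iterable:
# start_collocation_analysis(TweetCorpus("tweets/"), ...)  # with doc_type="iterable"
```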

/EDIT: Ah, I am unsure how the counting with/without stop words works. Are they also excluded from the total word count? This would be important to know, because if so, deleting the corresponding rows in the final results table would be too late.

trutzig89182 commented 2 years ago

The total word count is handled via full_counter, right? Then it would contain the stop words. One way of excluding them from a final file (if that is wanted) could be to get the value for each stop word before deleting the item and to add these values up during the process. That would also allow us to print out how many words were excluded via the stop list.
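A sketch of that idea: removing the stop words from the counter while keeping a tally of how many tokens were excluded (the counter and function names are assumptions):

```python
from collections import Counter


def remove_stop_words(full_counter: Counter, stop_words):
    """Drop stop words from the counter and report how many tokens that removed."""
    cleaned = Counter(full_counter)
    excluded_total = 0
    for word in stop_words:
        if word in cleaned:
            excluded_total += cleaned.pop(word)
    return cleaned, excluded_total


# cleaned, excluded = remove_stop_words(full_counter, {"the", "a", "of"})
# print(f"{excluded} tokens were excluded via the stop list")
```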

But perhaps it would also be better to have two different kinds of stop lists? If we want to exclude punctuation and links from being counted, it would make sense to apply this within the function. If it is about excluding actual words without any expected keyness, we would still want them to count as words for defining the 3-token range, wouldn't we? So this would probably be a reason to exclude them after gathering the collocations.

Probably the most difficult part is to have a look at what stop words mean for the statistical measures you started to include.

thomjur commented 2 years ago

The problem is that I am not a trained (computational) linguist either, and I am not sure which procedure is most common. I think it might be reasonable to leave as many words "in" as possible; otherwise it might be strange if the word counts differ significantly from the number of words in the actual corpus. I think your initial idea to just ignore the stop words in the results table sounds best, but we can check that. Also, I oftentimes take care of deleting the stop words before I feed the documents into a program. Punctuation: I think our current procedure already ignores punctuation... I thought this makes sense, but maybe I am wrong...