Statistics - Githubissues

thomjur commented 2 years ago

@trutzig89182

I have started adding some basic statistics. So far, I have only implemented the simplest one (MU), which is collocation_freq divided by expected_collocation_frequency. The implementations are mainly based on Brezina et al.
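To make the ratio concrete, here is a minimal sketch of MU. It assumes the expected frequency is the usual independence estimate (node frequency times collocate frequency, divided by corpus size) and applies no window-size correction, which other tools may do. The function and parameter names are illustrative and are not taken from the PyCollocation code; the counts are made up.

```python
def expected_frequency(node_freq: int, collocate_freq: int, corpus_size: int) -> float:
    """Expected co-occurrence count if node and collocate were independent (assumption)."""
    return node_freq * collocate_freq / corpus_size


def mu(collocation_freq: int, node_freq: int, collocate_freq: int, corpus_size: int) -> float:
    """MU: observed collocation frequency divided by the expected one."""
    expected = expected_frequency(node_freq, collocate_freq, corpus_size)
    if expected == 0:
        raise ValueError("expected frequency is zero; MU is undefined")
    return collocation_freq / expected


# Made-up counts: the node occurs 50 times, the collocate 200 times,
# the corpus has 100,000 tokens, and the two co-occur 10 times.
print(mu(10, 50, 200, 100_000))
```

A value above 1 means the pair co-occurs more often than independence would predict; a value below 1 means less often.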

There is now an option to choose a statistic (the default is "freq", which just counts collocation frequencies as we did before). For testing, I compared our results with the ones I got from LancsBox when analyzing our test corpus. The numbers differ (they seem to have changed the function parameters), but the "clustering" seems to be correct. For the moment, you can see the results by running test.py.

I notice that our tiny project is becoming more and more elaborate, so we should soon start a round of restructuring and documenting; otherwise things might become very messy at some point. But first, we should maybe decide on the interfaces and/or the general workflow for how we want to operate our program. What do you think?

Statistics to implement

[x] MU
[ ] Z-Score
[ ] Log-Lik
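The two open items could look roughly like the sketch below. It is not the project's implementation: it assumes the common textbook definitions, with z-score as (O11 - E11) / sqrt(E11) and log-likelihood as 2 * sum(O_ij * ln(O_ij / E_ij)) over the 2x2 contingency table of node and collocate. All names are hypothetical, and no window-size correction is applied.

```python
import math


def z_score(o11: int, node_freq: int, collocate_freq: int, corpus_size: int) -> float:
    """Z-score: (observed - expected) / sqrt(expected). Undefined if expected is 0."""
    e11 = node_freq * collocate_freq / corpus_size
    return (o11 - e11) / math.sqrt(e11)


def log_likelihood(o11: int, node_freq: int, collocate_freq: int, corpus_size: int) -> float:
    """Log-likelihood over the 2x2 contingency table of node and collocate."""
    row_totals = (node_freq, corpus_size - node_freq)
    col_totals = (collocate_freq, corpus_size - collocate_freq)
    observed = (
        (o11, node_freq - o11),
        (collocate_freq - o11, corpus_size - node_freq - collocate_freq + o11),
    )
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / corpus_size
            if o > 0:  # a cell with zero observations contributes nothing
                total += o * math.log(o / e)
    return 2 * total


# Made-up counts, same shape as the MU inputs.
print(z_score(10, 50, 200, 100_000))
print(log_likelihood(10, 50, 200, 100_000))
```

When the observed table matches the expected one exactly, the log-likelihood is 0; it grows as the observed counts move away from independence.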

thomjur commented 2 years ago

I have also added the display to our analysis function for debugging. We can take it out once we know what our output should actually look like.

trutzig89182 commented 2 years ago

Perhaps we should do a video call at some point. The statistical stuff is way beyond what I usually use, but I am interested in learning, not only about Python but also about corpus linguistics. After all, my background is primarily qualitative social research. :)

thomjur commented 2 years ago

Yes, I think that is a good idea. I am not a computational linguist either, but I read/applied some stuff in my dissertation. I think association measures are really important because raw frequencies alone can be misleading. For example, a word may occur frequently in the context of another word only because it occurs frequently in the text overall, so it is not of much help when trying to identify the meaning/use of the search term. I think most of the statistics/association measures are ways to deal with this problem.
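A toy illustration of this point, with made-up counts and an MU-style ratio that assumes the simple independence estimate for the expected frequency (names and numbers are hypothetical, not from the project or its test corpus):

```python
# Made-up counts for illustration only.
NODE_FREQ = 50          # how often the search term occurs
CORPUS_SIZE = 100_000   # total number of tokens

# collocate -> (collocation frequency with the node, overall frequency in the corpus)
counts = {
    "the": (30, 6_000),
    "policy": (8, 40),
}


def mu(collocation_freq: int, collocate_freq: int) -> float:
    """Observed collocation frequency divided by the expected one."""
    expected = NODE_FREQ * collocate_freq / CORPUS_SIZE
    return collocation_freq / expected


# Rank the collocates once by raw frequency and once by MU.
by_freq = sorted(counts, key=lambda w: counts[w][0], reverse=True)
by_mu = sorted(counts, key=lambda w: mu(*counts[w]), reverse=True)

print(by_freq)
print(by_mu)
```

By raw frequency the very common word comes first; once its overall frequency is taken into account, the rarer but more specific collocate ranks higher.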

thomjur / PyCollocation

Statistics #14
