How to aggregate controversiality score to the cluster level

lauratolosi commented 8 years ago

The following query returns the number of support, deny, and question tweets for each cluster. If you want, you can restrict it to a specific cluster.

PREFIX pheme: <http://www.pheme.eu/ontology/pheme#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

select ?eventId ?support (count(?support) as ?count)
where {
  ?a a pheme:Tweet .
  ?a pheme:eventId ?eventId .
  ?a pheme:sdq ?support .
  ?a pheme:version "v7" .
  FILTER (xsd:integer(?eventId) > -1) .
}
group by ?eventId ?support
order by ?eventId
limit 100
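The query returns one row per (cluster, label) pair, so the rows need pivoting before a per-cluster score can be computed. A minimal sketch of that step, assuming the `pheme:sdq` values come back as the strings "support", "deny" and "question" (the thread does not show the actual literals) and that the rows have already been fetched as plain tuples. `pivot_counts` is a hypothetical helper, and the sample rows reuse counts from the example table later in this thread:

```python
from collections import defaultdict

# Assumption: these are the three pheme:sdq label strings; adjust to the real values.
LABELS = ("support", "deny", "question")


def pivot_counts(rows):
    """Turn (eventId, sdq label, count) rows into {eventId: {label: count}}.

    Labels missing for a cluster stay at 0; rows with an unknown label are skipped.
    """
    clusters = defaultdict(lambda: dict.fromkeys(LABELS, 0))
    for event_id, label, count in rows:
        if label in LABELS:
            clusters[event_id][label] += int(count)
    return dict(clusters)


# Rows shaped like the query output (?eventId, ?support, ?count), values as strings.
rows = [
    ("1184", "support", "22"),
    ("1184", "deny", "16"),
    ("1184", "question", "17"),
    ("119", "support", "65"),
]
counts = pivot_counts(rows)
print(counts)
```

Filling absent labels with 0 matters here: a cluster with only support tweets produces a single row, but the score still needs all three counts.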

lauratolosi commented 8 years ago

Now, assume that a cluster has S support, D deny, and Q question tweets, for a total of C = S + D + Q. Non-controversiality is given by:

score_of_noncontroversiality = 1/3* ( (S/C - 1/3)^2 + (D/C - 1/3)^2 + (Q/C - 1/3)^2)

If the score is large, the theme is non-controversial. If the score is close to 0, the theme is controversial.

This is based on the chi-squared test for uniform discrete distributions. I will write a formal argument when needed.
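The formula can be sketched in Python as follows. The function name `noncontroversiality` and the choice to raise an error for an empty cluster are assumptions for illustration; the thread does not say how C = 0 should be handled:

```python
def noncontroversiality(support: int, deny: int, question: int) -> float:
    """Mean squared deviation of the S/C, D/C, Q/C proportions from 1/3."""
    total = support + deny + question
    if total == 0:
        # Assumption: a cluster without any labelled tweets has no defined score.
        raise ValueError("cluster has no support/deny/question tweets")
    return sum((n / total - 1 / 3) ** 2 for n in (support, deny, question)) / 3


# An evenly split cluster is as controversial as it gets (score 0);
# a cluster dominated by one label gets a larger score.
even = noncontroversiality(10, 10, 10)
one_sided = noncontroversiality(65, 0, 0)
print(even, one_sided)
```

The score depends only on the proportions, not on the cluster size, so a cluster with 3 tweets and one with 3000 tweets get the same value if their splits match.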

lauratolosi commented 8 years ago

Here are some example scores that I computed on about 50 themes, together with the number of support, deny, and question tweets in each. They are sorted by controversiality, from most controversial to least controversial. The last column is the score above, in increasing order (careful here: a high score means low controversiality, it's the other way around :) )

clusterID  support deny question       score
1184      22   16       17 0.002277319
1109    5885 4695     3334 0.005610682
1214      12   34       12 0.031972519
1150      15    9        3 0.032921811
106      264   40      142 0.042152913
1142      44   16        9 0.048029126
1070      20   15        1 0.049897119
1149     234    9      191 0.050494218
1195      13   10        0 0.058391094
1228      47    8       15 0.058820862
1023      14    2        5 0.058956916
1051      21   30        0 0.060745867
1002      50    5       16 0.072780974
1096      31   15        0 0.075719387
1076     133   37       10 0.085987654
1062      19    7        0 0.091058514
1012      26    7        1 0.098231449
1232      27    5        2 0.107458670
1024      16   65        1 0.111078062
1071      33    5        3 0.111573799
1167     144   27        5 0.120143193
1043      25    0        5 0.129629630
1050      20    4        0 0.129629630
1041      31    6        0 0.131645159
1018       6   32        0 0.133579563
1225     110   13        5 0.139010959
1162      25    1        3 0.140573391
1159       0   27        4 0.147300266
1025     194   25        3 0.147728810
1058      44    6        0 0.151822222
1164      46    5        1 0.152942143
1217     124   11        1 0.168192522
1165      21    2        0 0.169292166
0         33    2        1 0.170267490
1161      23    2        0 0.173155556
1205      82    6        1 0.173406837
1156      24    1        1 0.173898751
1180      57    3        1 0.180835498
1193     934   58        3 0.183739692
1237      78    3        1 0.190990812
1192       1   42        1 0.192952250
1091       1   61        1 0.201562106
1230     280    7        0 0.206358649
1200       3  329        0 0.216252560
1042    7101    1        0 0.222128365
119       65    0        0 0.222222222
1047      40    0        0 0.222222222
1075     110    0        0 0.222222222
1118      83    0        0 0.222222222
1136      80    0        0 0.222222222

tuxpiper commented 8 years ago

We need to revisit the formula so that we return controversiality

lauratolosi commented 8 years ago

So, good news: the score is bounded between 0 and 0.222222... (which is actually 2/9). To turn the score around, just compute 1 - 9/2 * score. It will be 0 for the least controversial and 1 for the most controversial. Tadaaa.. math magic :)
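The bound follows from the formula: since the three proportions sum to 1, the sum of squared deviations equals (S/C)^2 + (D/C)^2 + (Q/C)^2 - 1/3, and the sum of squares lies between 1/3 (even split) and 1 (a single label), so the score lies between 0 and 2/9. A minimal sketch of the rescaled score, with `controversiality` as a hypothetical function name and the empty-cluster error as an assumption:

```python
def controversiality(support: int, deny: int, question: int) -> float:
    """Rescaled score: 1 for an even three-way split, 0 when one label has everything."""
    total = support + deny + question
    if total == 0:
        # Assumption: a cluster without any labelled tweets has no defined score.
        raise ValueError("cluster has no support/deny/question tweets")
    score = sum((n / total - 1 / 3) ** 2 for n in (support, deny, question)) / 3
    return 1 - 9 / 2 * score


# Counts taken from the example table in this thread.
most = controversiality(22, 16, 17)   # cluster 1184, first row of the table
least = controversiality(65, 0, 0)    # cluster 119, only support tweets
print(most, least)
```

Because of floating-point rounding, the one-sided case may come out as a tiny nonzero number rather than exactly 0.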

tuxpiper commented 8 years ago

Oh great, thanks @lauratolosi !

project-pheme / project-pheme-data-interface

How to aggregate controversiality score to the cluster level #15