Closed stefan-mueller closed 5 years ago
@kbenoit: AppVeyor fails – seemingly because of a problem with ggrepel? All other checks have passed. Could you have a look and let me know what I should change to avoid this problem in the future?
I still think we need some changes to the crowd variables. I see this as having two components plus a third we should add.
NA means this sentence was not coded.
Let's call these (example for immigration):
crowd_immigration_label; make this a factor "Immigration", "Not Immigration"
crowd_immigration_mean; numeric from -1 to +1, NA means not coded
crowd_immigration_n; integer representing the number of coders who contributed to the mean score for this sentence
In the .Rd description of crowd_immigration_mean, we explain the numeric value range and the labels that the coders saw.
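A minimal sketch of how these three docvars could be derived, assuming a hypothetical long-format data frame `codings` with one row per coder judgement. The column names `sentence_id`, `is_immigration` and `score` are invented for illustration and are not taken from the replication data; the simple-majority rule for the label is also an assumption.

```r
# Hypothetical toy codings: one row per coder judgement (not real data).
codings <- data.frame(
  sentence_id = c("s1", "s1", "s1", "s2", "s2", "s3"),
  is_immigration = c(TRUE, TRUE, FALSE, FALSE, FALSE, NA),
  score = c(1, -1, NA, NA, NA, NA),  # -1 to +1, only given for immigration codings
  stringsAsFactors = FALSE
)

# Build one row of aggregates per sentence.
by_sentence <- split(codings, codings$sentence_id)
crowd <- do.call(rbind, lapply(by_sentence, function(d) {
  coded <- d[!is.na(d$is_immigration), ]
  scores <- coded$score[!is.na(coded$score)]
  data.frame(
    # Assumption: the label is decided by a simple majority of coders.
    crowd_immigration_label = if (nrow(coded) == 0) NA_character_
      else if (mean(coded$is_immigration) > 0.5) "Immigration"
      else "Not Immigration",
    crowd_immigration_mean = if (length(scores) > 0) mean(scores) else NA_real_,
    # Number of coders who contributed a score; NA if the sentence was not coded.
    crowd_immigration_n = if (nrow(coded) == 0) NA_integer_ else length(scores),
    stringsAsFactors = FALSE
  )
}))

# Store the label as a factor with the two agreed levels.
crowd$crowd_immigration_label <- factor(
  crowd$crowd_immigration_label,
  levels = c("Immigration", "Not Immigration")
)
```

Uncoded sentences (here `s3`) end up with `NA` in all three columns, which matches the convention that `NA` means "not coded".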
Thanks, @kbenoit. I tried to implement all of your suggestions. For the _mean variables I now clearly explain the possible range and the meaning of the two extreme values. I also added the docvars crowd_immigration_n and crowd_econsocial_n and updated the documentation accordingly.
The most recent commits include the corpus object data_corpus_EPcoaldebate, along with the documentation of the object. Besides the docvars we discussed in #7, I also added coder_id and coder_trust as document-level variables.
Thanks @stefan-mueller, but I'm still not understanding the data here. I think we need for each sentence only three crowd items: majority policy code (Econ, Social, Neither, NA), policy mean - the average of the numeric scores iff code is not Neither or NA, and coder n for that sentence.
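A rough sketch of the three items per sentence, using toy data. The data frame `codings`, its column names, the tie-breaking rule, and the choice to average only the scores of coders who picked the majority code are all assumptions made for illustration, not the procedure used for the corpus.

```r
# Hypothetical toy codings: each row is one coder's judgement of one sentence.
codings <- data.frame(
  sentence_id = c("a", "a", "a", "b", "b", "b"),
  policy = c("Econ", "Econ", "Neither", "Neither", "Neither", "Social"),
  score = c(-1, 0.5, NA, NA, NA, 1),
  stringsAsFactors = FALSE
)

aggregate_sentence <- function(d) {
  counts <- table(d$policy)
  # Majority code; ties fall to the first level in table order (arbitrary choice).
  code <- names(counts)[which.max(counts)]
  # Assumption: only scores from coders who chose the majority code are averaged.
  scores <- d$score[d$policy == code & !is.na(d$score)]
  directional <- code != "Neither" && length(scores) > 0
  data.frame(
    code = code,
    mean = if (directional) mean(scores) else NA_real_,
    n = if (directional) length(scores) else NA_integer_,
    stringsAsFactors = FALSE
  )
}

crowd <- do.call(
  rbind,
  lapply(split(codings, codings$sentence_id), aggregate_sentence)
)
```

Sentences without any codings never appear in `codings`, so they would receive `NA` for all three items once `crowd` is merged back onto the full set of sentences.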
I cannot make sense of the "dir" variables here.
> subset(docvars(data_corpus_manifestosentsUK), !is.na(crowd_econsocial_label))[3, 1:8]
party partyname year crowd_econsocial_label crowd_econsocial_dir
Con_1987_3 Con Conservative Party 1987 Economic Right
crowd_econsocial_mean crowd_econsocial_dir_mean crowd_econsocial_n
Con_1987_3 -0.5333333 0.6666667 15
In the Rd for the manifestos, crowd_econsocial_mean and crowd_immigration_mean are numerics, not integers.
Am I missing some important point on the dir variables? I think we should drop them.
On the n variables, it occurs to me that we might have had even more coders on each sentence than those contributing to the mean score, since some might have coded it as not part of that policy (and hence not provided a directional rating), but it's probably better to leave it as is.
Thanks a lot, @kbenoit, I really appreciate that you checked the corpus so carefully. I made minor changes to the branch. You can reproduce the examples below after re-installing the branch issue-7.
The crowd_econsocial_dir docvar contains the aggregated ranking for the second step of the classification for Social/Econ/Neither. I chose the labels based on Table 1 of Benoit et al. (2016). We can use this variable to train textmodel_affinity() and determine positions. Therefore, I don't think we should drop this variable.
library(quanteda.classifiers)
table(docvars(data_corpus_manifestosentsUK, "crowd_econsocial_dir"),
      docvars(data_corpus_manifestosentsUK, "crowd_econsocial_label"),
      useNA = "always")
#>
#>                                    Economic Not Economic or Social Social
#>   Conservative                            0                      0   1448
#>   Neither left nor right               1089                      0      0
#>   Neither liberal nor conservative        0                      0    570
#>   Right                                3050                      0      0
#>   Very conservative                       0                      0    370
#>   Very left                             527                      0      0
#>   Very liberal                            0                      0    756
#>   Very right                            494                      0      0
#>   <NA>                                 4047                   3385   2527
#>
#>                                     <NA>
#>   Conservative                         0
#>   Neither left nor right               0
#>   Neither liberal nor conservative     0
#>   Right                                0
#>   Very conservative                    0
#>   Very left                            0
#>   Very liberal                         0
#>   Very right                           0
#>   <NA>                             51017
I agree that crowd_econsocial_dir_mean and crowd_immigration_dir_mean are not very informative, as they are just used to create the levels of crowd_econsocial_dir and crowd_immigration_dir, and I dropped these variables. But we could add crowd_econsocial_dir_sd and crowd_immigration_dir_sd, which would contain the standard deviation of the numeric direction variables. This could be used for filtering "ambiguous" directional sentences.
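A small sketch of what such an _sd variable and the filtering step could look like. The data frame `codings`, its column names, and the cutoff value are hypothetical and chosen only for illustration.

```r
# Hypothetical per-coding directional scores (toy data, not from the corpus).
codings <- data.frame(
  sentence_id = c("s1", "s1", "s1", "s2", "s2", "s2"),
  dir_score = c(1, 1, 0.5, -1, 1, 0)
)

# Standard deviation of the directional scores within each sentence.
dir_sd <- tapply(codings$dir_score, codings$sentence_id, sd)

# Flag sentences on which coders disagree strongly.
# The cutoff is an arbitrary assumption; sd() is NA for single-coder sentences.
sd_cutoff <- 0.75
ambiguous <- names(dir_sd)[!is.na(dir_sd) & dir_sd > sd_cutoff]
```

Here the coders of `s2` gave -1, 1 and 0, so its standard deviation is high and it would be flagged, while `s1` would be kept.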
I corrected the errors in the documentation of the corpora in terms of numeric/integer variables.
Generally, for the calculation of crowd_econsocial_n and crowd_immigration_n I included all codings that contributed to the mean score. [Yet, there is a problem with the available data for Econ/Social/Neither.]
Table 4 (Benoit et al. 2016: 291) states that you collected 49,225 crowd codings. With the coalition agreement included in the corpus (I excluded it in the previous version, but have added it now), all 49,225 codings enter the aggregation.
# check how many crowd codings are used for the aggregation (should be 49,225)
sum(docvars(data_corpus_manifestosentsUK, "crowd_immigration_n"), na.rm = TRUE)
#> [1] 49225
For the Econ/Social/Neither codings I used all available files in CFjobresults.zip from the replication repository.
After excluding screeners, I get the exact same number of unique sentences, but fewer total codings: nrow() of the combined dataset equals 184,603 (see line 23 of create_data_corpus_manifestosentUK.R), but Table 1 reports 215,107 crowd codings. Do you have an additional file that is not included in this ZIP file?
However, the number of unique sentences in the corpus exactly matches the figure in Table 1 (18,263 sentences), so all sentences are included in our corpus.
# check the number of total sentences coded in terms of Econ/Social/Neither
# (should be 18,263; see Table 1)
data_corpus_manifestosentsUK %>%
  corpus_subset(!is.na(crowd_econsocial_label)) %>%
  ndoc()
#> [1] 18263
Please have a look at the update, @kbenoit. I now include only the variables that are absolutely necessary and removed the manually recoded _dir variables. The positional variables are now numeric and are called crowd_econsocial_mean and crowd_immigration_mean.
Could you also have a look at ?data_corpus_manifestosentsUK and check whether the variable description is comprehensible?
?data_corpus_manifestosentsUK). Please have a look at the corpus and let me know whether something is missing or unclear.