quanteda / quanteda.classifiers

quanteda textmodel extensions for classifying documents

Create data_corpus_manifestosentsUK #8

Closed · stefan-mueller closed this issue 5 years ago

stefan-mueller commented 5 years ago

@kbenoit: AppVeyor fails – seemingly because of a problem with ggrepel? All other checks have passed. Could you have a look and let me know what I should change to avoid this problem in the future?

kbenoit commented 5 years ago

I still think we need some changes to the crowd variables. I see this as having two components plus a third we should add.

Let's call these (example for immigration):

In the .Rd description of the crowd_immigration_mean, we explain the numeric value range and the labels that the coders saw.
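
For concreteness, a hypothetical roxygen2 sketch of such an entry (the wording is a placeholder; the real range and endpoint labels should be taken from the coding instrument):

#' \describe{
#'   \item{crowd_immigration_mean}{mean of the numeric crowd ratings of
#'     immigration policy direction; the documented range and the endpoint
#'     labels should match the scale shown to the coders, and the value is
#'     NA when the sentence was not coded as immigration policy}
#' }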

stefan-mueller commented 5 years ago

Thanks, @kbenoit. I tried to implement all of your suggestions. For the _mean variables I now clearly explain the possible range and the values for the two extremes. I also added the docvars crowd_immigration_n and crowd_econsocial_n and updated the documentation accordingly.

The most recent commits include the corpus object data_corpus_EPcoaldebate, along with the documentation of the object. Besides the docvars we discussed in #7, I also added coder_id and coder_trust as document-level variables.
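
A quick way to inspect the new document-level variables after installing the branch (a sketch; output omitted):

library(quanteda.classifiers)
head(docvars(data_corpus_EPcoaldebate)[, c("coder_id", "coder_trust")])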

kbenoit commented 5 years ago

Thanks @stefan-mueller, but I'm still not understanding the data here. I think we need only three crowd items for each sentence: the majority policy code (Econ, Social, Neither, NA); the policy mean, i.e. the average of the numeric scores iff the code is not Neither or NA; and the coder n for that sentence.
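
As a minimal sketch of that aggregation, assuming a coding-level data frame dat_codings with hypothetical columns sentence_id, policy_code, and policy_score (one row per crowd coding):

library(dplyr)

dat_sentences <- dat_codings %>%
    group_by(sentence_id) %>%
    summarise(
        # majority policy code across coders (ties broken by first maximum)
        crowd_label = names(which.max(table(policy_code))),
        # mean numeric score, only when the majority code is a policy area
        crowd_mean = if (crowd_label %in% c("Economic", "Social"))
            mean(policy_score, na.rm = TRUE) else NA_real_,
        # number of coders for this sentence
        crowd_n = n()
    )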

I cannot make sense of the "dir" variables here.

> subset(docvars(data_corpus_manifestosentsUK), !is.na(crowd_econsocial_label))[3, 1:8]
           party          partyname year crowd_econsocial_label crowd_econsocial_dir
Con_1987_3   Con Conservative Party 1987               Economic                Right
           crowd_econsocial_mean crowd_econsocial_dir_mean crowd_econsocial_n
Con_1987_3            -0.5333333                 0.6666667                 15

In the .Rd for the manifestos, crowd_econsocial_mean and crowd_immigration_mean are numerics, not integers.

Am I missing some important point on the dir variables? I think we should drop them.

On the n variables, it occurs to me that we might have had even more coders on each sentence than those contributing to the mean score, since some might have coded it as not part of that policy (and hence not provided a directional rating), but it's probably better to leave it as is.

stefan-mueller commented 5 years ago

Thanks a lot, @kbenoit, I really appreciate that you checked the corpus so carefully. I made minor changes to the branch. You can reproduce the examples below once you re-install the issue-7 branch.

Choice of variables

The crowd_econsocial_dir docvar contains the aggregated directional rating from the second step of the Econ/Social/Neither classification. I chose the labels based on Table 1 of Benoit et al. (2016). We can use this variable to train textmodel_affinity() and determine positions (see the sketch after the cross-tabulation below). Therefore, I don't think we should drop this variable.

library(quanteda.classifiers)

table(docvars(data_corpus_manifestosentsUK, "crowd_econsocial_dir"), 
      docvars(data_corpus_manifestosentsUK, "crowd_econsocial_label"),
      useNA = "always")
#>                                   
#>                                    Economic Not Economic or Social Social
#>   Conservative                            0                      0   1448
#>   Neither left nor right               1089                      0      0
#>   Neither liberal nor conservative        0                      0    570
#>   Right                                3050                      0      0
#>   Very conservative                       0                      0    370
#>   Very left                             527                      0      0
#>   Very liberal                            0                      0    756
#>   Very right                            494                      0      0
#>   <NA>                                 4047                   3385   2527
#>                                   
#>                                     <NA>
#>   Conservative                         0
#>   Neither left nor right               0
#>   Neither liberal nor conservative     0
#>   Right                                0
#>   Very conservative                    0
#>   Very left                            0
#>   Very liberal                         0
#>   Very right                           0
#>   <NA>                             51017
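
For illustration, a minimal sketch of how crowd_econsocial_dir could serve as training classes for an affinity model (assuming textmodel_affinity() is available, e.g. from quanteda or quanteda.textmodels; the choice of "Very left" and "Very right" as reference classes is just an example):

library(quanteda)

# restrict to sentences coded as economic policy
corp_econ <- corpus_subset(data_corpus_manifestosentsUK,
                           !is.na(crowd_econsocial_label) &
                               crowd_econsocial_label == "Economic")
dfmat_econ <- dfm(tokens(corp_econ))

# use clearly directional sentences as training classes; all other
# sentences are scored by the fitted model
y <- as.character(docvars(corp_econ, "crowd_econsocial_dir"))
y[!y %in% c("Very left", "Very right")] <- NA

tmod_aff <- textmodel_affinity(dfmat_econ, y = y)
pred_aff <- predict(tmod_aff)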

I agree that crowd_econsocial_dir_mean and crowd_immigration_dir_mean are not very informative, as they are only used to create the levels of crowd_econsocial_dir and crowd_immigration_dir, so I dropped these variables. But we could add crowd_econsocial_dir_sd and crowd_immigration_dir_sd, which would contain the standard deviation of the numeric direction codings for each sentence. This could be used to filter out "ambiguous" directional sentences.
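
If we added such a variable, filtering could look like this (a hypothetical sketch; crowd_econsocial_dir_sd does not exist yet and the cutoff of 1 is arbitrary):

corp_clear <- corpus_subset(
    data_corpus_manifestosentsUK,
    !is.na(crowd_econsocial_dir_sd) & crowd_econsocial_dir_sd < 1
)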

I corrected the errors in the documentation of the corpora in terms of numeric/integer variables.

Number of codings

Generally, for the calculation of crowd_econsocial_n and crowd_immigration_n I included all codings that contributed to the mean score. However, there is a problem with the available data for Econ/Social/Neither, which I describe below.

Immigration

Table 4 (Benoit et al. 2016: 291) states that you collected 49,225 crowd codings. When I include the coalition agreement in the corpus (I had excluded it in the previous version but have added it now), all 49,225 codings are considered for the aggregation.

# check how many crowd codings are used for the aggregation (should be 49,225)
sum(docvars(data_corpus_manifestosentsUK, "crowd_immigration_n"), na.rm = TRUE)
#> [1] 49225

Econ/Social/Neither

For the Econ/Social/Neither I used all available files in CFjobresults.zip from the replication repository.
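
Roughly, the combination step looks like this (a sketch, assuming the ZIP is extracted to a folder of CSV files and that screener rows are flagged in a column named "_golden"; the actual file layout and column names in the CrowdFlower exports may differ):

files <- list.files("CFjobresults", pattern = "\\.csv$", full.names = TRUE)
dat_codings <- do.call(rbind,
                       lapply(files, read.csv, check.names = FALSE,
                              stringsAsFactors = FALSE))
# exclude screener (gold) codings before aggregating
dat_codings <- dat_codings[!dat_codings[["_golden"]], ]
nrow(dat_codings)  # compare with the total reported in Table 1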

After excluding screeners, I get exactly the same number of unique sentences but fewer total codings: nrow() of the combined dataset equals 184,603 (see line 23 of create_data_corpus_manifestosentUK.R), whereas Table 1 reports 215,107 crowd codings. Do you have an additional file that is not included in this ZIP file?

However, the number of unique sentences in the corpus exactly matches Table 1 (18,263 sentences), so all sentences are included in our corpus.

# check the number of total sentences coded in terms of Econ/Social/Neither 
# (should be 18,263; see Table 1)
data_corpus_manifestosentsUK %>% 
    corpus_subset(!is.na(crowd_econsocial_label)) %>% 
    ndoc()
#> [1] 18263

stefan-mueller commented 5 years ago

Please have a look at the update, @kbenoit. I now include only the variables that are absolutely necessary and have removed the manually recoded _dir variables. The positional variables are now numeric and are called crowd_econsocial_mean and crowd_immigration_mean.

Could you also have a look at ?data_corpus_manifestosentsUK and check whether the variable description is comprehensible?