Add example corpus/corpora

stefan-mueller commented 5 years ago

For the examples in the documentation, README and a vignette, it might make sense to add one or two of the crowdcoded corpora from the 2016 APSR paper and aggregate the judgements to the level of sentences.

We would have one observation per sentence with a factor variable indicating the class and another factor variable indicating the direction/position, along with the proportion of agreement between the coders (ranging from 0 to 1). Happy to take care of this. We could add:

data_corpus_economicsocial: Corpus of the crowdcoded UK manifesto sentences, coded in terms of economic policy, social policy, or neither. With this corpus, we can also train and predict the labels for different subsets (e.g., election/party/decade).
data_corpus_euspeeches: Multilingual corpus with the European Parliament debate about coal subsidies, which can be filtered by language using corpus_subset.

These corpora might be more suitable for demonstrating classification tasks than predicting party affiliation or government/opposition for the 14 speeches in data_corpus_dailnoconf1991.
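
A minimal sketch of the sentence-level aggregation described above, in base R. The data frame, its column names, the class labels and the tie-breaking rule are assumptions for illustration only, not the actual replication data or the package's implementation:

```r
# Toy crowd judgements, one row per coding (hypothetical data).
codings <- data.frame(
  sentence_id = c("s1", "s1", "s1", "s2", "s2", "s2", "s2"),
  class = c("Economic", "Economic", "Social",
            "Neither", "Neither", "Neither", "Neither"),
  stringsAsFactors = FALSE
)

# For one sentence: the most frequent class and the share of coders
# who chose it. Ties go to the alphabetically first class (an
# arbitrary choice made for this sketch).
aggregate_sentence <- function(x) {
  counts <- table(x)
  data.frame(
    class = names(counts)[which.max(counts)],
    agreement = max(counts) / length(x),
    stringsAsFactors = FALSE
  )
}

# Apply per sentence and bind the one-row results together.
by_sentence <- lapply(split(codings$class, codings$sentence_id),
                      aggregate_sentence)
sentences <- do.call(rbind, by_sentence)
sentences$sentence_id <- names(by_sentence)
rownames(sentences) <- NULL

# Store the class as a factor, as proposed for the corpus.
sentences$class <- factor(sentences$class,
                          levels = c("Economic", "Social", "Neither"))
sentences <- sentences[, c("sentence_id", "class", "agreement")]
print(sentences)
```

The agreement column is bounded between 0 and 1 by construction; a direction/position variable could be aggregated in the same way.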

stefan-mueller commented 5 years ago

Since we decided to include a sentence-level corpus of all UK manifestos, I have merged quanteda.corpora::data_corpus_ukmanifestos with the aggregated crowdcoded data from the replication material of the 2016 APSR paper. I also found 13 manifestos from the 2015 and 2017 elections at http://polidoc.net, which I reshaped to the sentence level and added to the corpus. Overall, the corpus consists of 68,000 sentences.

We need to give the corpus a name that is clearly distinguishable from quanteda::data_char_ukimmig2010 and quanteda.corpora::data_corpus_ukmanifestos.

What about data_corpus_ukmanifestosentences? Do you have ideas for a less confusing name, @kbenoit? Once we have decided what to call the corpus, I will open a PR that also includes documentation of the new corpus and the relevant document-level variables.

kbenoit commented 5 years ago

I think we have resolved this now, but @stefan-mueller I'll let you decide and close. The idea of including the EP debate on coal is a good one too. How about calling it data_corpus_epcoaldebate? (or data_corpus_EPcoaldebate...)

If we make each unit a coding, then we will repeat some sentences. If we make each unit a sentence, then we will need to report majority categories and mean positions, as in https://github.com/quanteda/quanteda.classifiers/pull/8. Which did you have in mind?
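
To make the trade-off concrete, here is a rough base R sketch of the two layouts. The texts, the numeric position scale and all column names are made up for illustration and are not taken from the PR:

```r
# Layout 1: one row per coding. The sentence text repeats once per
# coder (hypothetical toy data; the position scale is an assumption).
codings <- data.frame(
  sentence_id = c("s1", "s1", "s1", "s2", "s2"),
  text = c(rep("We will cut taxes.", 3),
           rep("We will fund schools.", 2)),
  position = c(1, 2, 1, -1, -2),
  stringsAsFactors = FALSE
)

# Layout 2: one row per sentence. Codings collapse to a mean position.
sentences <- aggregate(position ~ sentence_id + text,
                       data = codings, FUN = mean)
names(sentences)[names(sentences) == "position"] <- "position_mean"

# Keep the number of codings behind each mean, since it differs
# across sentences.
n <- table(codings$sentence_id)
sentences$n_codings <- as.integer(n[as.character(sentences$sentence_id)])
print(sentences)
```

The coding-level layout keeps every individual judgement but inflates the number of documents; the sentence-level layout is compact but has to carry summary statistics instead.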

stefan-mueller commented 5 years ago

Good idea, @kbenoit. I will add data_corpus_EPcoaldebate, but for this corpus (in contrast to data_corpus_manifestosentsUK) we will have one observation per coding, not one observation per sentence.

What about the following names for document-level variables?

sentence_id: string containing the language and sentence number (e.g. en_1).
speaker: name of the MEP
coder_subsidy: sentence-level coding by crowd worker ("Pro-subsidy"/"Anti-subsidy"/"Neutral/inapplicable")
vote: vote by speaker on the bill ("For"/"Against")
language: language of transcript

Anything else I should consider before adding this corpus?

stefan-mueller commented 5 years ago

A short update to make it more concrete: I have prepared the corpus and created the following document-level variables. Any suggestions before I add this to #8?

library(quanteda.classifiers)

names(docvars(data_corpus_EPcoaldebate))
#> [1] "sentence_id"   "coder_subsidy" "language"      "name_last"    
#> [5] "name_first"    "ep_group"      "country"       "vote"

summary(data_corpus_EPcoaldebate, 8)
#> Corpus consisting of 16806 documents, showing 8 documents:
#> 
#>           Text Types Tokens Sentences      sentence_id
#>  EN_Rapkay_1_1    18     26         1 English_Rapkay_1
#>  EN_Rapkay_1_2    18     26         1 English_Rapkay_1
#>  EN_Rapkay_1_3    18     26         1 English_Rapkay_1
#>  EN_Rapkay_2_1    11     11         1 English_Rapkay_2
#>  EN_Rapkay_2_2    11     11         1 English_Rapkay_2
#>  EN_Rapkay_2_3    11     11         1 English_Rapkay_2
#>  EN_Rapkay_3_1    14     16         1 English_Rapkay_3
#>  EN_Rapkay_3_2    14     16         1 English_Rapkay_3
#>            coder_subsidy language name_last name_first ep_group country
#>  Neutral or inapplicable  English    Rapkay   Bernhard      S&D Germany
#>  Neutral or inapplicable  English    Rapkay   Bernhard      S&D Germany
#>  Neutral or inapplicable  English    Rapkay   Bernhard      S&D Germany
#>              Pro-Subsidy  English    Rapkay   Bernhard      S&D Germany
#>  Neutral or inapplicable  English    Rapkay   Bernhard      S&D Germany
#>  Neutral or inapplicable  English    Rapkay   Bernhard      S&D Germany
#>  Neutral or inapplicable  English    Rapkay   Bernhard      S&D Germany
#>  Neutral or inapplicable  English    Rapkay   Bernhard      S&D Germany
#>  vote
#>   For
#>   For
#>   For
#>   For
#>   For
#>   For
#>   For
#>   For
#> 
#> Source: /Users/smueller/Documents/GitHub/quanteda.classifiers/* on x86_64 by smueller
#> Created: Fri May 10 01:03:44 2019
#> Notes:
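
In the output above, the document names appear to be the sentence identifier plus a running coding number (an inference from the eight rows shown, not a documented convention). Under that assumption, a small base R sketch for splitting the two parts:

```r
# Document names copied from the summary output above.
docnames <- c("EN_Rapkay_1_1", "EN_Rapkay_1_2", "EN_Rapkay_1_3",
              "EN_Rapkay_2_1", "EN_Rapkay_2_2", "EN_Rapkay_2_3",
              "EN_Rapkay_3_1", "EN_Rapkay_3_2")

# Assumption: the final "_<digits>" suffix is the coding number.
sentence <- sub("_[0-9]+$", "", docnames)           # drop the suffix
coding_no <- as.integer(sub("^.*_", "", docnames))  # keep only the suffix

# Number of codings per sentence among the rows shown.
print(table(sentence))
```

Because only the first eight documents are printed, the count for the last sentence reflects the truncated display rather than the full corpus.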

kbenoit commented 5 years ago

For consistency, let's call the code variable crowd_subsidy_label. In the original paper, the question and coding combined position and topic, so let's also do that here.

quanteda / quanteda.classifiers

Add example corpus/corpora #7