Closed stefan-mueller closed 5 years ago
Having decided to include a sentence-level corpus of all UK manifestos, I have merged `quanteda.corpora::data_corpus_ukmanifestos` with the aggregated crowdcoded data from the replication material of the 2016 APSR paper. I also found 13 manifestos from the 2015 and 2017 elections at http://polidoc.net, which I reshaped to the sentence level and added to the corpus. Overall, the corpus consists of 68,000 sentences.
We need to give the corpus a name that is distinguishable from `quanteda::data_char_ukimmig2010` and `quanteda.corpora::data_corpus_ukmanifestos`.
What about `data_corpus_ukmanifestosentences`? Do you have ideas for a less confusing name, @kbenoit? Once we have decided what to call the corpus, I will make a PR that also includes documentation of the new corpus and the relevant document-level variables.
I think we have resolved this now, but @stefan-mueller I'll let you decide and close. The idea of including the EP debate on coal is a good one too. How about calling it `data_corpus_epcoaldebate`? (or `data_corpus_EPcoaldebate`...)
If we make each unit a coding, then we will repeat some sentences. If we make each unit a sentence, then we will need to report majority categories and mean positions, as in https://github.com/quanteda/quanteda.classifiers/pull/8. Which did you have in mind?
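To make the trade-off concrete, here is a minimal base R sketch of collapsing coding-level rows into one row per sentence. The data frame, its column names (`label`, `position`) and the helper `aggregate_sentence()` are hypothetical and only illustrate the idea; they are not taken from the package or the replication data. The sketch reports the majority label, the mean position, and the share of codings that match the majority label.

```r
# Toy coding-level data: one row per crowd coding (all values are made up).
codings <- data.frame(
  sentence_id = c("s1", "s1", "s1", "s2", "s2", "s2"),
  label       = c("Economic", "Economic", "Neither",
                  "Social", "Social", "Social"),
  position    = c(-1, 1, NA, 2, 1, 0),
  stringsAsFactors = FALSE
)

# Summarise all codings that belong to one sentence (hypothetical helper).
aggregate_sentence <- function(d) {
  counts <- table(d$label)
  # which.max() picks the first maximum, so ties go to the
  # alphabetically first label; a real pipeline may want another rule.
  majority <- names(counts)[which.max(counts)]
  data.frame(
    sentence_id   = d$sentence_id[1],
    majority      = majority,
    agreement     = max(counts) / nrow(d),           # share matching the majority
    mean_position = mean(d$position, na.rm = TRUE),  # ignores missing positions
    stringsAsFactors = FALSE
  )
}

# Split by sentence, summarise each group, and bind the results together.
sentences <- do.call(
  rbind,
  lapply(split(codings, codings$sentence_id), aggregate_sentence)
)
rownames(sentences) <- NULL
print(sentences)
```

Keeping one row per coding avoids having to pick a tie-breaking rule at all, at the cost of repeating each sentence once per coder.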
Good idea, @kbenoit. I will add `data_corpus_EPcoaldebate`, but for this corpus (in contrast to `data_corpus_manifestosentsUK`) we will have one observation per coding, not one observation per sentence.
What about the following names for document-level variables?
- `sentence_id`: string containing the language and sentence number (e.g. `en_1`).
- `speaker`: name of the MEP
- `coder_subsidy`: sentence-level coding by crowd worker ("Pro-subsidy"/"Anti-subsidy"/"Neutral/inapplicable")
- `vote`: vote by speaker on the bill ("For"/"Against")
- `language`: language of transcript

Anything else I should consider before adding this corpus?
A short update to make it more concrete: I have prepared the corpus and created the following document-level variables. Any suggestions before I add this to #8?
```r
library(quanteda.classifiers)
names(docvars(data_corpus_EPcoaldebate))
#> [1] "sentence_id" "coder_subsidy" "language" "name_last"
#> [5] "name_first" "ep_group" "country" "vote"
summary(data_corpus_EPcoaldebate, 8)
#> Corpus consisting of 16806 documents, showing 8 documents:
#>
#> Text Types Tokens Sentences sentence_id
#> EN_Rapkay_1_1 18 26 1 English_Rapkay_1
#> EN_Rapkay_1_2 18 26 1 English_Rapkay_1
#> EN_Rapkay_1_3 18 26 1 English_Rapkay_1
#> EN_Rapkay_2_1 11 11 1 English_Rapkay_2
#> EN_Rapkay_2_2 11 11 1 English_Rapkay_2
#> EN_Rapkay_2_3 11 11 1 English_Rapkay_2
#> EN_Rapkay_3_1 14 16 1 English_Rapkay_3
#> EN_Rapkay_3_2 14 16 1 English_Rapkay_3
#> coder_subsidy language name_last name_first ep_group country
#> Neutral or inapplicable English Rapkay Bernhard S&D Germany
#> Neutral or inapplicable English Rapkay Bernhard S&D Germany
#> Neutral or inapplicable English Rapkay Bernhard S&D Germany
#> Pro-Subsidy English Rapkay Bernhard S&D Germany
#> Neutral or inapplicable English Rapkay Bernhard S&D Germany
#> Neutral or inapplicable English Rapkay Bernhard S&D Germany
#> Neutral or inapplicable English Rapkay Bernhard S&D Germany
#> Neutral or inapplicable English Rapkay Bernhard S&D Germany
#> vote
#> For
#> For
#> For
#> For
#> For
#> For
#> For
#> For
#>
#> Source: /Users/smueller/Documents/GitHub/quanteda.classifiers/* on x86_64 by smueller
#> Created: Fri May 10 01:03:44 2019
#> Notes:
```
For consistency, let's call the code variable `crowd_subsidy_label`. The question and coding combined position and topic in the original paper, so let's also do that here.
For the examples in the documentation, README and a vignette, it might make sense to add one or two of the crowdcoded corpora from the 2016 APSR paper and aggregate the judgements to the level of sentences.
We would have one observation per sentence with a factor variable indicating the class and another factor variable indicating the direction/position, along with the proportion of agreement between the coders (ranging from 0 to 1). Happy to take care of this. We could add:
- `data_corpus_economicsocial`: Corpus of the crowdcoded UK manifesto sentences, coded in terms of economic policy, social policy, or neither. With this corpus, we can also train and predict the labels for different subsets (e.g., election/party/decade).
- `data_corpus_euspeeches`: Multilingual corpus with the European Parliament debate about coal subsidies, which can be filtered by language using `corpus_subset`.

These examples might be more suitable for classification tasks than predicting party affiliation or government/opposition for the 14 speeches in `data_corpus_dailnoconf1991`.
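As a rough illustration of the language filtering mentioned above, the sketch below builds a tiny stand-in corpus and subsets it with `corpus_subset()`. The texts, document names and the object `toy` are made up for this example; only the `language` docvar name follows the discussion, and the real corpus may differ.

```r
library(quanteda)

# Toy stand-in for the multilingual corpus; texts and docvars are made up.
toy <- corpus(
  c(en_1 = "Coal subsidies should end.",
    en_2 = "The regions need more time.",
    de_1 = "Die Kohlesubventionen sollten auslaufen."),
  docvars = data.frame(language = c("English", "English", "German"))
)

# Keep only the English-language documents.
toy_en <- corpus_subset(toy, language == "English")

ndoc(toy_en)      # number of remaining documents
docnames(toy_en)  # names of the remaining documents
```

The same call should work on any corpus that carries a `language` document-level variable, so a classifier can be trained on one language at a time.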