I have collected all manifestos from 2015, 2017, and 2019, and updated the code by replacing corpus_reshape() with spacy_tokenize(what = "sentence").
This change improves the segmentation considerably.
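For reference, a minimal sketch of that re-segmentation step (not the exact code used: the source corpus name data_corpus_manifestosUK and the spaCy model are illustrative assumptions):

library(quanteda)
library(spacyr)
spacy_initialize(model = "en_core_web_sm")  # model name is an assumption

# one character vector of sentences per manifesto
sentlist <- spacy_tokenize(as.character(data_corpus_manifestosUK),
                           what = "sentence")

# rebuild as a corpus with one sentence per document, recording the
# source manifesto in a docvar
data_corpus_manifestosentsUK <- corpus(
    unlist(sentlist),
    docvars = data.frame(manifesto = rep(names(sentlist), lengths(sentlist)))
)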
library(quanteda)

# how many "sentences" remain suspiciously long (more than 200 tokens)?
length(which(ntoken(data_corpus_manifestosentsUK) > 200))
#> [1] 3
# and how long are they?
median(ntoken(data_corpus_manifestosentsUK[which(ntoken(data_corpus_manifestosentsUK) > 200)]))
#> [1] 213
max(ntoken(data_corpus_manifestosentsUK[which(ntoken(data_corpus_manifestosentsUK) > 200)]))
#> [1] 221
However, there are still very short sentences in the corpus which are not segmented correctly.
length(which(ntoken(data_corpus_manifestosentsUK) < 3))
#> [1] 3273
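One way to eyeball these fragments (a sketch; output omitted since it depends on the corpus build):

toks <- ntoken(data_corpus_manifestosentsUK)
head(as.character(data_corpus_manifestosentsUK)[toks < 3])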
@kbenoit: should we set a threshold and remove all "sentences" with ntoken < 3 (for instance), or should we keep all sentences in data_corpus_manifestosentsUK and remove the short ones when preparing the data for our analysis?
Great, please make this a PR with the code you used to source the documents and update/segment them, which can go in tests/data_creation.
We don't want to remove any sentences from the source corpus object, but yes, these will be removed before coding. (That's exactly what quanteda::corpus_trim() is designed for, but with the documents already segmented into sentences, we won't need corpus_trim(); we can just corpus_subset() based on document (sentence) length in characters or tokens.)
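To illustrate that filtering step (a sketch, assuming the ntoken < 3 threshold floated above; the derived object names are illustrative):

library(quanteda)

# keep only "sentences" of at least 3 tokens for the coding stage
data_corpus_coding <- corpus_subset(
    data_corpus_manifestosentsUK,
    ntoken(data_corpus_manifestosentsUK) >= 3
)

# corpus_trim() achieves the same at the document level
data_corpus_coding2 <- corpus_trim(
    data_corpus_manifestosentsUK,
    what = "documents",
    min_ntoken = 3
)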
[x] Add manifestos from 2015, 2017, 2019
[x] Use spacyr to tokenize all manifestos except 2010 into sentences, replacing the existing units. Keep 2010 as is, since those were coded that way; some of the existing sentences clearly did not get segmented correctly.