quanteda / quanteda.classifiers

quanteda textmodel extensions for classifying documents

Update UK election manifesto dataset #15

Closed kbenoit closed 4 years ago

kbenoit commented 4 years ago

> length(which(ntoken(data_corpus_manifestosentsUK) > 200))
[1] 103
> median(ntoken(data_corpus_manifestosentsUK[which(ntoken(data_corpus_manifestosentsUK) > 200)]))
[1] 235
> max(ntoken(data_corpus_manifestosentsUK[which(ntoken(data_corpus_manifestosentsUK) > 200)]))
[1] 1025

stefan-mueller commented 4 years ago

I have collected all manifestos from 2015, 2017, and 2019, and updated the code by replacing corpus_reshape() with spacy_tokenize(what = "sentence").

This change improves the segmentation considerably.

length(which(ntoken(data_corpus_manifestosentsUK) > 200))
#> [1] 3

median(ntoken(data_corpus_manifestosentsUK[which(ntoken(data_corpus_manifestosentsUK) > 200)]))
#> [1] 213

max(ntoken(data_corpus_manifestosentsUK[which(ntoken(data_corpus_manifestosentsUK) > 200)]))
#> [1] 221

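For readers who want to see the shape of the change, here is a minimal sketch of sentence segmentation with spacy_tokenize(what = "sentence") followed by rebuilding a sentence-level corpus. It is not the code that will go into the PR. The helper names sentences_to_corpus() and segment_with_spacy(), the docvar names, and the model "en_core_web_sm" are assumptions made for illustration, and the wrapper assumes a named input (a corpus or a named character vector) and a working spaCy installation.

```r
library(quanteda)

# Hypothetical helper: convert the named list returned by
# spacyr::spacy_tokenize(what = "sentence") into a corpus with
# one document per sentence, recording where each sentence came from.
sentences_to_corpus <- function(sents) {
  corpus(
    unlist(sents, use.names = FALSE),
    docvars = data.frame(
      source_doc  = rep(names(sents), lengths(sents)),
      sentence_no = unlist(lapply(sents, seq_along), use.names = FALSE)
    )
  )
}

# Hypothetical wrapper: needs spacyr and an installed spaCy language model.
# The model name is an assumption, not something specified in this thread.
segment_with_spacy <- function(x, model = "en_core_web_sm") {
  spacyr::spacy_initialize(model = model)
  on.exit(spacyr::spacy_finalize(), add = TRUE)
  sentences_to_corpus(spacyr::spacy_tokenize(x, what = "sentence"))
}
```

Calling segment_with_spacy() on the manifesto-level corpus would then return the sentence-level corpus; the conversion step can be checked without spaCy by passing a hand-made named list to sentences_to_corpus().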

However, there are still very short sentences in the corpus which are not segmented correctly.

length(which(ntoken(data_corpus_manifestosentsUK) < 3))
#> [1] 3273

@kbenoit: Should we set a threshold and remove all "sentences" with ntoken < 3 (for instance), or should we keep all sentences in data_corpus_manifestosentsUK and remove the short ones when preparing the data for our analysis?

kbenoit commented 4 years ago

Great, please make this a PR that includes the code you used to source, update, and segment the documents. That code can go in tests/data_creation.

We don't want to remove any sentences from the source corpus object, but yes, these will be removed before coding. (That's exactly what quanteda::corpus_trim() is designed for, but with the documents already segmented into sentences we won't need corpus_trim(): we can just corpus_subset() based on document (sentence) length in characters or tokens.)
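
As a rough sketch of that filtering step, the following uses a toy corpus in place of data_corpus_manifestosentsUK. The example sentences, the object names, and both thresholds are assumptions for illustration; the token threshold simply mirrors the "ntoken < 3" example above.

```r
library(quanteda)

# Toy stand-in for data_corpus_manifestosentsUK (illustrative text, not real data)
corp <- corpus(c(
  s1 = "Yes.",
  s2 = "We will invest in schools and hospitals.",
  s3 = "2."
))

# Tokens per sentence-document; tokenising first works across quanteda versions
n_tok <- ntoken(tokens(corp))

# Leave the source corpus untouched and build a filtered copy for coding
min_tokens <- 3  # hypothetical threshold
corp_coding <- corpus_subset(corp, n_tok >= min_tokens)

# The same idea using length in characters (threshold again hypothetical)
corp_coding_chars <- corpus_subset(corp, nchar(as.character(corp)) >= 10)
```

Because corpus_subset() returns a new object, the full set of sentences stays available in the source corpus and the threshold can be changed later without re-running the segmentation.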