I have collected all manifestos from 2015, 2017, and 2019, and updated the code by replacing corpus_reshape() with spacy_tokenize(what = "sentence").
This change improves the segmentation considerably.
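For reference, a minimal sketch of that re-segmentation step (not the exact code used: the source corpus name data_corpus_manifestosUK and the spaCy model are illustrative assumptions):

library(quanteda)
library(spacyr)
spacy_initialize(model = "en_core_web_sm")  # model name is an assumption

# one character vector of sentences per manifesto
sentlist <- spacy_tokenize(as.character(data_corpus_manifestosUK),
                           what = "sentence")

# rebuild as a corpus with one sentence per document, recording the
# source manifesto in a docvar
data_corpus_manifestosentsUK <- corpus(
    unlist(sentlist),
    docvars = data.frame(manifesto = rep(names(sentlist), lengths(sentlist)))
)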
library(quanteda)

# how many "sentences" remain suspiciously long (more than 200 tokens)?
length(which(ntoken(data_corpus_manifestosentsUK) > 200))
#> [1] 3
# and how long are they?
median(ntoken(data_corpus_manifestosentsUK[which(ntoken(data_corpus_manifestosentsUK) > 200)]))
#> [1] 213
max(ntoken(data_corpus_manifestosentsUK[which(ntoken(data_corpus_manifestosentsUK) > 200)]))
#> [1] 221
However, there are still very short sentences in the corpus which are not segmented correctly.
length(which(ntoken(data_corpus_manifestosentsUK) < 3))
#> [1] 3273
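One way to eyeball these fragments (a sketch; output omitted since it depends on the corpus build):

toks <- ntoken(data_corpus_manifestosentsUK)
head(as.character(data_corpus_manifestosentsUK)[toks < 3])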
@kbenoit: should we set a threshold and remove all "sentences" with ntoken < 3 (for instance), or should we keep all sentences in data_corpus_manifestosentsUK and remove the short ones when preparing the data for our analysis?
Great, please make this a PR with the code you used to source the documents and update/segment them, which can go in tests/data_creation.
We don't want to remove any sentences from the source corpus object, but yes, these will be removed before coding. (That's exactly what quanteda::corpus_trim() is designed for, but with the documents already segmented into sentences, we won't need corpus_trim(); we can just corpus_subset() based on document (sentence) length in characters or tokens.)
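To illustrate that filtering step (a sketch, assuming the ntoken < 3 threshold floated above; the derived object names are illustrative):

library(quanteda)

# keep only "sentences" of at least 3 tokens for the coding stage
data_corpus_coding <- corpus_subset(
    data_corpus_manifestosentsUK,
    ntoken(data_corpus_manifestosentsUK) >= 3
)

# corpus_trim() achieves the same at the document level
data_corpus_coding2 <- corpus_trim(
    data_corpus_manifestosentsUK,
    what = "documents",
    min_ntoken = 3
)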
[x] Add manifestos from 2015, 2017, 2019
[x] Use spacyr to tokenize all manifestos except 2010 into sentences, replacing the existing units. Keep 2010 as is, since those were coded that way; some of the existing sentences clearly did not get segmented correctly.