Closed: sculptor2 closed this issue 11 years ago
Please provide a minimal working example (MWE) so that I can understand the problem. If I understand you correctly, I don't believe you're correct, which tells me I don't understand what you mean.
Example with data frame tekstfile:
texts USER_CODE
X1 ik ben ik ben ik ben 17.007.395
X2 ik ben ik was ik ben 17.007.396
X3 ik ben en ik en 17.007.398
X4 en was en was en ik 17.007.399
X5 en ik was en 17.007.400
bigrams <- sapply(ngrams(tekstfile$texts)[[c("all_n", "n_2")]], paste, collapse=" ")
BIGRAMS <- termco(tekstfile$texts, tekstfile$USER_CODE, bigrams)[["raw"]]
## examine BIGRAMS. You will notice that there are
## e.g. three 'ben ik' bigrams. Shouldn't these be just one bigram?
freqs <- BIGRAMS[c(-1,-2)]
N <- 20
ords <- rev(sort(colSums(freqs)))[1:N]
top <- freqs[, names(ords)]
tekstfileFREQ <- data.frame(tekstfile, top, check.names = FALSE)
Examine tekstfileFREQ and you will see what I mean: bigrams appear as 'ben ik.1', 'ben ik.2', etc.
I am not sure if this is a bug or just an unwanted feature :-)
Can you use dput to recreate that data.frame above? Right now it's difficult to read in with read.table.
structure(list(texts = structure(c(4L, 5L, 3L, 2L, 1L), .Label = c("en ik was en",
"en was en was en ik", "ik ben en ik en", "ik ben ik ben ik ben",
"ik ben ik was ik ben"), class = "factor"), USER_CODE = structure(1:5, .Label = c("17.007.395",
"17.007.396", "17.007.398", "17.007.399", "17.007.400"), class = "factor")), .Names = c("texts",
"USER_CODE"), row.names = c(NA, -5L), class = "data.frame")
I hope this will do.
ngrams creates all possible ngrams, and some may be duplicates. This is useful behavior if someone is looking to get overall counts of ngrams. This is easy to change if you want: simply use unique, as in:
bigrams <- unique(sapply(ngrams(tekstfile$texts)[[c("all_n", "n_2")]], paste, collapse=" "))
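To see why the list of bigrams contains repeats in the first place, here is a minimal base R sketch that does not use qdap at all. The helper make_bigrams is hypothetical, written only for illustration, and splitting on single spaces is an assumption about the texts; it is not how ngrams is implemented.

```r
# Base R sketch (no qdap): build bigrams per text, then deduplicate.
texts <- c("ik ben ik ben ik ben", "ik ben ik was ik ben",
           "ik ben en ik en", "en was en was en ik", "en ik was en")

# Hypothetical helper: adjacent word pairs for a single string.
make_bigrams <- function(x) {
  w <- strsplit(x, " ", fixed = TRUE)[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

# Every occurrence of every bigram, repeats included.
all_bigrams <- unlist(lapply(texts, make_bigrams))

# One entry per distinct bigram: suitable as a list of search terms.
unique_bigrams <- unique(all_bigrams)

# Overall frequency of each distinct bigram across all texts.
bigram_counts <- table(all_bigrams)
```

The repeated vector is what you want for overall counts (bigram_counts), while the deduplicated vector is what you want when each bigram should become exactly one search term.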
The numbers appear next to the column names because this is the way data.frame works. If you have duplicates and don't like this, use check.names = FALSE (however, using unique you won't have duplicates).
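The suffixes can be reproduced with base R alone. The sketch below uses a small made-up count matrix (the values are illustrative, not taken from termco output) to show how duplicate names get numbered, how check.names = FALSE leaves them alone, and one possible way to collapse duplicate columns into a single summed column if they already exist.

```r
# Illustrative column names with repeats, as duplicated search terms would give.
dup_names <- c("ben ik", "ben ik", "ben ik", "ik ben")

# Base R makes repeated names unique by appending .1, .2, ...
suffixed <- make.unique(dup_names)

# Made-up counts: 2 texts (rows) by 4 columns, three of them named 'ben ik'.
counts <- matrix(c(1, 0,  1, 1,  0, 1,  2, 2), nrow = 2,
                 dimnames = list(NULL, dup_names))

# check.names = FALSE keeps the names exactly as given, duplicates included.
kept <- data.frame(counts, check.names = FALSE)

# One way to merge duplicate columns: sum them by column name.
collapsed <- t(rowsum(t(counts), group = colnames(counts)))
```

Here collapsed has one 'ben ik' column holding the summed frequency per text, which is the shape asked for in the original question. Deduplicating the search terms with unique beforehand avoids needing this step.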
Thank you so much!
If that answered your question you can close this issue.
The 'problem' I encounter with the termco function is that it doesn't always count bigrams the way I expected. If words are used more than once in a text, e.g. the word 'the' twice and the word 'and' twice, termco treats these as two different bigrams, 'the and.1' and 'the and.2', with a frequency of 1 each. I need (and expected) only 'the and' with a frequency of 2. Is there a way to change this termco outcome? Many thanks in advance!