trinker / qdap

Quantitative Discourse Analysis Package: Bridging the gap between qualitative data and quantitative analysis
http://cran.us.r-project.org/web/packages/qdap/index.html

Multiple ngrams not treated as one and the same #123

Closed: sculptor2 closed this issue 11 years ago

sculptor2 commented 11 years ago

The 'problem' I encounter with the termco function is that it doesn't always count bigrams the way I expected. If a text contains repeated words, e.g. the word 'the' twice and the word 'and' twice, termco treats the resulting bigrams as two different bigrams, 'the and.1' and 'the and.2', each with a frequency of 1. I need (and expected) only 'the and' with a frequency of 2. Is there a way to change this termco outcome? Many thanks in advance!

trinker commented 11 years ago

Please provide a minimal working example (MWE) so that I can understand the problem. If I understand you correctly, I don't believe you're right, which tells me I probably don't understand what you mean.

sculptor2 commented 11 years ago

Example with the data frame tekstfile:

                  texts  USER_CODE
X1 ik ben ik ben ik ben 17.007.395
X2 ik ben ik was ik ben 17.007.396
X3      ik ben en ik en 17.007.398
X4  en was en was en ik 17.007.399
X5         en ik was en 17.007.400

bigrams <- sapply(ngrams(tekstfile$texts)[[c("all_n", "n_2")]], paste, collapse=" ")
BIGRAMS <- termco(tekstfile$texts, tekstfile$USER_CODE, bigrams)[["raw"]]

## Examine BIGRAMS. You will notice that there are,
## e.g., three 'ben ik' bigrams. Shouldn't these be just one bigram?

freqs <- BIGRAMS[c(-1,-2)]
N <- 20
ords <- rev(sort(colSums(freqs)))[1:N] 
top <- freqs[, names(ords)]
tekstfileFREQ <- data.frame(tekstfile, top, check.names = FALSE)

Examine tekstfileFREQ and you will see what I mean: bigrams such as 'ben ik.1', 'ben ik.2', etc.

I am not sure if this is a bug or just an unwanted feature :-)

trinker commented 11 years ago

Can you use dput to recreate that data.frame above? Right now it's difficult to read in with read.table.

sculptor2 commented 11 years ago

structure(list(texts = structure(c(4L, 5L, 3L, 2L, 1L), .Label = c("en ik was en", 
"en was en was en ik", "ik ben en ik en", "ik ben ik ben ik ben", 
"ik ben ik was ik ben"), class = "factor"), USER_CODE = structure(1:5, .Label = c("17.007.395", 
"17.007.396", "17.007.398", "17.007.399", "17.007.400"), class = "factor")), .Names = c("texts", 
"USER_CODE"), row.names = c(NA, -5L), class = "data.frame")

I hope this will do.

trinker commented 11 years ago

ngrams creates all possible ngrams, and some may be duplicates. This is useful behavior if someone wants overall counts of ngrams. It is easy to change if you want: simply use unique, as in:

bigrams <- unique(sapply(ngrams(tekstfile$texts)[[c("all_n", "n_2")]], paste, collapse=" "))
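
To see why duplicates arise and what unique does to them, here is a minimal base R sketch. It does not call qdap at all: make_bigrams is a hypothetical helper that stands in for the bigram part of ngrams, and it is only meant to illustrate the idea on the first text from the example data.

```r
# Hypothetical stand-in for the bigram step (NOT part of qdap):
# pair every word with the word that follows it.
make_bigrams <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# A text with repeated words yields repeated bigrams.
all_bigrams <- make_bigrams("ik ben ik ben ik ben")
print(all_bigrams)
# "ik ben" "ben ik" "ik ben" "ben ik" "ik ben"

# unique() keeps one copy of each bigram, in order of first appearance.
unique_bigrams <- unique(all_bigrams)
print(unique_bigrams)
# "ik ben" "ben ik"

# table() gives the overall count per distinct bigram.
bigram_counts <- table(all_bigrams)
print(bigram_counts)
```

Passing the de-duplicated vector as the search terms means each bigram is looked up once, so its matches end up in a single column instead of being spread over several identically named ones.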

The numbers appear next to the column names because this is the way data.frame works: duplicated column names get a numeric suffix. If you have duplicates and don't like this, use check.names = FALSE (however, if you use unique you won't have duplicates).
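
The renaming can be reproduced in base R without qdap. The sketch below uses made-up column names and values. The '.1' and '.2' suffixes from the report match what make.unique produces, but whether termco calls it internally is an assumption that this thread does not confirm.

```r
# Suffixes of the kind seen in the report: base R's make.unique()
# appends .1, .2, ... to repeated strings.
dup_names <- c("ben ik", "ben ik", "ben ik")
suffixed <- make.unique(dup_names)
print(suffixed)
# "ben ik" "ben ik.1" "ben ik.2"

# data.frame() de-duplicates column names by default (check.names = TRUE).
checked <- data.frame(a = 1, a = 2)
print(names(checked))
# "a" "a.1"

# With check.names = FALSE the duplicated names are kept as given.
unchecked <- data.frame(a = 1, a = 2, check.names = FALSE)
print(names(unchecked))
# "a" "a"
```

Keeping duplicated column names makes selection by name ambiguous, which is one reason to de-duplicate the search terms rather than rely on check.names = FALSE.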

sculptor2 commented 11 years ago

Thank you so much!

trinker commented 11 years ago

If that answered your question you can close this issue.