trinker / qdap

Quantitative Discourse Analysis Package: Bridging the gap between qualitative data and quantitative analysis
http://cran.us.r-project.org/web/packages/qdap/index.html

missing values in termco.h #147

Closed trinker closed 10 years ago

trinker commented 10 years ago

Sent via email by Matt Williamson:

I am using your qdap package to search for and count key words in text fields associated with various entries from the Federal Register. Unfortunately, data standards for the FR are not well-enforced so some fields have NA values which seem to cause issues when running termco.h (which I used based on some StackOverflow examples). Here is the code I am using (which works fine on fields where every record has an entry):

ACT.NM <- qdap:::termco.h(SWCSC.MTG$ACT, "national monument", ignore.case = TRUE, seq_along(SWCSC.MTG$ACT))[,3]
where SWCSC.MTG$ACT is the field of interest within a dataframe (SWCSC.MTG) that has 6519 records.

and here is the error:

Error in data.frame(z[[1]][1], Z[, 2], lapply(seq_along(z), function(x) z[[x]][,  : 
  arguments imply differing number of rows: 6519, 6360

I am assuming that there is some internal function call that is having a problem with the NA values in the ACT field. Is there a way to work around this so that those rows that do not contain the term receive a 0 and those rows that have NA are identified with NA? Any help you can offer would be much appreciated.

trinker commented 10 years ago

Can I ask that in the future you use qdap's issues page as directed in the help manual: https://github.com/trinker/qdap/issues?state=open? This allows others to see a problem and (a) help solve it and (b) learn from the solutions of others.

Solving this problem is difficult if it can't be reproduced. I don't have your data set, so it's difficult to wrap my head around what you're after. You could make a dummy data set with missing values. Something along the lines of:

library(qdap)
DATA[c(3, 8), 4] <- NA
DATA

##        person sex adult                                 state code
## 1         sam   m     0         Computer is fun. Not too fun.   K1
## 2        greg   m     0               No it's not, it's dumb.   K2
## 3     teacher   m     1                                  <NA>   K3
## 4         sam   m     0                  You liar, it stinks!   K4
## 5        greg   m     0               I am telling the truth!   K5
## 6       sally   f     0                How can we be certain?   K6
## 7        greg   m     0                      There is no way.   K7
## 8         sam   m     0                                  <NA>   K8
## 9       sally   f     0           What are you talking about?   K9
## 10 researcher   f     1         Shall we move on?  Good then.  K10
## 11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11

Then I can run your code. Right now I don't understand why you'd use the non-exported function (termco.h) instead of termco itself. I also don't know what version of qdap you're using. Running packageDescription("qdap")["Version"] will provide that information.

Using the example I gave above I reproduced your error:

DATA$x <- factor(seq_along(DATA$state))
with(DATA, qdap:::termco.h(state, "the", x))

I have made a fix in the development version of qdap with the line:

X[is.na(X[, "Y"]), "Y"] <- 0
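To show what that line does in isolation, here is a toy sketch. The data frame X below and its id column are made up for illustration; they are not qdap's internal objects, only a stand-in with a count column named Y that contains missing values.

```r
# Hypothetical stand-in for an intermediate count table (not qdap's internals)
X <- data.frame(id = 1:4, Y = c(2, NA, 0, NA))

# Same idiom as the fix: wherever the count is missing, set it to 0,
# so no rows are lost when the pieces are later combined
X[is.na(X[, "Y"]), "Y"] <- 0

X$Y
```

Note that this turns missing counts into 0 rather than keeping them as NA.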

You'll have to install devtools and use the development version to get the update (see: https://github.com/trinker/qdap#installation). However, I'd suggest using the termco function instead, as it's more elegant. Here is an example:

with(DATA, termco(state, seq_along(DATA$state), "the"))
## Just the counts:
with(DATA, termco(state, seq_along(DATA$state), "the"))$raw[, 3]
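If the goal is a 0 for rows that lack the term and an NA for rows whose text is missing, a base R sketch along these lines may help. It does not use qdap; count_term is a hypothetical helper written for this example, it treats the term as a regular expression, and its counts may not match termco's in every case.

```r
# Hypothetical helper (not part of qdap): count matches of `term` per element,
# returning NA where the text itself is NA and 0 where there is no match.
count_term <- function(text, term, ignore.case = TRUE) {
  text <- as.character(text)
  hits <- gregexpr(term, text, ignore.case = ignore.case)
  # gregexpr gives -1 for "no match", so count only positive match positions
  counts <- vapply(hits, function(m) sum(m > 0), integer(1))
  # Make the NA handling explicit rather than relying on gregexpr's NA output
  counts[is.na(text)] <- NA_integer_
  counts
}

# Made-up example text, not the Federal Register data
txt <- c("The National Monument and another national monument.",
         NA,
         "No match here.")
res <- count_term(txt, "national monument")
res
```

The result has one entry per input row, so it can be assigned straight to a new column of the original data frame.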