juliasilge / widyr

Widen, process, and re-tidy a dataset
http://juliasilge.github.io/widyr/

Inf and -Inf from pairwise_cor #7

Open wendywangwwt opened 7 years ago

wendywangwwt commented 7 years ago

Hi,

I'm using widyr for a text mining homework assignment, in which I'm asked to calculate word associations in NY Times articles.

The input data frame has one row per word (unigram), with the document index and the author name. I use the following code to calculate pairwise correlations for each author and then pick out the rows for trump.

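library(dplyr)
library(widyr)

l.cor <- list()  # one correlation table per author
idx <- 0
for (name in authors) {
  idx <- idx + 1
  # pairwise correlation of words across this author's documents
  l.cor[[idx]] <- d.tc %>%
    filter(author == name) %>%
    pairwise_cor(word, document) %>%
    filter(!is.na(correlation))
}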
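Because the correlation is computed across documents within each author, it can help to check how many distinct documents each author contributes before calling pairwise_cor. The sketch below uses base R only and made-up data; the column names word, document and author follow the post, while the values are hypothetical and not taken from the attached file.

```r
# Hypothetical stand-in for d.tc (not the attached data)
d.tc <- data.frame(
  word     = c("trump", "ad", "trump", "bad", "trump", "ad"),
  document = c(1, 1, 2, 2, 3, 3),
  author   = c("A", "A", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Number of distinct documents per author
docs.per.author <- tapply(d.tc$document, d.tc$author,
                          function(d) length(unique(d)))
print(docs.per.author)
```

An author with a single document gives every word zero variance across documents, so a correlation is not defined for that author. Whether that applies to the data in this issue is not something the post establishes.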

trump.cor <- rbind(l.cor[[1]] %>%
                     filter(item1 == "trump") %>%
                     mutate(author = authors[1]),
                   l.cor[[2]] %>%
                     filter(item1 == "trump") %>%
                     mutate(author = authors[2]),
                   l.cor[[3]] %>%
                     filter(item1 == "trump") %>%
                     mutate(author = authors[3]),
                   l.cor[[4]] %>%
                     filter(item1 == "trump") %>%
                     mutate(author = authors[4]),
                   l.cor[[5]] %>%
                     filter(item1 == "trump") %>%
                     mutate(author = authors[5]))
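The five near-identical blocks above could be collapsed into one step that works for any number of authors. This is a minimal base-R sketch, not the code from the post: tag_author is a hypothetical helper, and l.cor and authors are filled with toy values so the example runs on its own.

```r
# Toy stand-ins for the objects in the post (hypothetical values)
authors <- c("A", "B")
l.cor <- list(
  data.frame(item1 = c("trump", "obama"), item2 = c("ad", "ad"),
             correlation = c(0.10, 0.20), stringsAsFactors = FALSE),
  data.frame(item1 = c("trump", "trump"), item2 = c("bad", "bring"),
             correlation = c(0.30, Inf), stringsAsFactors = FALSE)
)

# Keep the "trump" rows of one per-author table and label them with the author
tag_author <- function(df, author) {
  out <- df[df$item1 == "trump", , drop = FALSE]
  out$author <- rep(author, nrow(out))
  out
}

# Apply the helper to each (table, author) pair and stack the results
trump.cor <- do.call(rbind, Map(tag_author, l.cor, authors))
rownames(trump.cor) <- NULL
print(trump.cor)
```

Pairing each table with its author through Map avoids hard-coding the indices 1 to 5.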

The result contains Inf and -Inf values:

> trump.cor[which(trump.cor$correlation==Inf),]
# A tibble: 38 × 4
    item1    item2 correlation             author
   <fctr>   <fctr>       <dbl>             <fctr>
1   trump       ad         Inf Thomas L. Friedman
2   trump american         Inf Thomas L. Friedman
3   trump      ani         Inf Thomas L. Friedman
4   trump    anoth         Inf Thomas L. Friedman
5   trump      bad         Inf Thomas L. Friedman
6   trump    bring         Inf Thomas L. Friedman
7   trump   candid         Inf Thomas L. Friedman
8   trump   common         Inf Thomas L. Friedman
9   trump  connect         Inf Thomas L. Friedman
10  trump democrat         Inf Thomas L. Friedman
# ... with 28 more rows
> summary(trump.cor)
       item1            item2        correlation                     author    
 trump    :20908   ad      :    5   Min.   :   -Inf   David Brooks      :4710  
 a        :    0   american:    5   1st Qu.:0.02592   Maureen Dowd      :5909  
 aaron    :    0   ani     :    5   Median :0.03043   Nicholas Kristof  :5877  
 aarondmil:    0   anoth   :    5   Mean   :    NaN   Paul Krugman      :4372  
 aarp     :    0   bad     :    5   3rd Qu.:0.04327   Thomas L. Friedman:  40  
 ababa    :    0   bring   :    5   Max.   :    Inf                            
 (Other)  :    0   (Other) :20878                                              
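Note that the filter correlation == Inf matches only positive infinity, while the summary also shows -Inf and a NaN mean. One way to catch every non-finite value at once is is.finite, sketched here in base R on a toy table; the values are hypothetical and only mimic the shape of trump.cor.

```r
# Hypothetical miniature of trump.cor
trump.cor <- data.frame(
  item2       = c("ad", "american", "ani", "anoth"),
  correlation = c(Inf, -Inf, 0.03, NaN),
  stringsAsFactors = FALSE
)

# Rows whose correlation is Inf, -Inf, NaN or NA
non_finite <- trump.cor[!is.finite(trump.cor$correlation), ]
print(non_finite)

# Summary statistics on the finite values only
finite_only <- trump.cor[is.finite(trump.cor$correlation), ]
print(mean(finite_only$correlation))
```

Filtering with is.finite before summarising keeps the mean from becoming NaN.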

For anyone who wants to reproduce my result, the R data file (read it with readRDS) is attached: trump clinton.zip