quanteda / quanteda.textplots

Plotting and visualisation for quanteda
GNU General Public License v3.0

wordcloud comparison behaviour needs possible rethink #1

Open kbenoit opened 3 years ago

kbenoit commented 3 years ago

Until now I had not fully realized the implications of our reliance on code from wordcloud::comparison.cloud(), which according to its man page does the following:

Let p_{i,j} be the rate at which word i occurs in document j, and p_j be the average across documents (∑_i p_{i,j} / ndocs). The size of each word is mapped to its maximum deviation ( max_i(p_{i,j} - p_j) ), and its angular position is determined by the document where that maximum occurs.

So words that occur at the same rate across partitions are not mapped, and each word is mapped to only one partition. For instance, when comparing three groups where two talk a lot about "x" and a third about "y", group three will have "y" plotted for it, but only one of groups one and two will have "x". And if they use "x" at the same rate, neither will have it plotted.
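To make the three-group case concrete, here is a minimal base R sketch of the rule as the man page describes it. The counts and the names `counts`, `rates`, `deviation`, `word_size` and `word_doc` are made up for illustration, and this is not the code that wordcloud or quanteda actually runs.

```r
# Hypothetical counts: groups in rows, words in columns.
counts <- matrix(
  c(8, 1, 1,   # g1 talks a lot about "x"
    6, 1, 3,   # g2 also talks a lot about "x", at a slightly lower rate
    1, 8, 1),  # g3 talks a lot about "y"
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), c("x", "y", "z"))
)

# Rate of each word within each group (each row sums to 1).
rates <- counts / rowSums(counts)

# Deviation of each group's rate from the average rate across groups.
deviation <- sweep(rates, 2, colMeans(rates))

# Each word gets one size (its largest deviation) and one group
# (the group where that largest deviation occurs).
word_size <- apply(deviation, 2, max)
word_doc <- rownames(deviation)[apply(deviation, 2, which.max)]

print(round(deviation, 3))
print(data.frame(word = colnames(counts), group = word_doc,
                 size = round(word_size, 3), stringsAsFactors = FALSE))
```

With these counts, "x" is assigned to g1 only, even though g2 also uses "x" at an above-average rate (a deviation of 0.1 against 0.3 for g1).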

I can think of many reasons why we would want to change this behaviour, or at least provide alternative options.

library("quanteda")
## Package version: 2.1.2

dfmat <- as.dfm(
  matrix(c(
    1, 2, 3, 2, 1,
    3, 2, 1, 2, 3
  ),
  nrow = 2,
  dimnames = list(c("d1", "d2"), letters[1:5]), byrow = TRUE
  )
)
dfmat
## Document-feature matrix of: 2 documents, 5 features (0.0% sparse).
##     features
## docs a b c d e
##   d1 1 2 3 2 1
##   d2 3 2 1 2 3

# all are same size
textplot_wordcloud(dfmat, min_count = 1)


# three different sizes
textplot_wordcloud(dfmat[1, ], min_count = 1)


# no plot (an error instead) because identical documents have no "maximum deviation"
textplot_wordcloud(dfmat[c(1, 1), ], min_count = 1, comparison = TRUE)
## Error in graphics::strwidth(word[i], cex = size[i]): invalid 'cex' value


# was this what we were expecting?
textplot_wordcloud(dfmat, min_count = 1, comparison = TRUE)
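
As a rough check on what the last two calls should show, the sketch below applies the man page rule quoted above to the same counts in base R. `max_deviation()` is a hypothetical helper written for this illustration, not a function in quanteda or wordcloud, and it assumes the plot follows the man page description exactly.

```r
# Hypothetical helper, not part of quanteda or wordcloud: applies the rule
# from the comparison.cloud() man page to a documents-by-words count matrix.
max_deviation <- function(counts) {
  rates <- counts / rowSums(counts)              # within-document rates
  deviation <- sweep(rates, 2, colMeans(rates))  # rate minus across-document average
  data.frame(
    word = colnames(counts),
    document = rownames(counts)[apply(deviation, 2, which.max)],
    size = unname(apply(deviation, 2, max)),
    stringsAsFactors = FALSE
  )
}

# Same counts as dfmat above, as a plain matrix.
counts <- matrix(
  c(1, 2, 3, 2, 1,
    3, 2, 1, 2, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("d1", "d2"), letters[1:5])
)

# Two different documents: "b", "c", "d" go to d1; "a", "e" go to d2.
two_docs <- max_deviation(counts)
print(two_docs)

# The same document twice: every deviation, and so every size, is zero.
same_docs <- max_deviation(counts[c(1, 1), ])
print(same_docs)
```

If the plotting code then rescales sizes by their maximum (an assumption about the internals, not verified here), the all-zero case gives 0/0 = NaN, which would be consistent with the invalid 'cex' error above.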