sailuh / perceive

PERCEIVE is a project incubator inspired by Apache Incubator and Stack Exchange's Area 51. It serves as a staging repository for the project's early ideas.
http://sailuh.github.io/perceive
GNU General Public License v2.0

LDAvis throws error for some LDA models #84

Open carlosparadis opened 7 years ago

carlosparadis commented 7 years ago

The following error is displayed and no visualization is generated:

Error in stats::cmdscale(dist.mat, k = 2) : NA values not allowed in 'd'

Verified to occur with both the old and the new crawler, for the year 2013, in the months Feb, Apr, and Dec.

carlosparadis commented 7 years ago

The problem lies upstream, in the LDAvis package itself. See the issue opened on that project.

The problem can be circumvented by defining a replacement for the jsPCA function, which is the default value of the mds.method parameter of createJSON:

jsPCA <- function(phi) {
  # first, we compute a pairwise distance between topic distributions
  # using a symmetric version of KL-divergence
  # http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
  jensenShannon <- function(x, y) {
    m <- 0.5*(x + y)
    0.5*sum(x*log(x/m)) + 0.5*sum(y*log(y/m))
  }
  dist.mat <- proxy::dist(x = phi, method = jensenShannon)
  # then, we reduce the K by K proximity matrix down to K by 2 using PCA
  pca.fit <- stats::cmdscale(dist.mat, k = 2)
  data.frame(x = pca.fit[,1], y = pca.fit[,2])
}

Alternatively, we can fix the existing function by smoothing the zeros in the probability distributions.
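One way the smoothing route could look is sketched below. This is only a minimal sketch, not the upstream fix: the name jensenShannonSmoothed and the value of eps are assumptions made for illustration. It adds a small constant to every entry and renormalizes, so no term ever evaluates 0 * log(0). Note that the renormalization assumes the inputs are meant to be probability vectors.

```r
# Hypothetical smoothed variant of jensenShannon; eps is an assumed value.
jensenShannonSmoothed <- function(x, y, eps = 1e-12) {
  # Shift every entry away from zero, then renormalize to sum to 1.
  x <- (x + eps) / sum(x + eps)
  y <- (y + eps) / sum(y + eps)
  m <- 0.5 * (x + y)
  0.5 * sum(x * log(x / m)) + 0.5 * sum(y * log(y / m))
}

x <- c(0.2, 0.3, 0.3)
b <- c(0.2, 0.3, 0)
result <- jensenShannonSmoothed(x, b)  # finite instead of NaN
```

Because the inputs are renormalized, the values will not match the unsmoothed function exactly when the vectors do not already sum to 1.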


When executing createJSON, the following error will be thrown:

Error in stats::cmdscale(dist.mat, k = 2) : NA values not allowed in 'd'

I traced it down to:

https://github.com/cpsievert/LDAvis/blob/51bb51e6f2dd26c9d495a76482018d94a9945ddc/R/createJSON.R#L298-L304

To reproduce the issue:

Reproducible dataset

x <- c(0.2,0.3,0.3)
y <- c(0.2,0.3,0.4) 
b <- c(0.2,0.3,0) 

Using the LDAvis implementation of jensenShannon shown at the start of this issue:

> jensenShannon(x=x,y=y)
[1] 0.003583677
> jensenShannon(x=x,y=b)
[1] NaN
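
The NaN appears to come from the zero entry in b: in R, log(0) is -Inf and 0 * -Inf is NaN, which then propagates through sum(). A small sketch that isolates the offending term, using the same x and b as above (the variable names m and terms are only for illustration):

```r
x <- c(0.2, 0.3, 0.3)
b <- c(0.2, 0.3, 0)

m <- 0.5 * (x + b)
# Per-element terms of the second sum in jensenShannon.
terms <- b * log(b / m)

third <- terms[3]    # 0 * log(0), i.e. 0 * -Inf
total <- sum(terms)  # the NaN propagates through sum()
```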

The same test, using the cosine function from the lsa package:

> cosine(x=x,y=y)
          [,1]
[1,] 0.9897595
> cosine(x=x,y=b)
          [,1]
[1,] 0.7687061

carlosparadis commented 7 years ago

For usage, plotLDAVis(models[["Jan"]],as.gist=FALSE) now accepts a new parameter, topicSimilarityMethod, which takes an alternative to the default method used by createJSON:

plotLDAVis(models[["Jan"]],as.gist=FALSE,topicSimilarityMethod = CalculateTopicCosineSimilarity)

When the new function is passed through this parameter, the plot uses the cosine function from the lsa package, which is also the one used to compare topics between different months.
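
For readers without the project code at hand, here is a rough sketch of what a cosine-based replacement for jsPCA could look like. It is not the project's CalculateTopicCosineSimilarity: the name cosineMDS is hypothetical, and it uses base R instead of lsa so that it is self-contained. It scales each topic row to unit length and takes Euclidean distances between the scaled rows, which equal sqrt(2 * (1 - cosine similarity)); that keeps the input to cmdscale a proper Euclidean distance and tolerates zero entries.

```r
# Hypothetical stand-in for jsPCA; not the project's actual function.
cosineMDS <- function(phi) {
  # Scale each topic (row of the K x V matrix phi) to unit length.
  unit <- phi / sqrt(rowSums(phi^2))
  # Euclidean distance between unit rows is sqrt(2 * (1 - cosine)).
  dist.mat <- stats::dist(unit)
  # Reduce the K x K distances to K x 2 coordinates, as jsPCA does.
  fit <- stats::cmdscale(dist.mat, k = 2)
  data.frame(x = fit[, 1], y = fit[, 2])
}

# Toy topic-term matrix with a zero entry, which breaks jensenShannon.
phi <- rbind(c(0.2, 0.3, 0.5),
             c(0.1, 0.1, 0.8),
             c(0.5, 0.5, 0.0))
coords <- cosineMDS(phi)
```

A function of this shape could presumably be supplied as mds.method when calling createJSON directly.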

carlosparadis commented 6 years ago

The issue was fixed in the original (upstream) code. We should test it locally.