Closed kbenoit closed 3 years ago
The main issue is that the kullback distance does not make a lot of sense when the data contains a lot of zeroes. Even one zero already results in NaN
s and Inf
s:
P <- c(0, 1, 2, 3)
Q <- c(4, 3, 4, 5)
P <- P / sum(P)
Q <- Q / sum(Q)
sum(P * log(P / Q))
# [1] NaN
sum(Q * log(Q / P))
# [1] Inf
Sometimes this is solved by leaving out the values which contain zeros, which is the approach that proxyC is using (see pair.cpp#48). The negative values are a result of the sums being calculated before removing the zero values:
P <- c(0, 1, 2, 3)
Q <- c(4, 3, 4, 5)
P <- P / sum(P)
Q <- Q / sum(Q)
ix <- P != 0 & Q != 0
P <- P[ix]
Q <- Q[ix]
sum(P * log(P / Q))
# [1] 0.3112653
sum(Q * log(Q / P))
# [1] -0.1967123
I've added a test (test-dist.R#L54) to at least verify that at least no negative values are being generated.
I think think it makes most sense to simply return NaN when at least one of the values contains a zero.
This would make the results of proxyC
resemble those of proxy
the most.
If you would then want to compute the kullback distance on the given dataset, I would assume I've seen each word at least once, and compute distmat <- proxyC::dist(mt + 1, margin = 2, method = "kullback")
instead (which, agreed, would make mt + 1
be non-sparse). See implementation: 2a90bd6.
By calculating the sum only after the values have been filtered will solve this specific problem, though the accuracy of the result could be questioned. See implementation: 277bd43.
Current proxyC:
library("quanteda")
mt <- dfm(corpus_subset(data_corpus_inaugural, Year > 2000))
mt <- mt[,c("president", "clinton", "my", "fellow", "mount", "varied")]
mt
# Document-feature matrix of: 5 documents, 6 features (33.3% sparse).
# 5 x 6 sparse Matrix of class "dfm"
# features
# docs president clinton my fellow mount varied
# 2001-Bush 3 2 3 1 0 0
# 2005-Bush 4 1 2 3 1 1
# 2009-Obama 1 0 2 1 0 0
# 2013-Obama 2 0 3 3 0 0
# 2017-Trump 5 1 1 1 0 0
proxyC::dist(mt, margin = 2, method = "kullback")
# 6 x 6 sparse Matrix of class "dgTMatrix"
# president clinton my fellow mount varied
# president 0.0000000 -0.07015385 0.3108918 0.2680293 -0.3524682 -0.3524682
# clinton 0.3700902 0.00000000 0.6355816 0.8828507 -0.3465736 -0.3465736
# my 0.2744238 -0.31517416 0.0000000 0.1512567 -0.3099542 -0.3099542
# fellow 0.2491922 -0.16132904 0.1367413 0.0000000 -0.3662041 -0.3662041
# mount 1.3217558 1.38629436 1.7047481 1.0986123 0.0000000 0.0000000
# varied 1.3217558 1.38629436 1.7047481 1.0986123 0.0000000 0.0000000
proxy, or proxyC with solution 1 (2a90bd6):
devtools::install_github("koheiw/proxyC#2a90bd6") # restart R session afterwards
proxyC::dist(mt, margin = 2, method = "kullback")
# 6 x 6 sparse Matrix of class "dgTMatrix"
# president clinton my fellow mount varied
# president 0.0000000 NaN 0.3108918 0.2680293 NaN NaN
# clinton NaN NaN NaN NaN NaN NaN
# my 0.2744238 NaN 0.0000000 0.1512567 NaN NaN
# fellow 0.2491922 NaN 0.1367413 0.0000000 NaN NaN
# mount NaN NaN NaN NaN NaN NaN
# varied NaN NaN NaN NaN NaN NaN
by adding 1 in this example:
proxyC::dist(mt+1, margin = 2, method = "kullback")
# 6 x 6 sparse Matrix of class "dgTMatrix"
# president clinton my fellow mount varied
# president 0.00000000 0.05179165 0.15044772 0.12417176 0.07399315 0.07399315
# clinton 0.05577306 0.00000000 0.11326602 0.19190602 0.11477169 0.11477169
# my 0.13798244 0.12514448 0.00000000 0.06183972 0.08097584 0.08097584
# fellow 0.11914888 0.19339084 0.05878166 0.00000000 0.04389137 0.04389137
# mount 0.07024034 0.10683853 0.08494952 0.03862615 0.00000000 0.00000000
# varied 0.07024034 0.10683853 0.08494952 0.03862615 0.00000000 0.00000000
solution 2:
devtools::install_github("koheiw/proxyC#277bd43") # restart R session afterwards
proxyC::dist(mt, margin = 2, method = "kullback")
# 6 x 6 sparse Matrix of class "dgTMatrix"
# president clinton my fellow mount varied
# president 0.0000000 0.13545124 0.31089180 0.2680293 0 0
# clinton 0.1469467 0.00000000 0.02944576 0.2950641 0 0
# my 0.2744238 0.02831651 0.00000000 0.1512567 0 0
# fellow 0.2491922 0.29739439 0.13674135 0.0000000 0 0
# mount 0.0000000 0.00000000 0.00000000 0.0000000 0 0
# varied 0.0000000 0.00000000 0.00000000 0.0000000 0 0
@koheiw Which do you prefer? I think solution 1 makes most sense, but that would mean you can't compute the kullback distance on sparse matrices.
Kind reminder @koheiw :)
Noted. Thanks @rcannood.
@rcannood , sorry for long silence. I was traveling but back home now.
I think returning NA is the right thing to do, but it is petty that it makes kullback useless for sparse data. Adding a fix value is one way to avoid generating NA, but mt + 1
makes the object size very large... So how about adding an argument smoothing
to dist()
and add the value to each column vector in C++ before computing kullback? This would be useful for other distance measures.
I spent more time thinking how to solve this issue. Your solution 1 still makes sense, but the related question is what to do with vectors with no variance (or all zero) in cosine and correlation? Users should receive NA for cell for such rows or columns but NA makes matrix dense. For this reason, I am filling resulting matrix with NA only in the downstream function. In this efficiency oriented approach we should return zero for vectors with at least on zero for Kullback. In a more correct approach, we should add a use_na
argument to proxyC::simil()
and proxyC::dist()
and return NA instead of zero when use_na = TRUE
. This should apply to cosine and correlation too. What do you think, @rcannood?
I also thought about feasibility and usefulness of adding a smooth
argument. It requires a lot of effort especially for the linear functions (cosine, correlation, and Euclidean) despite that the they rarely need smoothing. It is easier to add smoothing to the pair-wise functions but it seems that we need smoothing only for Kullback. I think don't think it is a good idea.
@koheiw I was just thinking, isn't the problem here that the method actually isn't just the Kullback-Leibler divergence?
In Kullback P & Q are probability distributions, so the code currently also contains the preparatory step of transforming (sparse) vectors to probability distributions (P / sum(P)). The problem is that this transformation is incorrect for sparse data without some form of smoothing, because probabilities can't be zero.
In that sense, it would not be weird if a smooth argument only applies for similarity/distance measures that are based on probability distributions, so it naturally excludes cosine, correlation, etc.
Good point @kasperwelbers. If it makes sense to add smoothing argument that works only with Kullback, I am happy to write the code.
Just a thought regarding the implementation @koheiw. You could write it directly into the Kullback function, but you could also make a separate function for transforming the input vectors to probability distributions. The advantage would be that this clearly distinguishes the actual Kullback method from the vector-to-probability-distribution transformation. You could then include this function as preparation in all probability based similarity/distance methods.
You mentioned now only having Kullback, but you might at some point want to implement others (e.g., Jensen-Shannon). This paper also describes a pretty cool language modelling approach to information retrieval where smoothing is central, though I haven't had time to properly test this with document similarity.
Off course, I don't mean to put stuff on you plate. I just wanted to add some for for thought on the question whether smoothing is useful in the dist function.
Kullback distance cannot be less than 1, but it appears as negative in the example below in a great number of cases.
(from https://github.com/quanteda/quanteda/issues/1693)