koheiw / proxyC

R package for large-scale similarity/distance computation
GNU General Public License v3.0

Add an argument for smoothing #8

Closed kbenoit closed 3 years ago

kbenoit commented 5 years ago

Kullback distance cannot be negative, but it appears as negative in a great number of cases in the example below.

library("quanteda")
## Package version: 1.4.3
mt <- dfm(corpus_subset(data_corpus_inaugural, Year > 2000))

distmat <- proxyC::dist(mt, margin = 2, method = "kullback")
sum(distmat < 0)
## [1] 816106
## [1] 619162
as.matrix(distmat[1:10, 1:2])
##               president     clinton
## president     0.0000000 -0.07015385
## clinton       0.3700902  0.00000000
## ,             0.1856503 -0.27850191
## distinguished 0.5500866 -0.03926101
## guests        0.5500866 -0.03926101
## and           0.1983758 -0.27039233
## my            0.2744238 -0.31517416
## fellow        0.2491922 -0.16132904
## citizens      0.1430075 -0.22045425
## the           0.2866488 -0.21170665

(from https://github.com/quanteda/quanteda/issues/1693)

rcannood commented 5 years ago

The main issue is that the kullback distance does not make a lot of sense when the data contains a lot of zeroes. Even one zero already results in NaNs and Infs:

P <- c(0, 1, 2, 3)
Q <- c(4, 3, 4, 5)
P <- P / sum(P)
Q <- Q / sum(Q)
sum(P * log(P / Q))
# [1] NaN
sum(Q * log(Q / P))
# [1] Inf

Sometimes this is solved by leaving out the entries where either value is zero, which is the approach that proxyC is using (see pair.cpp#48). The negative values are a result of the sums being calculated before the zero entries are removed:

P <- c(0, 1, 2, 3)
Q <- c(4, 3, 4, 5)
P <- P / sum(P)
Q <- Q / sum(Q)
ix <- P != 0 & Q != 0
P <- P[ix]
Q <- Q[ix]
sum(P * log(P / Q))
# [1] 0.3112653
sum(Q * log(Q / P))
# [1] -0.1967123

I've added a test (test-dist.R#L54) to verify that at least no negative values are being generated.

Solution 1: return NaN when one of the features has at least one zero

I think it makes most sense to simply return NaN when at least one of the two vectors contains a zero. This would make the results of proxyC resemble those of proxy the most. If you would then want to compute the Kullback distance on the given dataset, I would assume I've seen each word at least once and compute distmat <- proxyC::dist(mt + 1, margin = 2, method = "kullback") instead (which, agreed, would make mt + 1 non-sparse). See implementation: 2a90bd6.
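For illustration, a minimal R sketch of this behaviour (the real change is in the C++ code of commit 2a90bd6; the name kl_nan and the one-directional D(P || Q) form used in the snippets above are just for this example):

kl_nan <- function(p, q) {
  # return NaN as soon as either vector contains a zero, as proxy does
  if (any(p == 0) || any(q == 0)) return(NaN)
  p <- p / sum(p)
  q <- q / sum(q)
  sum(p * log(p / q))
}

kl_nan(c(0, 1, 2, 3), c(4, 3, 4, 5))
# [1] NaN
kl_nan(c(1, 1, 2, 3), c(4, 3, 4, 5))
# [1] 0.05472463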

Solution 2: calculate the sum after having filtered the values

Calculating the sum only after the values have been filtered solves this specific problem, though the accuracy of the result could be questioned. See implementation: 277bd43.
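Again only as an illustration (the real change is in the C++ code of commit 277bd43; kl_filtered is just a name for this sketch):

kl_filtered <- function(p, q) {
  ix <- p != 0 & q != 0
  p <- p[ix] / sum(p[ix])   # sums taken only after filtering out the zeros
  q <- q[ix] / sum(q[ix])
  sum(p * log(p / q))
}

kl_filtered(c(0, 1, 2, 3), c(4, 3, 4, 5))
# [1] 0.02358326
kl_filtered(c(4, 3, 4, 5), c(0, 1, 2, 3))
# [1] 0.02539896

Because both filtered vectors are renormalised on the same support, the result is never negative.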

Different solutions compared on the inaugural corpus

Current proxyC:

library("quanteda")
mt <- dfm(corpus_subset(data_corpus_inaugural, Year > 2000))
mt <- mt[,c("president", "clinton", "my", "fellow", "mount", "varied")]
mt
# Document-feature matrix of: 5 documents, 6 features (33.3% sparse).
# 5 x 6 sparse Matrix of class "dfm"
#             features
# docs         president clinton my fellow mount varied
#   2001-Bush          3       2  3      1     0      0
#   2005-Bush          4       1  2      3     1      1
#   2009-Obama         1       0  2      1     0      0
#   2013-Obama         2       0  3      3     0      0
#   2017-Trump         5       1  1      1     0      0
proxyC::dist(mt, margin = 2, method = "kullback")
# 6 x 6 sparse Matrix of class "dgTMatrix"
#           president     clinton        my    fellow      mount     varied
# president 0.0000000 -0.07015385 0.3108918 0.2680293 -0.3524682 -0.3524682
# clinton   0.3700902  0.00000000 0.6355816 0.8828507 -0.3465736 -0.3465736
# my        0.2744238 -0.31517416 0.0000000 0.1512567 -0.3099542 -0.3099542
# fellow    0.2491922 -0.16132904 0.1367413 0.0000000 -0.3662041 -0.3662041
# mount     1.3217558  1.38629436 1.7047481 1.0986123  0.0000000  0.0000000
# varied    1.3217558  1.38629436 1.7047481 1.0986123  0.0000000  0.0000000

proxy, or proxyC with solution 1 (2a90bd6):

devtools::install_github("koheiw/proxyC#2a90bd6") # restart R session afterwards
proxyC::dist(mt, margin = 2, method = "kullback")
# 6 x 6 sparse Matrix of class "dgTMatrix"
#           president clinton        my    fellow mount varied
# president 0.0000000     NaN 0.3108918 0.2680293   NaN    NaN
# clinton         NaN     NaN       NaN       NaN   NaN    NaN
# my        0.2744238     NaN 0.0000000 0.1512567   NaN    NaN
# fellow    0.2491922     NaN 0.1367413 0.0000000   NaN    NaN
# mount           NaN     NaN       NaN       NaN   NaN    NaN
# varied          NaN     NaN       NaN       NaN   NaN    NaN

By adding 1 in this example:

proxyC::dist(mt+1, margin = 2, method = "kullback")
# 6 x 6 sparse Matrix of class "dgTMatrix"
#            president    clinton         my     fellow      mount     varied
# president 0.00000000 0.05179165 0.15044772 0.12417176 0.07399315 0.07399315
# clinton   0.05577306 0.00000000 0.11326602 0.19190602 0.11477169 0.11477169
# my        0.13798244 0.12514448 0.00000000 0.06183972 0.08097584 0.08097584
# fellow    0.11914888 0.19339084 0.05878166 0.00000000 0.04389137 0.04389137
# mount     0.07024034 0.10683853 0.08494952 0.03862615 0.00000000 0.00000000
# varied    0.07024034 0.10683853 0.08494952 0.03862615 0.00000000 0.00000000

Solution 2 (277bd43):

devtools::install_github("koheiw/proxyC#277bd43") # restart R session afterwards
proxyC::dist(mt, margin = 2, method = "kullback")
# 6 x 6 sparse Matrix of class "dgTMatrix"
#           president    clinton         my    fellow mount varied
# president 0.0000000 0.13545124 0.31089180 0.2680293     0      0
# clinton   0.1469467 0.00000000 0.02944576 0.2950641     0      0
# my        0.2744238 0.02831651 0.00000000 0.1512567     0      0
# fellow    0.2491922 0.29739439 0.13674135 0.0000000     0      0
# mount     0.0000000 0.00000000 0.00000000 0.0000000     0      0
# varied    0.0000000 0.00000000 0.00000000 0.0000000     0      0

Which solution is better?

@koheiw Which do you prefer? I think solution 1 makes most sense, but that would mean you can't compute the kullback distance on sparse matrices.

rcannood commented 4 years ago

Kind reminder @koheiw :)

koheiw commented 4 years ago

Noted. Thanks @rcannood.

koheiw commented 4 years ago

@rcannood, sorry for the long silence. I was traveling but am back home now.

I think returning NA is the right thing to do, but it is a pity that it makes kullback useless for sparse data. Adding a fixed value is one way to avoid generating NA, but mt + 1 makes the object size very large... So how about adding a smoothing argument to dist() and adding the value to each column vector in C++ before computing kullback? This would be useful for other distance measures too.
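In user-level R terms, the idea would be roughly the following (the argument name and the helper below are only a sketch; the actual addition would happen per column vector in C++ so the sparse input is never densified):

# sketch of what a smoothing argument would do to each pair of column vectors
kl_smoothed <- function(p, q, smoothing = 1) {
  p <- (p + smoothing) / sum(p + smoothing)
  q <- (q + smoothing) / sum(q + smoothing)
  sum(p * log(p / q))
}

# proxyC::dist(mt, margin = 2, method = "kullback", smoothing = 1)   # proposed, not implemented
# would then behave like proxyC::dist(mt + 1, ...) without building mt + 1 in R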

koheiw commented 4 years ago

I spent more time thinking about how to solve this issue. Your solution 1 still makes sense, but the related question is what to do with vectors with no variance (or all zeros) in cosine and correlation. Users should receive NA in the cells for such rows or columns, but NA makes the matrix dense. For this reason, I am filling the resulting matrix with NA only in the downstream function. In this efficiency-oriented approach, we should return zero for vectors with at least one zero for Kullback. In a more correct approach, we should add a use_na argument to proxyC::simil() and proxyC::dist() and return NA instead of zero when use_na = TRUE. This should apply to cosine and correlation too. What do you think, @rcannood?
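As a rough user-level illustration of the difference, using the mt from the examples above (use_na is only a proposal here, and the post-processing below is not part of proxyC):

d <- as.matrix(proxyC::dist(mt, margin = 2, method = "kullback"))
zero_cols <- Matrix::colSums(mt > 0) < nrow(mt)   # columns containing at least one zero
d[zero_cols, ] <- NA   # what use_na = TRUE would expose ...
d[, zero_cols] <- NA   # ... instead of the sparse-friendly zeros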

I also thought about the feasibility and usefulness of adding a smooth argument. It requires a lot of effort, especially for the linear functions (cosine, correlation, and Euclidean), even though they rarely need smoothing. It is easier to add smoothing to the pair-wise functions, but it seems that we need smoothing only for Kullback. I don't think it is a good idea.

kasperwelbers commented 4 years ago

@koheiw I was just thinking, isn't the problem here that the method actually isn't just the Kullback-Leibler divergence?

In Kullback P & Q are probability distributions, so the code currently also contains the preparatory step of transforming (sparse) vectors to probability distributions (P / sum(P)). The problem is that this transformation is incorrect for sparse data without some form of smoothing, because probabilities can't be zero.

In that sense, it would not be weird if a smooth argument only applies for similarity/distance measures that are based on probability distributions, so it naturally excludes cosine, correlation, etc.

koheiw commented 4 years ago

Good point @kasperwelbers. If it makes sense to add a smoothing argument that works only with Kullback, I am happy to write the code.

kasperwelbers commented 4 years ago

Just a thought regarding the implementation @koheiw. You could write it directly into the Kullback function, but you could also make a separate function for transforming the input vectors into probability distributions. The advantage would be that this clearly distinguishes the actual Kullback method from the vector-to-probability-distribution transformation. You could then include this function as a preparation step in all probability-based similarity/distance methods.
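As a sketch of what such a shared step could look like (to_prob and its smooth argument are only illustrative names):

# transform a count vector into a probability distribution, with optional
# additive smoothing so that no probability is exactly zero
to_prob <- function(x, smooth = 0) {
  x <- x + smooth
  x / sum(x)
}

to_prob(c(0, 1, 2, 3))              # still contains a zero probability
# [1] 0.0000000 0.1666667 0.3333333 0.5000000
to_prob(c(0, 1, 2, 3), smooth = 1)  # strictly positive
# [1] 0.1 0.2 0.3 0.4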

You mentioned that you currently only have Kullback, but you might at some point want to implement others (e.g., Jensen-Shannon). This paper also describes a pretty cool language modelling approach to information retrieval where smoothing is central, though I haven't had time to properly test this with document similarity.

Of course, I don't mean to put stuff on your plate. I just wanted to add some food for thought on the question of whether smoothing is useful in the dist function.