quanteda / quanteda

An R package for the Quantitative Analysis of Textual Data
https://quanteda.io
GNU General Public License v3.0

textstat_simil(proxy, method = "faith") should return 1.0 for similarity of a document with itself #1675

Closed. kbenoit closed this issue 5 years ago.

kbenoit commented 5 years ago

All other measures of similarity return 1.0 for the similarity of a vector with itself. The exception is "Faith", which appears to penalize similarity as identical vectors grow in length. I suspect this is an error in the formula, unless I have misunderstood Faith, D. P. (1983). "Asymmetric binary similarity measures." Oecologia, 57(3), 287-290.

Faith (1983) states that the formula is (a + d/2) / n, where a is the number of (TRUE, TRUE) matches and d is the number of (FALSE, FALSE) matches. For a vector compared with itself, a = n and d = 0, which should evaluate to 1.0. But this is not what happens.

library("quanteda")
## Package version: 1.4.4

set.seed(100)
x <- rpois(1000, lambda = 3)
dfmat <- as.dfm(matrix(c(x, x), nrow = 2, byrow = TRUE))
textstat_simil(dfmat, method = "faith")
##       x     y similarity
## 1 text2 text1      0.972
## 2 text1 text1      0.972
## 3 text1 text2      0.972
## 4 text2 text2      0.972

dfmat <- as.dfm(matrix(c(x[1:300], x[1:300]), nrow = 2, byrow = TRUE))
textstat_simil(dfmat, method = "faith")
##       x     y similarity
## 1 text2 text1      0.975
## 2 text1 text1      0.975
## 3 text1 text2      0.975
## 4 text2 text2      0.975
koheiw commented 5 years ago

My understanding is that the Faith similarity is (a + d/2) / N, where

"a" equals the number of shared presences, "d" equals the number of shared absences, and "N" is the total number of features (the "characters" of the ecology literature).

So the diagonal entries become smaller than 1 when documents have zeros.
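
A quick hand computation under this reading (counting a, d, and N over all features of the dfm) should reproduce the 0.972 diagonal from the first example:

set.seed(100)
x <- rpois(1000, lambda = 3)
# for a document compared with itself there are no mismatches, so:
# a = number of features with a nonzero count (shared presences)
# d = number of features with a zero count (shared absences)
# N = total number of features
a <- sum(x > 0)
d <- sum(x == 0)
(a + d / 2) / length(x)
## should match the 0.972 diagonal shown above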

kbenoit commented 5 years ago

Then it comes down to the definition of the eligible feature set for comparison. I don't think that this should happen:

txt <- c(
  "a c",
  "a c",
  "a b c"
)
textstat_simil(dfm(txt[1:2]), method = "faith")
##       x     y similarity
## 1 text2 text1          1
## 2 text1 text1          1
## 3 text1 text2          1
## 4 text2 text2          1
textstat_simil(dfm(txt[1:3]), method = "faith")
##       x     y similarity
## 1 text2 text1  0.8333333
## 2 text3 text1  0.6666667
## 3 text3 text2  0.6666667
## 4 text1 text1  0.8333333
## 5 text1 text2  0.8333333
## 6 text2 text2  0.8333333
## 7 text1 text3  0.6666667
## 8 text2 text3  0.6666667
## 9 text3 text3  1.0000000

where the addition of a third text changes the similarity score of the first two texts. There is no theoretical bound on the "absence pairs" contributed by additional features, since we could always add more features that are absent from both documents. And the addition of a third text should not affect the similarity of the first two.

That is my reading of Faith (1983). So for text1 and text2 in the second example above, I would count a = 2, d = 0, and exclude the features that do not occur in either document being compared. But we should read the original more carefully. At the least, we should make excluding absent features an option.
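
A minimal sketch of that counting, using a hypothetical faith_pairwise() helper (not a quanteda function) that restricts each comparison to features present in at least one of the two documents:

library("quanteda")

faith_pairwise <- function(dfmat, i, j) {
  m <- as.matrix(dfmat) > 0       # binary presence/absence matrix
  x <- m[i, ]
  y <- m[j, ]
  keep <- x | y                   # drop features absent from both documents
  a <- sum(x[keep] & y[keep])     # shared presences
  d <- sum(!x[keep] & !y[keep])   # shared absences (always 0 by construction)
  (a + d / 2) / sum(keep)
}

txt <- c("a c", "a c", "a b c")
dfmat <- dfm(txt)
faith_pairwise(dfmat, 1, 2)  # a = 2, eligible features = 2, so 1
faith_pairwise(dfmat, 1, 3)  # a = 2, eligible features = 3, so 2/3

Note that once jointly absent features are excluded, d is always zero, so this counting reduces to the Jaccard coefficient.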

kbenoit commented 5 years ago

I just re-read Faith (1983). That paper's concern with "asymmetry" is motivated by treating shared absences differently from shared presences of a feature, but it warns that absences may be indeterminate if the absence is not of a meaningful characteristic. The paper comes from ecology, where ecosystems are compared on the basis of the presence or absence of species, but an ecologist would not consider it meaningful that the shared absence of lions in two New England bogs should reduce their similarity. For a similarly applicable definition for text, we would have to maintain that every feature in the matrix is an "eligible" feature for every pairwise comparison. I don't think we can reasonably maintain this.

I also think we should avoid asymmetric measures generally, since they violate two principles:

- the similarity of a document with itself should be 1.0; and
- the similarity of two documents should not depend on features that occur only in other documents.

(Other examples of asymmetric measures are Russel/Rao and Kulczynski, but we do not use these. They are all in proxy, however, for dense matrices.)

Solution: Drop the faith measure. I doubt anyone has ever used this for textual similarity. I think the only reason we added it was that it was in proxy::simil() and easy to code. If people want similarity based on binary feature comparisons, they can use Jaccard, Dice, Hamman, or simple matching, which have neither of the two problems identified above.
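
For example, on the three-text comparison above, Jaccard should keep every self-similarity at 1.0 and leave the text1-text2 value unchanged when text3 is added, since it ignores features absent from both documents:

txt <- c("a c", "a c", "a b c")
textstat_simil(dfm(txt[1:2]), method = "jaccard")
textstat_simil(dfm(txt[1:3]), method = "jaccard")
## text1-text2 should be 1 in both runs, and all self-similarities equal 1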