Open andreanini opened 1 year ago
Dear Kohei, Thank you very much for this and for all the work you are doing for quanteda. Really amazing! I saw the vignette but I suppose I do not have enough sophistication with the notation to understand how that formula (which looks like the Tanimoto coefficient?) relates to those other variants of Jaccard. Min-max or Ruzicka is:
$\frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$
I suppose I was hoping for more clarifications on the mathematics rather than on the code implementation.
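For reference, here are the two coefficients side by side, assuming the vignette's eJaccard is the usual extended Jaccard (Tanimoto) form. The toy vectors x = (1, 2) and y = (2, 1) are chosen purely for illustration and do not come from the vignette.

$\text{eJaccard}(x, y) = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i} = \frac{4}{5 + 5 - 4} \approx 0.667$
$\text{min-max}(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)} = \frac{1 + 1}{2 + 2} = 0.5$

On binary vectors the two agree, because $x_i y_i = \min(x_i, y_i)$ and $x_i^2 = x_i$, so both reduce to the plain Jaccard index. They generally differ once the values are counts or weights.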
This package was originally created to replicate the proxy package for text analysis (for textstat_simil()). I can find the above formula as "fjaccard" in their vignette. I have not implemented it, but I could.
https://cran.r-project.org/web/packages/proxy/vignettes/overview.pdf
oh yes, of course. They list both of them so it makes sense that they are distinct coefficients. Sorry about that. At the moment I'm using proxy to run the "fjaccard" but, of course, I'd love to use textstat_simil() instead. Much faster!
Dear Kohei,
I forked your repo as I believed I could easily add this myself and then send a pull request. I think I made the change, but I'm not an expert in C++ and I'm stuck at loading the R package to test it. It seems a library is missing or not on the right path.
The change I made is I added the following function to pair.cpp
double simil_fjaccard(colvec& col_i, colvec& col_j) {
    // element-wise min/max of the two columns; join_cols() would stack them into one long column
    return arma::accu(arma::min(col_i, col_j)) / arma::accu(arma::max(col_i, col_j));
}
and then of course added "fjaccard" as an option in the similarity functions. If the code above is correct and the change is quite small, would you mind adding it yourself? I'm not sure how long it would take for me to figure out what's wrong with my path.
This similarity measure is very important in stylometry and I am developing a package for stylometry which is dependent on quanteda (https://github.com/andreanini/idiolect) so adding this to proxyC and/or quanteda would actually help lots of future users of my package and of quanteda.
I am developing the fuzzy Jaccard measure in issue-42, and found a disagreement between proxy::simil and proxy::dist. Only 1 - proxy::dist looks correct. Which function are you using?
v1 <- c(0.1, 0.2, 0.3, 0.9)
v2 <- c(0.3, 0.1, 0.2, 0.4)
sum(pmin(v1, v2)) / sum(pmax(v1, v2))
#> [1] 0.4705882
proxyC::simil(v1, v2, method = "fjaccard", margin = 2)
#> 1 x 1 sparse Matrix of class "dgTMatrix"
#>
#> [1,] 0.4705882
proxy::simil(v1, v2, method = "fjaccard", by_rows = FALSE)
#> [,1]
#> [1,] 0.6538462
1 - proxy::dist(v1, v2, method = "fjaccard", by_rows = FALSE)
#> [,1]
#> [1,] 0.4705882
I use proxy::dist, as proxy lists the fuzzy Jaccard coefficient among the distances. I have previously found issues in the way proxy transforms similarities to distances. For example, proxy transformed the cosine similarity to distance by doing 1 - similarity, which is incorrect. This has now been fixed after I reported it. It could be that there is a similar bug here. Thanks for your help with this.
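One observation on the numbers in the example above, offered as a guess rather than a confirmed description of proxy's internals: the value returned by proxy::simil coincides with applying $1/(1+d)$ to the fuzzy Jaccard distance $d$.

$s = \frac{0.8}{1.7} = \frac{8}{17} \approx 0.4705882, \qquad d = 1 - s = \frac{9}{17}$
$\frac{1}{1 + d} = \frac{17}{26} \approx 0.6538462$

If that is what happens, the expected similarity would instead be recovered with $1 - d$, which matches the hand calculation with pmin and pmax.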
I think the C code for proxy::simil() is wrong, but the R code for proxy::dist() is correct. I reported it to the maintainer. I don't understand why there are two sets of code.
yeah, this should be an easy coefficient to transform. Thanks again!
Is the eJaccard in this package equivalent to the min-max similarity (aka Ruzicka Distance aka fuzzy Jaccard)?