koheiw / proxyC

R package for large-scale similarity/distance computation
GNU General Public License v3.0

Add fjaccard #42

Open andreanini opened 1 year ago

andreanini commented 1 year ago

Is the eJaccard in this package equivalent to the min-max similarity (also known as Ruzicka similarity or fuzzy Jaccard)?

koheiw commented 1 year ago

Please see https://cran.rstudio.com/web/packages/proxyC/vignettes/measures.html

andreanini commented 1 year ago

Dear Kohei, thank you very much for this and for all the work you are doing for quanteda. Really amazing! I saw the vignette, but I do not think I am familiar enough with the notation to understand how that formula (which looks like the Tanimoto coefficient?) relates to the other variants of Jaccard. Min-max or Ruzicka is:

$\frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$

I suppose I was hoping for more clarification on the mathematics rather than on the code implementation.
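For comparison, here are the two coefficients side by side. This is only a sketch: it assumes eJaccard takes the extended Jaccard (Tanimoto) form, as the formula in the vignette appears to, and the vignette remains the authority on what proxyC actually computes.

$\text{eJaccard}(x, y) = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i}$

$\text{min-max}(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$

For binary vectors both reduce to the ordinary Jaccard index, because then $\min(x_i, y_i) = x_i y_i$ and $x_i^2 = x_i$. They differ on weighted vectors: with $x = (1, 2)$ and $y = (2, 1)$, min-max gives $(1 + 1)/(2 + 2) = 0.5$, while the Tanimoto form gives $4/(5 + 5 - 4) \approx 0.667$.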

koheiw commented 1 year ago

This package was originally created to replicate the proxy package for text analysis (for textstat_simil()). I can find the above formula as "fjaccard" in their vignette. I have not implemented it, but I could.

https://cran.r-project.org/web/packages/proxy/vignettes/overview.pdf

andreanini commented 1 year ago

Oh yes, of course. They list both of them, so it makes sense that they are distinct coefficients. Sorry about that. At the moment I'm using proxy to run "fjaccard" but, of course, I'd love to use textstat_simil() instead. Much faster!

andreanini commented 10 months ago

Dear Kohei,

I forked your repo because I believed I could easily add this myself and then send a pull request. I think I made the change, but I'm not a C++ expert and I'm stuck at loading the R package to test it. It seems a library is missing or not on the right path.

The change I made is I added the following function to pair.cpp

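double simil_fjaccard(colvec& col_i, colvec& col_j) {
    // Element-wise min/max of the two columns. join_cols() would stack them into
    // one long column, so min()/max() would each reduce to a single overall value.
    return arma::accu(arma::min(col_i, col_j)) / arma::accu(arma::max(col_i, col_j));
}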

and then of course added "fjaccard" as an option in the similarity functions. If the code above is correct and the change is quite small, would you mind adding it yourself? I'm not sure how long it would take for me to figure out what's wrong with my path.
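To check the arithmetic independently of Armadillo and the package build, the coefficient can be sketched with the standard library alone. Both function names below are hypothetical helpers written for illustration, not part of proxyC, and returning 0 for two all-zero vectors is an assumption. The second function shows what happens when the two vectors are stacked into one long vector before taking the minimum and maximum, which is what arma::join_cols() does with two column vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Min-max (fuzzy Jaccard) similarity: sum_i min(x_i, y_i) / sum_i max(x_i, y_i).
// Hypothetical standalone helper, not part of proxyC.
double fjaccard_ref(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        num += std::min(x[i], y[i]);  // element-wise minimum
        den += std::max(x[i], y[i]);  // element-wise maximum
    }
    return den > 0.0 ? num / den : 0.0;  // assumption: all-zero input yields 0
}

// Stacks both vectors into one and divides the single overall minimum by the
// single overall maximum. This is NOT the min-max similarity.
double stacked_min_over_max(const std::vector<double>& x, const std::vector<double>& y) {
    assert(!x.empty() || !y.empty());
    std::vector<double> all(x);
    all.insert(all.end(), y.begin(), y.end());
    return *std::min_element(all.begin(), all.end()) /
           *std::max_element(all.begin(), all.end());
}
```

For x = (0.1, 0.2, 0.3, 0.9) and y = (0.3, 0.1, 0.2, 0.4), fjaccard_ref() returns 0.8 / 1.7, about 0.4706, whereas stacked_min_over_max() returns 0.1 / 0.9, about 0.1111.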

This similarity measure is very important in stylometry and I am developing a package for stylometry which is dependent on quanteda (https://github.com/andreanini/idiolect) so adding this to proxyC and/or quanteda would actually help lots of future users of my package and of quanteda.

koheiw commented 10 months ago

I am developing the fuzzy Jaccard measure in issue-42, and found a disagreement between proxy::simil and proxy::dist. Only 1 - proxy::dist looks correct. Which function are you using?

v1 <- c(0.1, 0.2, 0.3, 0.9)
v2 <- c(0.3, 0.1, 0.2, 0.4)

sum(pmin(v1, v2)) / sum(pmax(v1, v2))
#> [1] 0.4705882

proxyC::simil(v1, v2, method = "fjaccard", margin = 2) 
#> 1 x 1 sparse Matrix of class "dgTMatrix"
#>               
#> [1,] 0.4705882

proxy::simil(v1, v2, method = "fjaccard", by_rows = FALSE)
#>      [,1]     
#> [1,] 0.6538462
1 - proxy::dist(v1, v2, method = "fjaccard", by_rows = FALSE)
#>      [,1]     
#> [1,] 0.4705882

andreanini commented 10 months ago

I use proxy::dist, as proxy lists the fuzzy Jaccard coefficient among the distances. I have previously found issues in the way proxy transforms similarities to distances. For example, proxy transformed the cosine similarity to a distance by doing 1 - similarity, which is incorrect. This has now been fixed after I reported it. It could be that there is a similar bug here. Thanks for your help with this.

koheiw commented 10 months ago

I think the C code for proxy::simil() is wrong, but the R code for proxy::dist() is correct. I reported it to the maintainer. I don't understand why there are two sets of code.

andreanini commented 10 months ago

Yeah, this should be an easy coefficient to transform. Thanks again!