koheiw / proxyC

R package for large-scale similarity/distance computation
GNU General Public License v3.0
29 stars 6 forks source link

Adds the diag argument #13

Closed koheiw closed 3 years ago

koheiw commented 3 years ago

This PR adds the diag argument to simil() and dist(). When it is TRUE, they compute only similarity/distance between corresponding rows and columns in x and y. This saves a lot of time in some cases.

codecov[bot] commented 3 years ago

Codecov Report

Merging #13 (b872aae) into master (cccb6cd) will increase coverage by 0.09%. The diff coverage is 100.00%.

:exclamation: Current head b872aae differs from pull request most recent head ff8db16. Consider uploading reports for the commit ff8db16 to get more accurate results Impacted file tree graph

@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
+ Coverage   98.69%   98.79%   +0.09%     
==========================================
  Files           4        4              
  Lines         307      332      +25     
==========================================
+ Hits          303      328      +25     
  Misses          4        4              
Impacted Files Coverage Δ
R/proxy.R 98.97% <100.00%> (+0.05%) :arrow_up:
src/pair.cpp 98.40% <100.00%> (+0.30%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update cccb6cd...ff8db16. Read the comment docs.

rcannood commented 3 years ago

I'll take a look on Monday, would that be okay?

(Hooray for github actions!)

koheiw commented 3 years ago

Thanks. Looking forward too your comments.

rcannood commented 3 years ago

Looks good!

I guess that if performance is still an issue, you could create a struct proxy_diag : public Worker, which would remove a lot of code related to working with tuples & matrices, but it would make maintaining the package more difficult -- that is, making sure that the proxy_diag and proxy_pair offer the same functionality.

rcannood commented 3 years ago

... I don't know why R-CMD-check/windows-latest is failing. I only changed NEWS.md and .Rbuildignore. I'll trigger a rebuild to see if the problem persists...

rcannood commented 3 years ago

I just ran a devtools::check_win_devel() and also got the following output:

D:/Compiler/rtools40/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.3.0/../../../../i686-w64-mingw32/bin/ld.exe: cannot find -ltbb
D:/Compiler/rtools40/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.3.0/../../../../i686-w64-mingw32/bin/ld.exe: cannot find -ltbbmalloc
collect2.exe: error: ld returned 1 exit status
no DLL was created
ERROR: compilation failed for package 'proxyC'
* removing 'd:/RCompile/CRANguest/R-devel/lib/proxyC'

I see that RcppParallel was updated yesterday, maybe this is the reason?

koheiw commented 3 years ago

You are probably right that checks are failing on Windows (it still builds) because of the RcppParallel changes. The same error happen on my Windows after update to v5.1.1 from the source. I still do not to how to fix so hoping that Windows binary on CRAN will solve the issue...

koheiw commented 3 years ago

Regarding performance, proxy_pair is much slower than proxy_linear but the small number of pairs when diag = TRUE seems outweights the disadvantage. Let's see if we (or users) want to make it even faster.

> mat1 <- rsparsematrix(1000, 1000, 0.01, rand.x = sample.int)
> mat2 <- rsparsematrix(1000, 1000, 0.01, rand.x = sample.int)
> microbenchmark::microbenchmark(
+     diag(simil(mat1, mat2)),
+     diag(simil(mat1, mat2, diag = TRUE))
+ )
Unit: milliseconds
                                 expr        min        lq       mean     median         uq       max neval
              diag(simil(mat1, mat2)) 107.108599 117.91593 151.811298 127.999289 185.834339 325.83526   100
 diag(simil(mat1, mat2, diag = TRUE))   1.865122   2.26054   3.052994   2.424413   2.824056  13.87824   100
koheiw commented 3 years ago

A bug on RcppParallel on Windows has been confirmed, so let me merge.

https://github.com/RcppCore/RcppParallel/issues/118