koheiw / proxyC

R package for large-scale similarity/distance computation
GNU General Public License v3.0
29 stars 6 forks source link

Return denseMatrix to save RAM #38

Closed koheiw closed 4 months ago

koheiw commented 1 year ago

Return denseMatrix (dgeMatrix/dspMatrix) when none of drop0, rank or min_simil is used to address #37.

codecov[bot] commented 1 year ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.49%. Comparing base (7395cf3) to head (c09b291). Report is 9 commits behind head on master.

:exclamation: Current head c09b291 differs from pull request most recent head 8c54b76. Consider uploading reports for the commit 8c54b76 to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #38 +/- ## ========================================== - Coverage 99.76% 99.49% -0.27% ========================================== Files 5 4 -1 Lines 427 399 -28 ========================================== - Hits 426 397 -29 - Misses 1 2 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

rcannood commented 1 year ago

Running a revdep_check indicates that these changes might break something in quanteda.textstats:

> revdepcheck::revdep_check(timeout = as.difftime(600, units = "mins"), num_workers = 30)
✔ applicable 0.1.0                       ── E: 0     | W: 0     | N: 0                                                             
✔ scClassify 1.6.0                       ── E: 1     | W: 0     | N: 0                                                             
✔ dynutils 1.0.10                        ── E: 0     | W: 0     | N: 0                                                             
✔ immcp 1.0.3                            ── E: 1     | W: 0     | N: 0                                                             
✔ LSX 1.1.1                              ── E: 0     | W: 0     | N: 1                                                             
✖ quanteda.textstats 0.96                ── E: 0  +1 | W: 0     | N: 1                                                             
✔ seededlda 0.8.1                        ── E: 0     | W: 0     | N: 2                                                             
✔ simplifyEnrichment 1.4.0               ── E: 0     | W: 0     | N: 3                                                             
OK: 7                                                                                                                            
BROKEN: 1
Total time: 5 min

Newly broken

rcannood commented 10 months ago

It seems this is still resulting in more or less the same error as before:

* checking tests ... ERROR
  Running ‘spelling.R’
  Running ‘testthat.R’
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  textstat_proxy(test_mt) inherits from `'dspMatrix'` not `'character'`.
  ── Failure ('test-textstat_proxy.R:271:5'): sparse objects are of expected class and occur when expected ──
  textstat_proxy(test_mt, method = "kullback") inherits from `'dgeMatrix'` not `'character'`.
  ── Failure ('test-textstat_simil.R:234:5'): textstat_simil coercion methods work with options ──
  nrow(as.data.frame(tstat, upper = FALSE, diag = TRUE)) not equal to (nrow(mt2)^2 - ndoc(mt2))/2 + ndoc(mt2).
  1/1 mismatches
  [1] 25 - 15 == 10
  ── Failure ('test-textstat_simil.R:254:5'): textstat_simil coercion methods work with options ──
  nrow(as.data.frame(tstat, upper = FALSE, diag = FALSE)) not equal to (nrow(mt2)^2 - ndoc(mt2))/2.
  1/1 mismatches
  [1] 20 - 10 == 10

  [ FAIL 4 | WARN 0 | SKIP 2 | PASS 478 ]
  Error: Test failures
  Execution halted
* DONE
Status: 1 ERROR, 1 NOTE

However, maybe this error simply means that there is a test in quanteda.textstats that needs to be updated? What do you think?

koheiw commented 10 months ago

@rcannood thanks for checking. I left this branch for while because automatically changing output format might might break make users code. How about adding sparse = TRUE as an argument to allow users get a dense matrix by setting it FALSE? It is akin to xtabs(sparse = FALSE).

When sparse = TRUE and none of drop0, rank or min_simil is used, we can warn users that sparse = FALSE is more efficient.

rcannood commented 10 months ago

That's a good idea ! :+1: