ropensci / software-review

rOpenSci Software Peer Review.

Presubmission Inquiry: Zoomerjoin #588

Closed beniaminogreen closed 1 year ago

beniaminogreen commented 1 year ago

Submitting Author Name: Beniamino Green
Submitting Author Github Handle: @beniaminogreen
Repository: https://github.com/beniaminogreen/zoomerjoin
Submission type: Pre-submission
Language: en


Package: zoomerjoin
Title: Insanely-Fast Fuzzy Joins
Version: 0.0.0.9000
Authors@R:
    person("Beniamino", "Green", , "beniamino.green@yale.edu", role = c("aut", "cre")
           )
Description: Zoomerjoin empowers users to fuzzily-merge dataframes with
    millions or tens of millions of rows in minutes with minimal memory usage.  The
    package uses the MinHash algorithm invented by Broder (1997)
    <doi:10.1109/SEQUEN.1997.666900> to avoid having to compare every pair of
    records in each data set, resulting in fuzzy-merges that finish in linear
    time.  As a secondary feature, the package also wraps the Rust kdtree crate
    to provide Euclidean-distance joins that finish in linearithmic time.
License: GPL (>= 3)
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
SystemRequirements: Cargo (>= 1.56) (Rust's package manager), rustc
Imports:
    dplyr,
    tibble
Suggests:
    babynames,
    fuzzyjoin,
    stringdist,
    knitr,
    rmarkdown,
    testthat (>= 3.0.0),
    tidyverse,
    microbenchmark,
    arrow,
    covr
Config/testthat/edition: 3
URL: https://beniaminogreen.github.io/zoomerjoin/
BugReports: https://github.com/beniaminogreen/zoomerjoin/issues/
VignetteBuilder: knitr
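
For readers unfamiliar with MinHash, the idea named in the Description can be sketched in a few lines of Rust. This is a minimal illustration and not the package's actual implementation: the function names (`shingles`, `seeded_hash`, `minhash_signature`, `estimated_similarity`, `jaccard`), the use of character bigrams, and the choice of 200 hash functions are all assumptions made for the example. Each string is turned into a set of character n-grams, and the fraction of positions where two signatures agree approximates the Jaccard similarity of those sets.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Split a string into overlapping character n-grams ("shingles"). Assumes n >= 1.
fn shingles(s: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = s.to_lowercase().chars().collect();
    if chars.len() < n {
        // Strings shorter than n are treated as a single shingle.
        return std::iter::once(chars.iter().collect::<String>()).collect();
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Hash one shingle together with a seed; each seed acts as a different hash function.
fn seeded_hash(shingle: &str, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    shingle.hash(&mut h);
    h.finish()
}

/// MinHash signature: for each seed, keep the minimum hash over all shingles.
fn minhash_signature(set: &HashSet<String>, n_hashes: u64) -> Vec<u64> {
    (0..n_hashes)
        .map(|seed| {
            set.iter()
                .map(|s| seeded_hash(s, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Fraction of signature positions that agree; this estimates Jaccard similarity.
fn estimated_similarity(a: &[u64], b: &[u64]) -> f64 {
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}

/// Exact Jaccard similarity of two shingle sets, for comparison.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let inter = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    inter / union
}

fn main() {
    let a = shingles("Beniamino Green", 2);
    let b = shingles("Benjamino Green", 2);
    let sig_a = minhash_signature(&a, 200);
    let sig_b = minhash_signature(&b, 200);
    println!("exact Jaccard:    {:.3}", jaccard(&a, &b));
    println!("MinHash estimate: {:.3}", estimated_similarity(&sig_a, &sig_b));
}
```

The signatures have a fixed length regardless of how long the strings are, which is what makes it possible to index records by signature instead of comparing every pair directly.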

Scope

Unsure of this:

Data munging: This package helps users join massive tables on a messy identifying field. This setting is common, or even the modal case, when working with administrative data, text from OCR systems, or other social science data.

Not a statistical package.

Researchers who work with large administrative, OCR, or genomics data, or data from other domains where transcription or recording errors are common. Users interested in large-scale probabilistic merges or more sophisticated fuzzy-merge methods can also use the Locality-Sensitive Hashing (LSH) algorithms provided in this package as a preprocessing step: the package cheaply identifies potential matches, and a classifier then makes the final match / non-match decision.
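
The preprocessing use case above (LSH to find candidates, a classifier to decide) can be sketched as follows. This is a hedged illustration of the general banding technique, not zoomerjoin's code: `hash_with_seed`, `signature`, `candidate_pairs`, the bigram shingling, and the `n_bands` / `band_width` values are all hypothetical choices for the example. Each MinHash signature is cut into bands, records whose bands hash to the same bucket become candidate pairs, and only those pairs would be passed on to a downstream classifier.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeSet, HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Hash any hashable value together with a seed.
fn hash_with_seed<T: Hash + ?Sized>(value: &T, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    value.hash(&mut h);
    h.finish()
}

/// MinHash signature of a string over its lowercase character bigrams.
fn signature(s: &str, n_hashes: usize) -> Vec<u64> {
    let chars: Vec<char> = s.to_lowercase().chars().collect();
    let mut grams: HashSet<String> = chars
        .windows(2)
        .map(|w| w.iter().collect::<String>())
        .collect();
    if grams.is_empty() {
        // Strings with fewer than two characters become a single shingle.
        grams.insert(chars.iter().collect::<String>());
    }
    (0..n_hashes as u64)
        .map(|seed| {
            grams
                .iter()
                .map(|g| hash_with_seed(g.as_str(), seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Index pairs (i, j) where left[i] and right[j] share at least one band bucket.
/// Assumes n_bands >= 1 and band_width >= 1.
fn candidate_pairs(
    left: &[&str],
    right: &[&str],
    n_bands: usize,
    band_width: usize,
) -> BTreeSet<(usize, usize)> {
    let n_hashes = n_bands * band_width;
    // Bucket key: (band index, hash of that band's slice of the signature).
    let mut buckets: HashMap<(usize, u64), Vec<usize>> = HashMap::new();
    for (i, s) in left.iter().enumerate() {
        let sig = signature(s, n_hashes);
        for (b, band) in sig.chunks(band_width).enumerate() {
            buckets
                .entry((b, hash_with_seed(band, 0)))
                .or_default()
                .push(i);
        }
    }
    let mut pairs = BTreeSet::new();
    for (j, s) in right.iter().enumerate() {
        let sig = signature(s, n_hashes);
        for (b, band) in sig.chunks(band_width).enumerate() {
            if let Some(matches) = buckets.get(&(b, hash_with_seed(band, 0))) {
                for &i in matches {
                    pairs.insert((i, j));
                }
            }
        }
    }
    pairs
}

fn main() {
    let left = ["Beniamino Green", "Mauro Lepore", "Andrei Broder"];
    let right = ["beniamino green", "Mauro Lepor", "Someone Else"];
    let pairs = candidate_pairs(&left, &right, 20, 3);
    println!("all pairs: {}, candidate pairs: {}", left.len() * right.len(), pairs.len());
    for (i, j) in &pairs {
        // A real pipeline would hand these candidates to a classifier here.
        println!("candidate: {:?} <-> {:?}", left[*i], right[*j]);
    }
}
```

More bands make it more likely that moderately similar records collide at least once, while wider bands make collisions stricter, so the two parameters trade recall against the number of candidates the classifier has to score.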

The excellent and renowned fuzzyjoin provides many tidy fuzzy-merging capabilities, but these do not scale to large datasets because they compare all pairs of potential matches between the two datasets. See here for a representative thread detailing using fuzzyjoin with a larger dataset, and here showing a benchmark comparison of fuzzyjoin and zoomerjoin for small datasets.

The superlative textreuse is similar in that it implements a comparable Locality-Sensitive Hash, but it does not offer joining functionality, and its implementation is mostly in R, so it might not scale well to datasets with hundreds of millions of rows.

I try to synthesize some of the functionality of both packages: zoomerjoin provides tidy, dplyr-style fuzzy joins like those in fuzzyjoin, backed by a performant, multithreaded implementation of the same Locality-Sensitive Hashing algorithm used in textreuse. The core of the package is written in performant Rust, which makes the package suitable for datasets with hundreds of millions of observations.

Not applicable - does not involve human subjects.

Thanks for taking the time to consider this pre-submission!

maurolepore commented 1 year ago

Thanks @beniaminogreen for this pre-submission. The package seems impressive. I'll run a few checks now and explore if it's in the scope of our current rOpenSci categories. I may need a few days to discuss with other editors.

maurolepore commented 1 year ago

@ropensci-review-bot check package

ropensci-review-bot commented 1 year ago

Thanks, about to send the query.

ropensci-review-bot commented 1 year ago

:rocket:

The following problems were found in your submission template:

:wave:

ropensci-review-bot commented 1 year ago

Checks for zoomerjoin (v0.0.0.9000)

git hash: 89991a51

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with :eyes: may be optionally addressed.)

Package License: GPL (>= 3)


1. Package Dependencies

Details of Package Dependency Usage

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

|type       |package        | ncalls|
|:----------|:--------------|------:|
|internal   |base           |     44|
|internal   |zoomerjoin     |     19|
|internal   |graphics       |      2|
|internal   |stats          |      1|
|imports    |dplyr          |      6|
|imports    |tibble         |     NA|
|suggests   |igraph         |      5|
|suggests   |arrow          |     NA|
|suggests   |babynames      |     NA|
|suggests   |covr           |     NA|
|suggests   |fuzzyjoin      |     NA|
|suggests   |knitr          |     NA|
|suggests   |microbenchmark |     NA|
|suggests   |rmarkdown      |     NA|
|suggests   |stringdist     |     NA|
|suggests   |testthat       |     NA|
|suggests   |tidyverse      |     NA|
|linking_to |NA             |     NA|

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats()', and examining the 'external_calls' table.

base

names (12), by (6), intersect (6), seq (5), gsub (4), nrow (4), sapply (2), c (1), expand.grid (1), mode (1), return (1), which.min (1)

zoomerjoin

rust_lsh_join (2), em_link (1), jaccard_similarity (1), kd_anti_join (1), kd_by_validate (1), kd_full_join (1), kd_inner_join (1), kd_join_core (1), kd_left_join (1), kd_right_join (1), lsh_anti_join (1), lsh_curve (1), lsh_full_join (1), lsh_hyper_grid_search (1), lsh_inner_join (1), lsh_join (1), lsh_left_join (1), rust_kd_join (1)

dplyr

pull (4), bind_cols (2)

igraph

as.undirected (1), fastgreedy.community (1), graph_from_edgelist (1), groups (1), membership (1)

graphics

pairs (2)

stats

df (1)

**NOTE:** Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties

The package has:

- code in C (1% in 1 files), R (40% in 9 files), Rust (57% in 5 files) and TOML (3% in 1 files)
- 1 authors
- 2 vignettes
- no internal data file
- 2 imported packages
- 16 exported functions (median 6 lines of code)
- 34 non-exported functions in R (median 5 lines of code)
- 6 R functions (median 3 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages

The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html)

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |     9|       55.2|           |
|files_src                |     7|       91.1|           |
|files_vignettes          |     2|       85.7|           |
|files_tests              |     9|       89.6|           |
|loc_R                    |   316|       33.9|           |
|loc_src                  |   481|       46.0|           |
|loc_vignettes            |   130|       34.0|           |
|loc_tests                |   252|       60.3|           |
|num_vignettes            |     2|       89.2|           |
|n_fns_r                  |    50|       56.6|           |
|n_fns_r_exported         |    16|       60.6|           |
|n_fns_r_not_exported     |    34|       56.6|           |
|n_fns_src                |     6|       19.0|           |
|n_fns_per_file_r         |     3|       47.8|           |
|n_fns_per_file_src       |     3|       29.7|           |
|num_params_per_fn        |     4|       54.6|           |
|loc_per_fn_r             |     6|       11.7|           |
|loc_per_fn_r_exp         |     6|       10.5|           |
|loc_per_fn_r_not_exp     |     5|        9.7|           |
|loc_per_fn_src           |     3|        0.7|TRUE       |
|rel_whitespace_R         |    24|       44.6|           |
|rel_whitespace_src       |    25|       53.5|           |
|rel_whitespace_vignettes |    26|       25.8|           |
|rel_whitespace_tests     |    31|       66.4|           |
|doclines_per_fn_exp      |    34|       41.6|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |    21|       47.5|           |

---



3. goodpractice and other checks

Details of goodpractice checks

#### 3a. Continuous Integration Badges

(There do not appear to be any)

**GitHub Workflow Results**

|         id|name                       |conclusion |sha    | run_number|date       |
|----------:|:--------------------------|:----------|:------|----------:|:----------|
| 4792462189|pages build and deployment |success    |25abf4 |         59|2023-04-25 |
| 4792426799|pkgdown                    |success    |89991a |        109|2023-04-25 |
| 4792426798|R-CMD-check                |success    |89991a |        117|2023-04-25 |
| 4792426795|test-coverage              |success    |89991a |         65|2023-04-25 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following notes:

1. checking installed package size ... NOTE
    installed size is 15.5Mb
    sub-directories of 1Mb or more:
    libs 15.1Mb
2. checking R code for possible problems ... NOTE
    lsh_string_group: no visible global function definition for ‘installed.packages’
    Undefined global functions or variables:
    installed.packages
    Consider adding
    importFrom("utils", "installed.packages")
    to your NAMESPACE file.

R CMD check generated the following check_fails:

1. rcmdcheck_undefined_globals
2. rcmdcheck_reasonable_installed_size

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 82.02

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following function have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
lsh_join | 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 64 potential issues:

message | number of times
--- | ---
Avoid library() and require() calls in packages | 7
Avoid using sapply, consider vapply instead, that's type safe | 1
Lines should not be more than 80 characters. | 56


4. Other Checks

Details of other checks

:heavy_multiplication_x: The following 2 function names are duplicated in other packages:

- `jaccard_similarity` from textreuse
- `lsh_probability` from textreuse


Package Versions

|package  |version  |
|:--------|:--------|
|pkgstats |0.1.3.4  |
|pkgcheck |0.1.1.23 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

beniaminogreen commented 1 year ago

> Thanks @beniaminogreen for this pre-submission. The package seems impressive. I'll run a few checks now and explore if it's in the scope of our current rOpenSci categories. I may need a few days to discuss with other editors.

Thanks for having a look at the package!

maurolepore commented 1 year ago

Thanks @beniaminogreen for your patience.

I discussed with the editor's board. Unfortunately we believe this package is out of scope for our current categories. The "data-munging" category in particular fits packages that handle less structured data.

I'm sure this package will be super useful for many users (including myself) and I look forward to seeing it on CRAN.

Thanks again for sharing it with rOpenSci and please do think of us again next time you have a package that you think might fit in our scope.

All the best!

maurolepore commented 1 year ago

@ropensci-review-bot out of scope

beniaminogreen commented 1 year ago

> Thanks @beniaminogreen for your patience.
>
> I discussed with the editor's board. Unfortunately we believe this package is out of scope for our current categories. The "data-munging" category in particular fits packages that handle less structured data.
>
> I'm sure this package will be super useful for many users (including myself) and I look forward to seeing it on CRAN.
>
> Thanks again for sharing it with rOpenSci and please do think of us again next time you have a package that you think might fit in our scope.
>
> All the best!

No worries. Thanks for taking the time to look over the package, and for your kind words about it.

Best, Ben