Presubmission Inquiry: Zoomerjoin

beniaminogreen commented 1 year ago

Submitting Author Name: Beniamino Green
Submitting Author Github Handle: @beniaminogreen
Repository: https://github.com/beniaminogreen/zoomerjoin
Submission type: Pre-submission
Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: zoomerjoin
Title: Insanely-Fast Fuzzy Joins
Version: 0.0.0.9000
Authors@R:
    person("Beniamino", "Green", , "beniamino.green@yale.edu", role = c("aut", "cre")
           )
Description: Zoomerjoin empowers users to fuzzily-merge dataframes with
millions or tens of millions of rows in minutes with minimal memory usage.  The
package uses the MinHash algorithm invented by Broder (1997)
<doi:10.1109/SEQUEN.1997.666900> to avoid having to compare every pair of
records in each data set, resulting in fuzzy-merges that finish in linear
time.  As a secondary feature, the package also wraps the rust kdtree crate
to provide Euclidean-distance joins that finish in linearithmic time.
License: GPL (>= 3)
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
SystemRequirements: Cargo (>= 1.56) (Rust's package manager), rustc
Imports:
    dplyr,
    tibble
Suggests:
    babynames,
    fuzzyjoin,
    stringdist,
    knitr,
    rmarkdown,
    testthat (>= 3.0.0),
    tidyverse,
    microbenchmark,
    arrow,
    covr
Config/testthat/edition: 3
URL: https://beniaminogreen.github.io/zoomerjoin/
BugReports: https://github.com/beniaminogreen/zoomerjoin/issues/
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):

Data Lifecycle Packages
- [ ] data retrieval
- [ ] data extraction
- [x] data munging
- [ ] data deposition
- [ ] data validation and testing
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
  
Statistical Packages
- [ ] Bayesian and Monte Carlo Routines
- [ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
- [ ] Machine Learning
- [ ] Regression and Supervised Learning
- [ ] Exploratory Data Analysis (EDA) and Summary Statistics
- [ ] Spatial Analyses
- [ ] Time Series Analyses

Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

Unsure of this:

Data munging: This package helps users join massive tables on a messy identifying field. This setting / data type is common, or even the modal case, when working with administrative data, text from OCR systems, or other social science data.

If submitting a statistical package, have you already incorporated documentation of standards into your code via the srr package?

Not a statistical package.

Who is the target audience and what are scientific applications of this package?

Researchers who work with large administrative, OCR, or genomics data, or data from other domains where transcription or recording errors are common. Users interested in large-scale probabilistic merges or more sophisticated fuzzy-merge methods can also use the Locality-Sensitive Hash (LSH) algorithms provided in this package as a preprocessing step that saves computation: the package identifies potential matches, and a classifier then makes the final match / non-match decision.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

The excellent and renowned fuzzyjoin provides many tidy fuzzy-merging capabilities, but these do not scale to large datasets because they compare all pairs of potential matches between two datasets. See here for a representative thread detailing using fuzzyjoin with a larger dataset, and here showing a benchmark comparison of fuzzyjoin and zoomerjoin for small datasets.

The superlative textreuse implements a similar Locality-Sensitive Hash, but it does not offer joining functionality, and its implementation is mostly in R, so it might not scale well to datasets with rows in the hundreds of millions.

I try to synthesize some of the functionality of both packages: the tidy, dplyr-style fuzzy joins provided by fuzzyjoin, backed by the same Locality-Sensitive Hashing algorithm used in textreuse. The core of the package is a multithreaded implementation written in Rust, which makes it suitable for datasets with hundreds of millions of observations.
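For readers unfamiliar with the technique, the following is a minimal Python sketch of a MinHash-based LSH join. It is an illustration under stated assumptions, not the package's Rust implementation or its API: the function names (`shingles`, `minhash_signature`, `lsh_inner_join_sketch`), the character 2-gram tokenization, the hashing scheme, and the default band settings are all hypothetical choices made for this example. The point it shows is that records are only compared when they share a hash bucket, so the join avoids scoring every pair.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Return the set of character k-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given integer seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, n_hashes):
    """MinHash signature: the minimum hash of the shingle set per seed."""
    grams = shingles(text)
    return tuple(
        min(_seeded_hash(seed, g) for g in grams) for seed in range(n_hashes)
    )


def _band_keys(signature, n_bands, band_width):
    """Split a signature into (band index, band contents) bucket keys."""
    for b in range(n_bands):
        yield b, signature[b * band_width:(b + 1) * band_width]


def lsh_inner_join_sketch(left, right, n_bands=20, band_width=2, threshold=0.6):
    """Return (left, right) string pairs whose Jaccard similarity >= threshold.

    Only pairs that collide in at least one band are ever compared, so some
    true matches can be missed; more bands make a miss less likely.
    """
    n_hashes = n_bands * band_width

    # Index the right-hand table by band.
    buckets = defaultdict(list)
    for j, value in enumerate(right):
        sig = minhash_signature(value, n_hashes)
        for key in _band_keys(sig, n_bands, band_width):
            buckets[key].append(j)

    # Probe with the left-hand table to collect candidate pairs.
    candidates = set()
    for i, value in enumerate(left):
        sig = minhash_signature(value, n_hashes)
        for key in _band_keys(sig, n_bands, band_width):
            for j in buckets.get(key, ()):
                candidates.add((i, j))

    # Verify candidates with the exact similarity.
    return [
        (left[i], right[j])
        for i, j in sorted(candidates)
        if jaccard(shingles(left[i]), shingles(right[j])) >= threshold
    ]
```

The final verification step is also where the preprocessing use case fits: instead of a fixed Jaccard threshold, the candidate pairs could be handed to a classifier that makes the match / non-match decision.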

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Not applicable - does not involve human subjects.

Any other questions or issues we should be aware of?

Thanks for taking the time to consider this pre-submission!

maurolepore commented 1 year ago

Thanks @beniaminogreen for this pre-submission. The package seems impressive. I'll run a few checks now and explore if it's in the scope of our current rOpenSci categories. I may need a few days to discuss with other editors.

maurolepore commented 1 year ago

@ropensci-review-bot check package

ropensci-review-bot commented 1 year ago

Thanks, about to send the query.

ropensci-review-bot commented 1 year ago

:rocket:

The following problems were found in your submission template:

- submission type must be one of [Standard, Estandar, Stats]
- HTML variable [editor] is missing
- HTML variable [reviewers-list] is missing
- HTML variable [due-dates-list] is missing

Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

:wave:

ropensci-review-bot commented 1 year ago

Checks for zoomerjoin (v0.0.0.9000)

git hash: 89991a51

:heavy_check_mark: Package name is available
:heavy_check_mark: has a 'codemeta.json' file.
:heavy_check_mark: has a 'contributing' file.
:heavy_multiplication_x: The following function has no documented return value: [lsh_string_group]
:heavy_check_mark: uses 'roxygen2'.
:heavy_check_mark: 'DESCRIPTION' has a URL field.
:heavy_check_mark: 'DESCRIPTION' has a BugReports field.
:heavy_check_mark: Package has at least one HTML vignette
:heavy_check_mark: All functions have examples.
:heavy_check_mark: Package has continuous integration checks.
:heavy_check_mark: Package coverage is 82%.
:heavy_check_mark: R CMD check found no errors.
:heavy_check_mark: R CMD check found no warnings.
:eyes: Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with :eyes: may be optionally addressed.)

Package License: GPL (>= 3)

1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

|type       |package        | ncalls|
|:----------|:--------------|------:|
|internal   |base           |     44|
|internal   |zoomerjoin     |     19|
|internal   |graphics       |      2|
|internal   |stats          |      1|
|imports    |dplyr          |      6|
|imports    |tibble         |     NA|
|suggests   |igraph         |      5|
|suggests   |arrow          |     NA|
|suggests   |babynames      |     NA|
|suggests   |covr           |     NA|
|suggests   |fuzzyjoin      |     NA|
|suggests   |knitr          |     NA|
|suggests   |microbenchmark |     NA|
|suggests   |rmarkdown      |     NA|
|suggests   |stringdist     |     NA|
|suggests   |testthat       |     NA|
|suggests   |tidyverse      |     NA|
|linking_to |NA             |     NA|

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats()', and examining the 'external_calls' table.

base

names (12), by (6), intersect (6), seq (5), gsub (4), nrow (4), sapply (2), c (1), expand.grid (1), mode (1), return (1), which.min (1)

zoomerjoin

rust_lsh_join (2), em_link (1), jaccard_similarity (1), kd_anti_join (1), kd_by_validate (1), kd_full_join (1), kd_inner_join (1), kd_join_core (1), kd_left_join (1), kd_right_join (1), lsh_anti_join (1), lsh_curve (1), lsh_full_join (1), lsh_hyper_grid_search (1), lsh_inner_join (1), lsh_join (1), lsh_left_join (1), rust_kd_join (1)

dplyr

pull (4), bind_cols (2)

igraph

as.undirected (1), fastgreedy.community (1), graph_from_edgelist (1), groups (1), membership (1)

graphics

pairs (2)

stats

df (1)

**NOTE:** Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.

2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in C (1% in 1 files), R (40% in 9 files), Rust (57% in 5 files) and TOML (3% in 1 files)
- 1 authors
- 2 vignettes
- no internal data file
- 2 imported packages
- 16 exported functions (median 6 lines of code)
- 34 non-exported functions in R (median 5 lines of code)
- 6 R functions (median 3 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages

The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html)

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |     9|       55.2|           |
|files_src                |     7|       91.1|           |
|files_vignettes          |     2|       85.7|           |
|files_tests              |     9|       89.6|           |
|loc_R                    |   316|       33.9|           |
|loc_src                  |   481|       46.0|           |
|loc_vignettes            |   130|       34.0|           |
|loc_tests                |   252|       60.3|           |
|num_vignettes            |     2|       89.2|           |
|n_fns_r                  |    50|       56.6|           |
|n_fns_r_exported         |    16|       60.6|           |
|n_fns_r_not_exported     |    34|       56.6|           |
|n_fns_src                |     6|       19.0|           |
|n_fns_per_file_r         |     3|       47.8|           |
|n_fns_per_file_src       |     3|       29.7|           |
|num_params_per_fn        |     4|       54.6|           |
|loc_per_fn_r             |     6|       11.7|           |
|loc_per_fn_r_exp         |     6|       10.5|           |
|loc_per_fn_r_not_exp     |     5|        9.7|           |
|loc_per_fn_src           |     3|        0.7|TRUE       |
|rel_whitespace_R         |    24|       44.6|           |
|rel_whitespace_src       |    25|       53.5|           |
|rel_whitespace_vignettes |    26|       25.8|           |
|rel_whitespace_tests     |    31|       66.4|           |
|doclines_per_fn_exp      |    34|       41.6|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |    21|       47.5|           |

---

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package

3. `goodpractice` and other checks

Details of goodpractice checks (click to open)

#### 3a. Continuous Integration Badges

(There do not appear to be any)

**GitHub Workflow Results**

|         id|name                       |conclusion |sha    | run_number|date       |
|----------:|:--------------------------|:----------|:------|----------:|:----------|
| 4792462189|pages build and deployment |success    |25abf4 |         59|2023-04-25 |
| 4792426799|pkgdown                    |success    |89991a |        109|2023-04-25 |
| 4792426798|R-CMD-check                |success    |89991a |        117|2023-04-25 |
| 4792426795|test-coverage              |success    |89991a |         65|2023-04-25 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following notes:

1. checking installed package size ... NOTE
   installed size is 15.5Mb
   sub-directories of 1Mb or more:
   libs 15.1Mb
2. checking R code for possible problems ... NOTE
   lsh_string_group: no visible global function definition for ‘installed.packages’
   Undefined global functions or variables: installed.packages
   Consider adding importFrom("utils", "installed.packages") to your NAMESPACE file.

R CMD check generated the following check_fails:

1. rcmdcheck_undefined_globals
2. rcmdcheck_reasonable_installed_size

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 82.02

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following function have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
lsh_join | 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 64 potential issues:

message | number of times
--- | ---
Avoid library() and require() calls in packages | 7
Avoid using sapply, consider vapply instead, that's type safe | 1
Lines should not be more than 80 characters. | 56

4. Other Checks

Details of other checks (click to open)

:heavy_multiplication_x: The following 2 function names are duplicated in other packages:

- `jaccard_similarity` from textreuse
- `lsh_probability` from textreuse

Package Versions

|package  |version  |
|:--------|:--------|
|pkgstats |0.1.3.4  |
|pkgcheck |0.1.1.23 |

Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

beniaminogreen commented 1 year ago

> Thanks @beniaminogreen for this pre-submission. The package seems impressive. I'll run a few checks now and explore if it's in the scope of our current rOpenSci categories. I may need a few days to discuss with other editors.

Thanks for having a look at the package!

maurolepore commented 1 year ago

Thanks @beniaminogreen for your patience.

I discussed this with the editorial board. Unfortunately, we believe this package is out of scope for our current categories. The "data-munging" category in particular fits packages that handle less structured data.

I'm sure this package will be super useful for many users (including myself) and I look forward to seeing it on CRAN.

Thanks again for sharing it with rOpenSci and please do think of us again next time you have a package that you think might fit in our scope.

All the best!

maurolepore commented 1 year ago

@ropensci-review-bot out of scope

beniaminogreen commented 1 year ago

> Thanks @beniaminogreen for your patience.
>
> I discussed this with the editorial board. Unfortunately, we believe this package is out of scope for our current categories. The "data-munging" category in particular fits packages that handle less structured data.
>
> I'm sure this package will be super useful for many users (including myself) and I look forward to seeing it on CRAN.
>
> Thanks again for sharing it with rOpenSci and please do think of us again next time you have a package that you think might fit in our scope.
>
> All the best!

No worries. Thanks for taking the time to look over the package, and for your kind words about it.

Best, Ben
