openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
714 stars 38 forks source link

[REVIEW]: Zoomerjoin: Superlatively Fast Fuzzy-Joins #5693

Closed editorialbot closed 1 year ago

editorialbot commented 1 year ago

Submitting author: !--author-handle-->@beniaminogreen<!--end-author-handle-- (Beniamino Green) Repository: https://github.com/beniaminogreen/zoomerjoin Branch with paper.md (empty if default branch): joss Version: v0.1.0 Editor: !--editor-->@samhforbes<!--end-editor-- Reviewers: @cjbarrie, @wincowgerDEV Archive: 10.5281/zenodo.8370652

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b68ba4bfbce52e56585aea54a5bcba54"><img src="https://joss.theoj.org/papers/b68ba4bfbce52e56585aea54a5bcba54/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b68ba4bfbce52e56585aea54a5bcba54/status.svg)](https://joss.theoj.org/papers/b68ba4bfbce52e56585aea54a5bcba54)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@cjbarrie & @wincowgerDEV, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @samhforbes know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Checklists

📝 Checklist for @cjbarrie

📝 Checklist for @wincowgerDEV

editorialbot commented 1 year ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 1 year ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.21 s (211.6 files/s, 24419.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Markdown                         6            239              0            960
HTML                             1             12              2            755
R                               18            189            784            706
Rust                             6            125              9            479
JSON                             1              0              0            189
YAML                             6             35             12            183
TeX                              1              0              0            141
Rmd                              3             73            205             55
TOML                             1              2              0             19
C                                1              2              2              4
Bourne Shell                     1              1              8              2
-------------------------------------------------------------------------------
SUM:                            45            678           1022           3493
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 1 year ago

Wordcount for paper.md is 1491

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1109/sequen.1997.666900 is OK
- 10.1017/CBO9781139924801 is OK
- 10.1093/bioinformatics/btz354 is OK
- 10.1145/997817.997857 is OK
- 10.1017/pan.2021.38 is OK
- 10.32614/RJ-2014-011 is OK
- 10.2139/ssrn.3214172 is OK
- 10.1080/01621459.1969.10501049 is OK

MISSING DOIs

- None

INVALID DOIs

- None
editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

samhforbes commented 1 year ago

👋🏼 @beniaminogreen, @cjbarrie, @wincowgerDEV this is the review thread for the paper. All of our communications will happen here from now on.

As a reviewer, the first step is to create a checklist for your review by entering

@editorialbot generate my checklist

as the top of a new comment in this thread.

These checklists contain the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. The first comment in this thread also contains links to the JOSS reviewer guidelines.

The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. Please also feel free to comment and ask questions on this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.

We aim for reviews to be completed within about 2-4 weeks. Please let me know if any of you require some more time. We can also use EditorialBot (our bot) to set automatic reminders if you know you'll be away for a known period of time.

Please feel free to ping me if you have any questions/concerns.

cjbarrie commented 1 year ago

Review checklist for @cjbarrie

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

cjbarrie commented 1 year ago

This is a very good package. Easy to use and well written. There are just a few small changes I would recommend.

  1. You do not provide the code you used to arrive at the benchmarks. It would be helpful to have this to test the claims of the article.
  2. You do not attach data in the package that you use (corpus_1 and corpus_2) to demonstrate the package functionality. I think some version of this could be attached within R package file size constraints. If not, just provide a link as Wickham has done for e.g., babynames package in the past.
  3. It would be helpful to have vignette documentation for the other package functions. The documentation remains quite minimal at the moment.
  4. There are no Community guidelines I can see on the Github repo.

Additionally, some small typos in the article:

For validating the package functionality, I did the following for now. It would be helpful if some version of this were incorporated in the documentation so users can test functionality easily:


library(dplyr)

url <- "https://raw.githubusercontent.com/beniaminogreen/zoomerjoin/main/vignettes/bonica.csv"
data <- read.csv(url(url))

set.seed(123L)
corpus_1 <- data %>%
  sample_n(500000) %>%
  select(x) %>%
  rename(field = x)

corpus_2 <- data %>%
  sample_n(500000) %>%
  select(x) %>%
  rename(field = x)

start_time <- Sys.time()
join_out <- jaccard_inner_join(corpus_1, corpus_2,
                               by = "field", n_gram_width=6,
                               n_bands=20, band_width=6, threshold = .8)
print(Sys.time() - start_time)
beniaminogreen commented 1 year ago

Hi,

Thanks for your detailed feedback and your kind words about the package.

Writing in response to the four points you raise:

1) Thanks for flagging the lack of the benchmarking code. I originally included this in the package, but it was accidentally deleted at some stage. I will include it at the top of the benchmarking vignette so that it's easily available to you and other users. Unfortunately, the benchmarking takes something like 6 hours to run on a 8-core server. Most of this time is spent running the fuzzyjoin code, so it might be practical to only run the zoomerjoin code to verify its performance. Alternatively, I can reduce the size of the datasets or the number of trials to make sure it is replicable in a reasonable amount of time.

2) I have the corpus data in the vignettes file, but I am not sure if this is the right place. I will read through the R package book and make sure that I put the files in an appropriate place, and that the files are visible to users. I also very much like the code snippet that you provided, so I will include it or something similar.

3) I agree that the package should have better documentation for the Euclidean distance joins, and the fuzzy string grouping, which I will add in the next few days.

4) I have included a contributing guide / community code of conduct in the contributing.md file in the root of the repo. Not sure if this is the right place, or the right format for the guidelines. I'll change it in the next commit if this does not meet the requirements.

I'll also make sure to fix the typos you flag in the paper.

Best and many thanks, Ben

beniaminogreen commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

beniaminogreen commented 1 year ago

Writing to share some of the updates to the package over the past few days. In addition to fixing typos and fixing a bug in the euclidean joining code, I have:

1) Updated the guided tour vingette to show off the jaccard_string_group and jaccard_curve functions, as well as adding a very short vingette on joining using the euclidean distance.

2) added the DIME / Bonica data so it is packaged with the R repository, which makes the functionality of the package easier to experiment. This change is reflected in the README and vignettes.

3) The benchmarking vingette now contains the code that is used to generate the benchmarks (although it isn't run every time). As mentioned before, this code takes several hours to run due to the large size of the datasets used.

Because of this, I have also uploaded this gist which runs the same benchmarking code over a smaller set of inputs to show the difference in scaling between the two methods for smaller datasets. It should run in ~20 mins on a modern laptop.

wincowgerDEV commented 1 year ago

Review checklist for @wincowgerDEV

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

wincowgerDEV commented 1 year ago

@beniaminogreen, I ran into an issue with the installation and posted here: https://github.com/beniaminogreen/zoomerjoin/issues/70

wincowgerDEV commented 1 year ago

@beniaminogreen recommendation on line 77 of the manuscript, I don't think you want to claim that your package never creates false positives. True positive matches in character datasets are often subjective and may be complex (beyond the capabilities of jacard similarity), it may confuse readers when they do not get the expected output.

wincowgerDEV commented 1 year ago

@beniaminogreen, I recommend adding some code examples in text to demonstrate your points, it will help readers understand which function to use to achieve which outcomes you are talking about.

wincowgerDEV commented 1 year ago

@beniaminogreen, it isn't clear where the code for producing the euclidean synthetic dataset is described in the manuscript. I think there should be a link in the manuscript to where the code is that produced the figure.

beniaminogreen commented 1 year ago

I agree with the comment that "no false positives" might mislead users as to what zoomerjoin can accomplish. I'd still like to flag the fact that zoomerjoin will never identify units that are less similar than the threshold the user selects as matches, but I'll see if there's a way to say this that sets expectations better.

The code for the synthetic dataset and benchmarking can be found in this vignette, and I'll include a link in the paper. I also like the suggestion of adding code examples, so I'll add a small example of joining using the Euclidean and Jaccard distances to the article.

I thought that including a contrubuting.md file was enough to satisfy the JOSS requirements, but since both of you have raised this as a concern, I will write up a more detailed set of instructions on contributing or reporting bugs.

Best, Ben

beniaminogreen commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

beniaminogreen commented 1 year ago

Updating the article to make it responsive to comments above. Specifically: have added code for the synthetic dataset, linked to benchmarking vignette, and added an example use of euclidean_inner_join. Also, have removed sentence about no false positives because I agree it could set the wrong expectation for people considering the package.

Added contributing instructions in a separate update to the repo.

Best, Ben

beniaminogreen commented 1 year ago

Hi All,

Thanks for your detailed feedback about the package.

In response to some of the comments you've given, I have made the rust code in the package capture R's random seed, so you can set the seed and make sure that all the results are repicable. This helped me catch a locking issue with some of the rust code.

I've bundled a version of the dime dataset with the package so that users can start experimenting with the joining function immediately after installing the package.

It looks like both of you are having difficulty replicating some of the performance claims in the paper. Ideally, I would like to include benchmarks on large datasets to show off the scaling of the package, but I recognize that this makes it tough for replicators. Is there any way I can make it easier for you to run the benchmarks?

Similarly, please don't hesitate to let me know if I can make it easier for you to sign off on any of the other steps.

Best, and many thanks for your help, Ben

cjbarrie commented 1 year ago

Thanks for this, Ben. I'm now happy with how things stand

wincowgerDEV commented 1 year ago

@beniaminogreen, I am trying to run through the examples a last time to see how the update works. When I run your example at the readme.md: https://github.com/beniaminogreen/zoomerjoin

corpus_1 doesn't appear to be part of your package. Could you use dime_data to split corpus 1 and 2 for the user so that they can get up and running without having to guess how they should be split? I see you have done that in the vignette but I think the main readme is a pretty important place to streamline the workflow for users too.

Other than that the tutorials seem to confirm how I would expect the package to function now and for the most part the benchmarks are similar. I think it is fine that your benchmarks take a long time to run, good benchmark tests always do.

beniaminogreen commented 1 year ago

Good catch. Corpus 1 and 2 were created in one of the vignettes, but I also just displayed the code used to generate them in the package's README. This should be visible on the github now, and visible on the package site in the next few minutes after the deployment finishes.

wincowgerDEV commented 1 year ago

Sounds good, I checked all the boxes so I think you're good to publish.

beniaminogreen commented 1 year ago

Fantastic, thanks both for your help!

beniaminogreen commented 1 year ago

@editorialbot commands

editorialbot commented 1 year ago

Hello @beniaminogreen, here are the things you can ask me to do:


# List all available commands
@editorialbot commands

# Get a list of all editors's GitHub handles
@editorialbot list editors

# Check the references of the paper for missing DOIs
@editorialbot check references

# Perform checks on the repository
@editorialbot check repository

# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist

# Set a value for branch
@editorialbot set joss-paper as branch

# Generates the pdf paper
@editorialbot generate pdf

# Generates a LaTeX preprint file
@editorialbot generate preprint

# Get a link to the complete list of reviewers
@editorialbot list reviewers
beniaminogreen commented 1 year ago

Hi, @samhforbes. I think that both of my reviewers have completed their checklists. Trying to understand what the next steps are on my part: should I create an archived release at this point, or would this be getting ahead of things?

Best, Ben

samhforbes commented 1 year ago

Thanks to both @wincowgerDEV and @cjbarrie for your thoughtful reviews

samhforbes commented 1 year ago

@beniaminogreen I'll do some final checks and then bring the post review process. I'll likely be a little slow as I have covid at the moment, but will check in soon

beniaminogreen commented 1 year ago

Thanks for your help and sorry to hear about the COVID! Please take your time.

Best, Ben

samhforbes commented 1 year ago

Post-Review Checklist for Editor and Authors

Additional Author Tasks After Review is Complete

Editor Tasks Prior to Acceptance

samhforbes commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

samhforbes commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1109/sequen.1997.666900 is OK
- 10.1017/CBO9781139924801 is OK
- 10.1093/bioinformatics/btz354 is OK
- 10.1145/997817.997857 is OK
- 10.1017/pan.2021.38 is OK
- 10.32614/RJ-2014-011 is OK
- 10.2139/ssrn.3214172 is OK
- 10.1080/01621459.1969.10501049 is OK

MISSING DOIs

- None

INVALID DOIs

- None
samhforbes commented 1 year ago

Hi @beniaminogreen I'm satisfied that everything has been addressed, and I'm really impressed with the package, well done. You can see above I have created a post-review checklist, please work your way through those items, and then I'm ready to recommend acceptance.

beniaminogreen commented 1 year ago

Thanks for your kind words about the package! I have just completed the checklist below:

Version Number: v0.1.0 Zenodo Release: 10.5281/zenodo.8370652

DOI

Best and thanks for your help, Ben

samhforbes commented 1 year ago

@editorialbot set v0.1.0 as version

editorialbot commented 1 year ago

Done! version is now v0.1.0

samhforbes commented 1 year ago

@editorialbot set 10.5281/zenodo.8370652 as archive

editorialbot commented 1 year ago

Done! archive is now 10.5281/zenodo.8370652

samhforbes commented 1 year ago

@editorialbot recommend-accept

editorialbot commented 1 year ago
Attempting dry run of processing paper acceptance...
editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1109/sequen.1997.666900 is OK
- 10.1017/CBO9781139924801 is OK
- 10.1093/bioinformatics/btz354 is OK
- 10.1145/997817.997857 is OK
- 10.1017/pan.2021.38 is OK
- 10.32614/RJ-2014-011 is OK
- 10.2139/ssrn.3214172 is OK
- 10.1080/01621459.1969.10501049 is OK

MISSING DOIs

- None

INVALID DOIs

- None
editorialbot commented 1 year ago

The paper's PDF and metadata files generation produced some warnings that could prevent the final paper from being published. Please fix them before the end of the review process.

citation rextednr not found
editorialbot commented 1 year ago

:warning: Error preparing paper acceptance. The generated XML metadata file is invalid.

IDREFS attribute rid references an unknown ID "ref-rextednr"
beniaminogreen commented 1 year ago

on it.