editorialbot commented 1 year ago

Submitting author: !--author-handle-->@beniaminogreen@samhforbes<!--end-editor-- Reviewers: @cjbarrie, @wincowgerDEV Archive: 10.5281/zenodo.8370652

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b68ba4bfbce52e56585aea54a5bcba54"><img src="https://joss.theoj.org/papers/b68ba4bfbce52e56585aea54a5bcba54/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b68ba4bfbce52e56585aea54a5bcba54/status.svg)](https://joss.theoj.org/papers/b68ba4bfbce52e56585aea54a5bcba54)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@cjbarrie & @wincowgerDEV, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @samhforbes know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Checklists

📝 Checklist for @cjbarrie

📝 Checklist for @wincowgerDEV

editorialbot commented 1 year ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf

editorialbot commented 1 year ago

Software report:

github.com/AlDanial/cloc v 1.88  T=0.21 s (211.6 files/s, 24419.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Markdown                         6            239              0            960
HTML                             1             12              2            755
R                               18            189            784            706
Rust                             6            125              9            479
JSON                             1              0              0            189
YAML                             6             35             12            183
TeX                              1              0              0            141
Rmd                              3             73            205             55
TOML                             1              2              0             19
C                                1              2              2              4
Bourne Shell                     1              1              8              2
-------------------------------------------------------------------------------
SUM:                            45            678           1022           3493
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository

editorialbot commented 1 year ago

Wordcount for paper.md is 1491

editorialbot commented 1 year ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1109/sequen.1997.666900 is OK
- 10.1017/CBO9781139924801 is OK
- 10.1093/bioinformatics/btz354 is OK
- 10.1145/997817.997857 is OK
- 10.1017/pan.2021.38 is OK
- 10.32614/RJ-2014-011 is OK
- 10.2139/ssrn.3214172 is OK
- 10.1080/01621459.1969.10501049 is OK

MISSING DOIs

- None

INVALID DOIs

- None

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

samhforbes commented 1 year ago

👋🏼 @beniaminogreen, @cjbarrie, @wincowgerDEV this is the review thread for the paper. All of our communications will happen here from now on.

As a reviewer, the first step is to create a checklist for your review by entering

@editorialbot generate my checklist

as the top of a new comment in this thread.

These checklists contain the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. The first comment in this thread also contains links to the JOSS reviewer guidelines.

The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. Please also feel free to comment and ask questions on this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.

We aim for reviews to be completed within about 2-4 weeks. Please let me know if any of you require some more time. We can also use EditorialBot (our bot) to set automatic reminders if you know you'll be away for a known period of time.

Please feel free to ping me if you have any questions/concerns.

cjbarrie commented 1 year ago

Review checklist for @cjbarrie

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the https://github.com/beniaminogreen/zoomerjoin?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@beniaminogreen) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines
[x] Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
[x] Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
[x] Human and animal research: If the paper contains original data research on humans subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

cjbarrie commented 1 year ago

This is a very good package. Easy to use and well written. There are just a few small changes I would recommend.

You do not provide the code you used to arrive at the benchmarks. It would be helpful to have this to test the claims of the article.
You do not attach data in the package that you use (corpus_1 and corpus_2) to demonstrate the package functionality. I think some version of this could be attached within R package file size constraints. If not, just provide a link as Wickham has done for e.g., babynames package in the past.
It would be helpful to have vignette documentation for the other package functions. The documentation remains quite minimal at the moment.
There are no Community guidelines I can see on the Github repo.

Additionally, some small typos in the article:

In Summary section: "typically runs in runs in linear"
In Statement of Need section: " is probabilistic a probabilistic algorithm"
In Other Functionalities section: "implementation the probabilistic record-linkage"
In Other Functionalities section: "to pairs to units"

For validating the package functionality, I did the following for now. It would be helpful if some version of this were incorporated in the documentation so users can test functionality easily:


library(dplyr)

url <- "https://raw.githubusercontent.com/beniaminogreen/zoomerjoin/main/vignettes/bonica.csv"
data <- read.csv(url(url))

set.seed(123L)
corpus_1 <- data %>%
  sample_n(500000) %>%
  select(x) %>%
  rename(field = x)

corpus_2 <- data %>%
  sample_n(500000) %>%
  select(x) %>%
  rename(field = x)

start_time <- Sys.time()
join_out <- jaccard_inner_join(corpus_1, corpus_2,
                               by = "field", n_gram_width=6,
                               n_bands=20, band_width=6, threshold = .8)
print(Sys.time() - start_time)

beniaminogreen commented 1 year ago

Hi,

Thanks for your detailed feedback and your kind words about the package.

Writing in response to the four points you raise:

1) Thanks for flagging the lack of the benchmarking code. I originally included this in the package, but it was accidentally deleted at some stage. I will include it at the top of the benchmarking vignette so that it's easily available to you and other users. Unfortunately, the benchmarking takes something like 6 hours to run on a 8-core server. Most of this time is spent running the fuzzyjoin code, so it might be practical to only run the zoomerjoin code to verify its performance. Alternatively, I can reduce the size of the datasets or the number of trials to make sure it is replicable in a reasonable amount of time.

2) I have the corpus data in the vignettes file, but I am not sure if this is the right place. I will read through the R package book and make sure that I put the files in an appropriate place, and that the files are visible to users. I also very much like the code snippet that you provided, so I will include it or something similar.

3) I agree that the package should have better documentation for the Euclidean distance joins, and the fuzzy string grouping, which I will add in the next few days.

4) I have included a contributing guide / community code of conduct in the contributing.md file in the root of the repo. Not sure if this is the right place, or the right format for the guidelines. I'll change it in the next commit if this does not meet the requirements.

I'll also make sure to fix the typos you flag in the paper.

Best and many thanks, Ben

beniaminogreen commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

beniaminogreen commented 1 year ago

Writing to share some of the updates to the package over the past few days. In addition to fixing typos and fixing a bug in the euclidean joining code, I have:

1) Updated the guided tour vingette to show off the jaccard_string_group and jaccard_curve functions, as well as adding a very short vingette on joining using the euclidean distance.

2) added the DIME / Bonica data so it is packaged with the R repository, which makes the functionality of the package easier to experiment. This change is reflected in the README and vignettes.

3) The benchmarking vingette now contains the code that is used to generate the benchmarks (although it isn't run every time). As mentioned before, this code takes several hours to run due to the large size of the datasets used.

Because of this, I have also uploaded this gist which runs the same benchmarking code over a smaller set of inputs to show the difference in scaling between the two methods for smaller datasets. It should run in ~20 mins on a modern laptop.

wincowgerDEV commented 1 year ago

Review checklist for @wincowgerDEV

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the https://github.com/beniaminogreen/zoomerjoin?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@beniaminogreen) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines
[x] Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
[x] Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
[x] Human and animal research: If the paper contains original data research on humans subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

wincowgerDEV commented 1 year ago

@beniaminogreen, I ran into an issue with the installation and posted here: https://github.com/beniaminogreen/zoomerjoin/issues/70

wincowgerDEV commented 1 year ago

@beniaminogreen recommendation on line 77 of the manuscript, I don't think you want to claim that your package never creates false positives. True positive matches in character datasets are often subjective and may be complex (beyond the capabilities of jacard similarity), it may confuse readers when they do not get the expected output.

wincowgerDEV commented 1 year ago

@beniaminogreen, I recommend adding some code examples in text to demonstrate your points, it will help readers understand which function to use to achieve which outcomes you are talking about.

wincowgerDEV commented 1 year ago

@beniaminogreen, it isn't clear where the code for producing the euclidean synthetic dataset is described in the manuscript. I think there should be a link in the manuscript to where the code is that produced the figure.

beniaminogreen commented 1 year ago

I agree with the comment that "no false positives" might mislead users as to what zoomerjoin can accomplish. I'd still like to flag the fact that zoomerjoin will never identify units that are less similar than the threshold the user selects as matches, but I'll see if there's a way to say this that sets expectations better.

The code for the synthetic dataset and benchmarking can be found in this vignette, and I'll include a link in the paper. I also like the suggestion of adding code examples, so I'll add a small example of joining using the Euclidean and Jaccard distances to the article.

I thought that including a contrubuting.md file was enough to satisfy the JOSS requirements, but since both of you have raised this as a concern, I will write up a more detailed set of instructions on contributing or reporting bugs.

Best, Ben

beniaminogreen commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

beniaminogreen commented 1 year ago

Updating the article to make it responsive to comments above. Specifically: have added code for the synthetic dataset, linked to benchmarking vignette, and added an example use of euclidean_inner_join. Also, have removed sentence about no false positives because I agree it could set the wrong expectation for people considering the package.

Added contributing instructions in a separate update to the repo.

Best, Ben

beniaminogreen commented 1 year ago

Hi All,

Thanks for your detailed feedback about the package.

In response to some of the comments you've given, I have made the rust code in the package capture R's random seed, so you can set the seed and make sure that all the results are repicable. This helped me catch a locking issue with some of the rust code.

I've bundled a version of the dime dataset with the package so that users can start experimenting with the joining function immediately after installing the package.

It looks like both of you are having difficulty replicating some of the performance claims in the paper. Ideally, I would like to include benchmarks on large datasets to show off the scaling of the package, but I recognize that this makes it tough for replicators. Is there any way I can make it easier for you to run the benchmarks?

Similarly, please don't hesitate to let me know if I can make it easier for you to sign off on any of the other steps.

Best, and many thanks for your help, Ben

cjbarrie commented 1 year ago

Thanks for this, Ben. I'm now happy with how things stand

wincowgerDEV commented 1 year ago

@beniaminogreen, I am trying to run through the examples a last time to see how the update works. When I run your example at the readme.md: https://github.com/beniaminogreen/zoomerjoin

corpus_1 doesn't appear to be part of your package. Could you use dime_data to split corpus 1 and 2 for the user so that they can get up and running without having to guess how they should be split? I see you have done that in the vignette but I think the main readme is a pretty important place to streamline the workflow for users too.

Other than that the tutorials seem to confirm how I would expect the package to function now and for the most part the benchmarks are similar. I think it is fine that your benchmarks take a long time to run, good benchmark tests always do.

beniaminogreen commented 1 year ago

Good catch. Corpus 1 and 2 were created in one of the vignettes, but I also just displayed the code used to generate them in the package's README. This should be visible on the github now, and visible on the package site in the next few minutes after the deployment finishes.

wincowgerDEV commented 1 year ago

Sounds good, I checked all the boxes so I think you're good to publish.

beniaminogreen commented 1 year ago

Fantastic, thanks both for your help!

beniaminogreen commented 1 year ago

@editorialbot commands

editorialbot commented 1 year ago

Hello @beniaminogreen, here are the things you can ask me to do:


# List all available commands
@editorialbot commands

# Get a list of all editors's GitHub handles
@editorialbot list editors

# Check the references of the paper for missing DOIs
@editorialbot check references

# Perform checks on the repository
@editorialbot check repository

# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist

# Set a value for branch
@editorialbot set joss-paper as branch

# Generates the pdf paper
@editorialbot generate pdf

# Generates a LaTeX preprint file
@editorialbot generate preprint

# Get a link to the complete list of reviewers
@editorialbot list reviewers

beniaminogreen commented 1 year ago

Hi, @samhforbes. I think that both of my reviewers have completed their checklists. Trying to understand what the next steps are on my part: should I create an archived release at this point, or would this be getting ahead of things?

Best, Ben

samhforbes commented 1 year ago

Thanks to both @wincowgerDEV and @cjbarrie for your thoughtful reviews

samhforbes commented 1 year ago

@beniaminogreen I'll do some final checks and then bring the post review process. I'll likely be a little slow as I have covid at the moment, but will check in soon

beniaminogreen commented 1 year ago

Thanks for your help and sorry to hear about the COVID! Please take your time.

Best, Ben

samhforbes commented 1 year ago

Post-Review Checklist for Editor and Authors

Additional Author Tasks After Review is Complete

[ ] Double check authors and affiliations (including ORCIDs)
[ ] Make a release of the software with the latest changes from the review and post the version number here. This is the version that will be used in the JOSS paper.
[ ] Archive the release on Zenodo/figshare/etc and post the DOI here.
[ ] Make sure that the title and author list (including ORCIDs) in the archive match those in the JOSS paper.
[ ] Make sure that the license listed for the archive is the same as the software license.

Editor Tasks Prior to Acceptance

[x] Read the text of the paper and offer comments/corrections (as either a list or a PR)
[x] Check the references in the paper for corrections (e.g. capitalization)
[x] Check that the archive title, author list, version tag, and the license are correct
[x] Set archive DOI with @editorialbot set <DOI here> as archive
[x] Set version with @editorialbot set <version here> as version
[x] Double check rendering of paper with @editorialbot generate pdf
[x] Specifically check the references with @editorialbot check references and ask author(s) to update as needed
[x] Recommend acceptance with @editorialbot recommend-accept

samhforbes commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

samhforbes commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1109/sequen.1997.666900 is OK
- 10.1017/CBO9781139924801 is OK
- 10.1093/bioinformatics/btz354 is OK
- 10.1145/997817.997857 is OK
- 10.1017/pan.2021.38 is OK
- 10.32614/RJ-2014-011 is OK
- 10.2139/ssrn.3214172 is OK
- 10.1080/01621459.1969.10501049 is OK

MISSING DOIs

- None

INVALID DOIs

- None

samhforbes commented 1 year ago

Hi @beniaminogreen I'm satisfied that everything has been addressed, and I'm really impressed with the package, well done. You can see above I have created a post-review checklist, please work your way through those items, and then I'm ready to recommend acceptance.

beniaminogreen commented 1 year ago

Thanks for your kind words about the package! I have just completed the checklist below:

[x] Double check authors and affiliations (including ORCIDs)
[X] Make a release of the software with the latest changes from the review and post the version number here. This is the version that will be used in the JOSS paper.
[X] Archive the release on Zenodo/figshare/etc and post the DOI here.
[x] Make sure that the title and author list (including ORCIDs) in the archive match those in the JOSS paper.
[x] Make sure that the license listed for the archive is the same as the software license.

Version Number: v0.1.0 Zenodo Release: 10.5281/zenodo.8370652

Best and thanks for your help, Ben

samhforbes commented 1 year ago

@editorialbot set v0.1.0 as version

editorialbot commented 1 year ago

Done! version is now v0.1.0

samhforbes commented 1 year ago

@editorialbot set 10.5281/zenodo.8370652 as archive

editorialbot commented 1 year ago

Done! archive is now 10.5281/zenodo.8370652

samhforbes commented 1 year ago

@editorialbot recommend-accept

editorialbot commented 1 year ago

Attempting dry run of processing paper acceptance...

editorialbot commented 1 year ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1109/sequen.1997.666900 is OK
- 10.1017/CBO9781139924801 is OK
- 10.1093/bioinformatics/btz354 is OK
- 10.1145/997817.997857 is OK
- 10.1017/pan.2021.38 is OK
- 10.32614/RJ-2014-011 is OK
- 10.2139/ssrn.3214172 is OK
- 10.1080/01621459.1969.10501049 is OK

MISSING DOIs

- None

INVALID DOIs

- None

editorialbot commented 1 year ago

The paper's PDF and metadata files generation produced some warnings that could prevent the final paper from being published. Please fix them before the end of the review process.

citation rextednr not found

editorialbot commented 1 year ago

:warning: Error preparing paper acceptance. The generated XML metadata file is invalid.

IDREFS attribute rid references an unknown ID "ref-rextednr"

beniaminogreen commented 1 year ago

on it.

openjournals / joss-reviews

[REVIEW]: Zoomerjoin: Superlatively Fast Fuzzy-Joins #5693

Status

Reviewer instructions & questions

Checklists

Review checklist for @cjbarrie

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @wincowgerDEV

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Post-Review Checklist for Editor and Authors

Additional Author Tasks After Review is Complete

Editor Tasks Prior to Acceptance