Closed editorialbot closed 1 year ago
Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.
For a list of things I can do to help you, just type:
@editorialbot commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@editorialbot generate pdf
Software report:
github.com/AlDanial/cloc v 1.88 T=0.21 s (211.6 files/s, 24419.4 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Markdown 6 239 0 960
HTML 1 12 2 755
R 18 189 784 706
Rust 6 125 9 479
JSON 1 0 0 189
YAML 6 35 12 183
TeX 1 0 0 141
Rmd 3 73 205 55
TOML 1 2 0 19
C 1 2 2 4
Bourne Shell 1 1 8 2
-------------------------------------------------------------------------------
SUM: 45 678 1022 3493
-------------------------------------------------------------------------------
gitinspector failed to run statistical information for the repository
Wordcount for paper.md
is 1491
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1109/sequen.1997.666900 is OK
- 10.1017/CBO9781139924801 is OK
- 10.1093/bioinformatics/btz354 is OK
- 10.1145/997817.997857 is OK
- 10.1017/pan.2021.38 is OK
- 10.32614/RJ-2014-011 is OK
- 10.2139/ssrn.3214172 is OK
- 10.1080/01621459.1969.10501049 is OK
MISSING DOIs
- None
INVALID DOIs
- None
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
👋🏼 @beniaminogreen, @cjbarrie, @wincowgerDEV this is the review thread for the paper. All of our communications will happen here from now on.
As a reviewer, the first step is to create a checklist for your review by entering
@editorialbot generate my checklist
as the top of a new comment in this thread.
These checklists contain the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. The first comment in this thread also contains links to the JOSS reviewer guidelines.
The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. Please also feel free to comment and ask questions on this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.
We aim for reviews to be completed within about 2-4 weeks. Please let me know if any of you require some more time. We can also use EditorialBot (our bot) to set automatic reminders if you know you'll be away for a known period of time.
Please feel free to ping me if you have any questions/concerns.
This is a very good package. Easy to use and well written. There are just a few small changes I would recommend.
babynames
package in the past. Additionally, some small typos in the article:
For validating the package functionality, I did the following for now. It would be helpful if some version of this were incorporated in the documentation so users can test functionality easily:
library(dplyr)
url <- "https://raw.githubusercontent.com/beniaminogreen/zoomerjoin/main/vignettes/bonica.csv"
data <- read.csv(url(url))
set.seed(123L)
corpus_1 <- data %>%
sample_n(500000) %>%
select(x) %>%
rename(field = x)
corpus_2 <- data %>%
sample_n(500000) %>%
select(x) %>%
rename(field = x)
start_time <- Sys.time()
join_out <- jaccard_inner_join(corpus_1, corpus_2,
by = "field", n_gram_width=6,
n_bands=20, band_width=6, threshold = .8)
print(Sys.time() - start_time)
Hi,
Thanks for your detailed feedback and your kind words about the package.
Writing in response to the four points you raise:
1) Thanks for flagging the lack of the benchmarking code. I originally included this in the package, but it was accidentally deleted at some stage. I will include it at the top of the benchmarking vignette so that it's easily available to you and other users. Unfortunately, the benchmarking takes something like 6 hours to run on a 8-core server. Most of this time is spent running the fuzzyjoin code, so it might be practical to only run the zoomerjoin code to verify its performance. Alternatively, I can reduce the size of the datasets or the number of trials to make sure it is replicable in a reasonable amount of time.
2) I have the corpus data in the vignettes file, but I am not sure if this is the right place. I will read through the R package book and make sure that I put the files in an appropriate place, and that the files are visible to users. I also very much like the code snippet that you provided, so I will include it or something similar.
3) I agree that the package should have better documentation for the Euclidean distance joins, and the fuzzy string grouping, which I will add in the next few days.
4) I have included a contributing guide / community code of conduct in the contributing.md file in the root of the repo. Not sure if this is the right place, or the right format for the guidelines. I'll change it in the next commit if this does not meet the requirements.
I'll also make sure to fix the typos you flag in the paper.
Best and many thanks, Ben
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Writing to share some of the updates to the package over the past few days. In addition to fixing typos and fixing a bug in the euclidean joining code, I have:
1) Updated the guided tour vingette to show off the jaccard_string_group
and jaccard_curve
functions, as well as adding a very short vingette on joining using the euclidean distance.
2) added the DIME / Bonica data so it is packaged with the R repository, which makes the functionality of the package easier to experiment. This change is reflected in the README and vignettes.
3) The benchmarking vingette now contains the code that is used to generate the benchmarks (although it isn't run every time). As mentioned before, this code takes several hours to run due to the large size of the datasets used.
Because of this, I have also uploaded this gist which runs the same benchmarking code over a smaller set of inputs to show the difference in scaling between the two methods for smaller datasets. It should run in ~20 mins on a modern laptop.
@beniaminogreen, I ran into an issue with the installation and posted here: https://github.com/beniaminogreen/zoomerjoin/issues/70
@beniaminogreen recommendation on line 77 of the manuscript, I don't think you want to claim that your package never creates false positives. True positive matches in character datasets are often subjective and may be complex (beyond the capabilities of jacard similarity), it may confuse readers when they do not get the expected output.
@beniaminogreen, I recommend adding some code examples in text to demonstrate your points, it will help readers understand which function to use to achieve which outcomes you are talking about.
@beniaminogreen, it isn't clear where the code for producing the euclidean synthetic dataset is described in the manuscript. I think there should be a link in the manuscript to where the code is that produced the figure.
I agree with the comment that "no false positives" might mislead users as to what zoomerjoin can accomplish. I'd still like to flag the fact that zoomerjoin will never identify units that are less similar than the threshold the user selects as matches, but I'll see if there's a way to say this that sets expectations better.
The code for the synthetic dataset and benchmarking can be found in this vignette, and I'll include a link in the paper. I also like the suggestion of adding code examples, so I'll add a small example of joining using the Euclidean and Jaccard distances to the article.
I thought that including a contrubuting.md file was enough to satisfy the JOSS requirements, but since both of you have raised this as a concern, I will write up a more detailed set of instructions on contributing or reporting bugs.
Best, Ben
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Updating the article to make it responsive to comments above. Specifically: have added code for the synthetic dataset, linked to benchmarking vignette, and added an example use of euclidean_inner_join
. Also, have removed sentence about no false positives because I agree it could set the wrong expectation for people considering the package.
Added contributing instructions in a separate update to the repo.
Best, Ben
Hi All,
Thanks for your detailed feedback about the package.
In response to some of the comments you've given, I have made the rust code in the package capture R's random seed, so you can set the seed and make sure that all the results are repicable. This helped me catch a locking issue with some of the rust code.
I've bundled a version of the dime dataset with the package so that users can start experimenting with the joining function immediately after installing the package.
It looks like both of you are having difficulty replicating some of the performance claims in the paper. Ideally, I would like to include benchmarks on large datasets to show off the scaling of the package, but I recognize that this makes it tough for replicators. Is there any way I can make it easier for you to run the benchmarks?
Similarly, please don't hesitate to let me know if I can make it easier for you to sign off on any of the other steps.
Best, and many thanks for your help, Ben
Thanks for this, Ben. I'm now happy with how things stand
@beniaminogreen, I am trying to run through the examples a last time to see how the update works. When I run your example at the readme.md: https://github.com/beniaminogreen/zoomerjoin
corpus_1 doesn't appear to be part of your package. Could you use dime_data to split corpus 1 and 2 for the user so that they can get up and running without having to guess how they should be split? I see you have done that in the vignette but I think the main readme is a pretty important place to streamline the workflow for users too.
Other than that the tutorials seem to confirm how I would expect the package to function now and for the most part the benchmarks are similar. I think it is fine that your benchmarks take a long time to run, good benchmark tests always do.
Good catch. Corpus 1 and 2 were created in one of the vignettes, but I also just displayed the code used to generate them in the package's README. This should be visible on the github now, and visible on the package site in the next few minutes after the deployment finishes.
Sounds good, I checked all the boxes so I think you're good to publish.
Fantastic, thanks both for your help!
@editorialbot commands
Hello @beniaminogreen, here are the things you can ask me to do:
# List all available commands
@editorialbot commands
# Get a list of all editors's GitHub handles
@editorialbot list editors
# Check the references of the paper for missing DOIs
@editorialbot check references
# Perform checks on the repository
@editorialbot check repository
# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist
# Set a value for branch
@editorialbot set joss-paper as branch
# Generates the pdf paper
@editorialbot generate pdf
# Generates a LaTeX preprint file
@editorialbot generate preprint
# Get a link to the complete list of reviewers
@editorialbot list reviewers
Hi, @samhforbes. I think that both of my reviewers have completed their checklists. Trying to understand what the next steps are on my part: should I create an archived release at this point, or would this be getting ahead of things?
Best, Ben
Thanks to both @wincowgerDEV and @cjbarrie for your thoughtful reviews
@beniaminogreen I'll do some final checks and then bring the post review process. I'll likely be a little slow as I have covid at the moment, but will check in soon
Thanks for your help and sorry to hear about the COVID! Please take your time.
Best, Ben
@editorialbot set <DOI here> as archive
@editorialbot set <version here> as version
@editorialbot generate pdf
@editorialbot check references
and ask author(s) to update as needed@editorialbot recommend-accept
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1109/sequen.1997.666900 is OK
- 10.1017/CBO9781139924801 is OK
- 10.1093/bioinformatics/btz354 is OK
- 10.1145/997817.997857 is OK
- 10.1017/pan.2021.38 is OK
- 10.32614/RJ-2014-011 is OK
- 10.2139/ssrn.3214172 is OK
- 10.1080/01621459.1969.10501049 is OK
MISSING DOIs
- None
INVALID DOIs
- None
Hi @beniaminogreen I'm satisfied that everything has been addressed, and I'm really impressed with the package, well done. You can see above I have created a post-review checklist, please work your way through those items, and then I'm ready to recommend acceptance.
Thanks for your kind words about the package! I have just completed the checklist below:
Version Number: v0.1.0 Zenodo Release: 10.5281/zenodo.8370652
Best and thanks for your help, Ben
@editorialbot set v0.1.0 as version
Done! version is now v0.1.0
@editorialbot set 10.5281/zenodo.8370652 as archive
Done! archive is now 10.5281/zenodo.8370652
@editorialbot recommend-accept
Attempting dry run of processing paper acceptance...
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1109/sequen.1997.666900 is OK
- 10.1017/CBO9781139924801 is OK
- 10.1093/bioinformatics/btz354 is OK
- 10.1145/997817.997857 is OK
- 10.1017/pan.2021.38 is OK
- 10.32614/RJ-2014-011 is OK
- 10.2139/ssrn.3214172 is OK
- 10.1080/01621459.1969.10501049 is OK
MISSING DOIs
- None
INVALID DOIs
- None
The paper's PDF and metadata files generation produced some warnings that could prevent the final paper from being published. Please fix them before the end of the review process.
citation rextednr not found
:warning: Error preparing paper acceptance. The generated XML metadata file is invalid.
IDREFS attribute rid references an unknown ID "ref-rextednr"
on it.
Submitting author: !--author-handle-->@beniaminogreen<!--end-author-handle-- (Beniamino Green) Repository: https://github.com/beniaminogreen/zoomerjoin Branch with paper.md (empty if default branch): joss Version: v0.1.0 Editor: !--editor-->@samhforbes<!--end-editor-- Reviewers: @cjbarrie, @wincowgerDEV Archive: 10.5281/zenodo.8370652
Status
Status badge code:
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@cjbarrie & @wincowgerDEV, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @samhforbes know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Checklists
📝 Checklist for @cjbarrie
📝 Checklist for @wincowgerDEV