[REVIEW]: Talisman: a JavaScript archive of fuzzy matching, information retrieval and record linkage building blocks

whedon commented 4 years ago

Submitting author: @Yomguithereal (Guillaume Plique) Repository: https://github.com/Yomguithereal/talisman Version: 1.1.3-zenodo Editor: @kakiac Reviewer: @Fil, @atanikan Archive: 10.5281/zenodo.4247869

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/4f169de181f43c9dbf6b368dab7cea52"><img src="https://joss.theoj.org/papers/4f169de181f43c9dbf6b368dab7cea52/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/4f169de181f43c9dbf6b368dab7cea52/status.svg)](https://joss.theoj.org/papers/4f169de181f43c9dbf6b368dab7cea52)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@Fil & @atanikan, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @kakiac know.

✨ Please try and complete your review in the next six weeks ✨

Review checklist for @Fil

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@Yomguithereal) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @atanikan

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@Yomguithereal) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @Fil, @atanikan it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

whedon commented 4 years ago

Reference check summary:

OK DOIs

- 10.1109/MEDITEC.2016.7835390 is OK
- 10.1145/358027.358048 is OK
- 10.1109/SEQUEN.1997.666900 is OK
- 10.1145/1963405.1963487 is OK

MISSING DOIs

- https://doi.org/10.1007/3-540-45735-6_24 may be missing for title: String matching with metric trees using an approximate distance
- https://doi.org/10.1145/363958.363994 may be missing for title: A technique for computer detection and correction of spelling errors
- https://doi.org/10.2307/1932409 may be missing for title: Measures of the amount of ecologic association between species
- https://doi.org/10.1080/00182494.1976.10112645 may be missing for title: Surname spellings and computerized record linkage
- https://doi.org/10.1111/j.1469-8137.1912.tb05611.x may be missing for title: The distribution of the flora in the alpine zone. 1
- https://doi.org/10.1080/01621459.1989.10478785 may be missing for title: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida
- https://doi.org/10.1002/sim.4780140510 may be missing for title: Probabilistic linkage of large public health data files
- https://doi.org/10.6028/nbs.sp.500-2 may be missing for title: Accessing individual records from personal data files using non-unique identifiers
- https://doi.org/10.1016/0022-2836(81)90087-5 may be missing for title: Identification of common molecular subsequences
- https://doi.org/10.2307/1532326 may be missing for title: FONEM: Un code de transcription phonétique pour la reconstitution automatique des familles saguenayennes
- https://doi.org/10.1109/itcc.2002.1000354 may be missing for title: Improving precision and recall for soundex retrieval
- https://doi.org/10.1145/2348828.2348830 may be missing for title: Hybrid matching algorithm for personal names
- https://doi.org/10.1002/(sici)1097-4571(199301)44:1<1::aid-asi1>3.0.co;2-1 may be missing for title: Stemming of French words based on grammatical categories
- https://doi.org/10.1002/(sici)1097-4571(1999)50:10<944::aid-asi9>3.0.co;2-q may be missing for title: A stemming procedure and stopword list for general French corpora
- https://doi.org/10.1108/eb026966 may be missing for title: A stemming algorithm for Latin text databases
- https://doi.org/10.1145/101306.101310 may be missing for title: Another stemmer
- https://doi.org/10.1162/coli.2006.32.4.485 may be missing for title: Unsupervised multilingual sentence boundary detection

INVALID DOIs

- None

kakiac commented 4 years ago

@whedon re-invite @atanikan as reviewer

whedon commented 4 years ago

@atanikan already has access.

kakiac commented 4 years ago

👋 @Fil many thanks for your comments - @yomguithereal can you let us know once you have addressed @Fil 's comments?

@atanikan let me know if you need any help with the reviewing process - the review criteria you need to follow are outlined here: https://joss.readthedocs.io/en/latest/review_criteria.html and a useful checklist is above (with comments and explanations here: https://joss.readthedocs.io/en/latest/review_checklist.html) feel free to check the boxes directly!

Many thanks all for all your hard work in improving the library and the paper,

Yomguithereal commented 4 years ago

@Yomguithereal can you let us know once you have addressed @Fil 's comments?

Sure, I will tell you when this is done. I first need to find a little bit of time to do so, hope there is no time limit on this :) Thanks for reviewing the library.

atanikan commented 4 years ago

@kakiac Thanks for the reminder :) @Yomguithereal The installation works. I checked some of your functions and they work well. Few minor suggestions, a) Personally I'd find it more intuitive to do var talis = require('talisman'); talis.metrics.cosine([1,2],[3,4]); over var cos = require('talisman/metrics/cosine'); cos([1,2],[3,4]); I guess the issue does speak to it.

b) A comparative benchmark study as mentioned here which can also be added to your documentation website would be super useful.

c) Mentioning somewhere in your documentation examples using "require" as well as ES6 "imports" would be useful for beginners

Yomguithereal commented 4 years ago

I guess the issue does speak to it.

Yes the issue is indeed about this point. The reason why this was implemented thusly, in the beginning, is because it was written at a time where tree-shaking bundler were not yet a thing and since the library is really heavy on its own, it was the only way to make sure people would only require pieces of it on a need-to basis. But I agree this is somewhat unnecessary when using the library in node.js

I will add a centralized endpoint in a near future but I need some time to be sure I can do so while supporting ES6 imports, CommonJS & TypeScript which is unfortunately always a drag.

A comparative benchmark study as mentioned here which can also be added to your documentation website would be super useful.

This issue (imprecisely) refer to the fact that you can use the library to benchmark several methods such as phonetic encoders against a database of, say, given names. I was not refering to performance benchmark against other similar libraries. Is that what you had in mind?

atanikan commented 4 years ago

I will add a centralized endpoint in a near future but I need some time to be sure I can do so while supporting ES6 imports, CommonJS & TypeScript which is unfortunately always a drag.

That's fair! I guess we can pass on it for now.

This issue (imprecisely) refer to the fact that you can use the library to benchmark several methods such as phonetic encoders against a database of, say, given names. I was not referring to performance benchmark against other similar libraries. Is that what you had in mind?

Then yes, I meant performance benchmarking with other similar libraries. I understand Talisman aims to be an archive/collection of several text-based search/information retrieval techniques. And not exactly a brand new algorithm. I however feel some benchmarking, be it in terms of accuracy or speed of execution by taking a large dataset as an example, may prove to be useful for wider use and a greater acceptance of your library.

kakiac commented 4 years ago

Thanks all!

@Yomguithereal, re this:

hope there is no time limit on this :)

No, please take your time to implement the changes as suggested by the reviewers - once you are happy (and the reviewers are happy 😃 ) we can move the submission forward to the next step!

labarba commented 4 years ago

👋 hi @Yomguithereal

It looks like you needed time to implement reviewer-requested mods. Can we have an update on the progress? If you need more time, just let us know—times are tough!

Yomguithereal commented 4 years ago

Hello @labarba. Sorry for the delay. I haven't progressed much but I should be able to find time to do so in september. Times indeed are tough :). Thanks for enquiring.

kakiac commented 4 years ago

@whedon remind @Yomguithereal in two weeks

whedon commented 4 years ago

Reminder set for @Yomguithereal in two weeks

whedon commented 4 years ago

:wave: @Yomguithereal, please update us on how things are progressing here.

kakiac commented 4 years ago

👋 @Yomguithereal - hope you are well! I am just following up to see how things are progressing with implementing the reviewer's suggestions? Let me know if there is something I can do to help our end :) Hope you are well 👍

Yomguithereal commented 4 years ago

Hello @kakiac. I should have some time thursday to act on those suggestions. So far I need to update the paper to change its framing a bit, add some references to similar tools and switch to a leaner bibliography (as opposed to the one containing all of the papers describing the implemented functions and methods). I also have some one related to the library's packaging. I would like to address those but later since it is very time consuming.

Yomguithereal commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

Yomguithereal commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

Yomguithereal commented 4 years ago

@atanikan I have added a note regarding module imports and require here.

Regarding:

Then yes, I meant performance benchmarking with other similar libraries. I understand Talisman aims to be an archive/collection of several text-based search/information retrieval techniques. And not exactly a brand new algorithm. I however feel some benchmarking, be it in terms of accuracy or speed of execution by taking a large dataset as an example, may prove to be useful for wider use and a greater acceptance of your library.

There exists some speed benchmarks regarding some functions of the library (notably Levenshtein distance, because this library was used as an experimental starting point for the creation of many very fast implementation of the distance in JavaScript) but for most of the implemented methods there is nothing to benchmark against because nobody implemented those functions.

Regarding precision/recall etc. the issue here is that the relevant databases that could enable such benchmarks don't exist yet because most of those methods were created as heuristics geared toward the resolution of particular problems and so it made no sense to most of their authors to compare them to others. This said, I believe there is a point to be made regarding some classes of functions present in the library such as phonetic algorithms geared towards proper names and their precision/recall. I personally intend to develop such a benchmark and I have started created a database that could be used for this purpose (along with another phonetic algorithm I am developing right now), but unfortunately this is an ongoing work which won't be finished anywhere soon.

This said, do you want me to, at least, try to compile some speed benchmarks regarding a subset of the library's functions that have some other JavaScript implementations?

Fil commented 4 years ago

My review is complete. Thank you @Yomguithereal!

atanikan commented 4 years ago

This said, do you want me to, at least, try to compile some speed benchmarks regarding a subset of the library's functions that have some other JavaScript implementations?

It's fine. I see the article does speak to the fact that one can benchmark if they would like. Since the article nor the library makes an outright claim to have the "fastest"/"best". I suppose we can ignore benchmarking stats for now. Appreciate the changes made in the documentation as well. @kakiac My review is complete. Thanks, @Yomguithereal. Fantastic job.

kakiac commented 4 years ago

@atanikan @Fil @Yomguithereal thank you so much for all your efforts and hard work on this! A fantastic submission :) I will now be initiating final checks and push through the last editorial steps for publication.

kakiac commented 4 years ago

Editor's "After reviewers recommend acceptance" checks:

[x] Get a new proof with the @whedon generate pdf command.
[x] Download the proof, check all references have DOIs, follow the links and check the references.
[x] Whedon can help check references with the command @whedon check references
[x] Proof-read the paper and ask authors to fix any remaining typos, badly formed citations, awkward wording, etc..
[x] Ask the author to make a tagged release (on github) and archive (on Zenodo or Figshare), and report the tagged version number and archive DOI in the review thread.
[x] Check the archive deposit has the correct metadata (title and author list), and request the author edit it if it doesn’t match the paper.
[x] Run @whedon set as archive.
[x] Run @whedon set <v1.x.x> as version if the version was updated.
[x] Run @whedon accept to generate the final proofs, which has Whedon notify the @openjournals/joss-eics team that the paper is ready for final processing.

kakiac commented 4 years ago

@whedon generate pdf

kakiac commented 4 years ago

@whedon check references

whedon commented 4 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 4 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1007/978-3-642-31164-2 is OK

MISSING DOIs

- 10.1111/j.1469-8137.1912.tb05611.x may be a valid DOI for title: The distribution of the flora in the alpine zone. 1
- 10.1007/0-387-69505-2 may be a valid DOI for title: Data quality and record linkage techniques

INVALID DOIs

- None

Yomguithereal commented 4 years ago

I pushed a commit adding the missing DOIs to the paper's bibliography.

kakiac commented 4 years ago

Thanks @Yomguithereal!

I have just finished with all the checks for this submission - if you could merge my pull request here: https://github.com/Yomguithereal/talisman/pull/177 with the main .bib file, that should work. The rest of the references were either very old and without a DOI or without a permanent identifier, so happy to leave as they are.

After you do this, can you please make a tagged release and archive the repo, and report the version number and archive DOI in this thread?

Not long now!

Yomguithereal commented 4 years ago

Sorry, what do you mean by "archive the repo"?

kakiac commented 4 years ago

@Yomguithereal apologies - I should have been clearer - Upon successful completion of the review, you need to make a tagged release of the software (via github, see here for info https://git-scm.com/book/en/v2/Git-Basics-Tagging), and deposit a copy of the repository with a data-archiving service such as Zenodo or figshare.

You then need to get the DOI you will be assigned from them (it is automatically assigned once you create a repository).

As soon as you have both copy the release version number (from github) and the DOI in this thread.

See here for more info under section The review process https://joss.readthedocs.io/en/latest/submitting.html#submitting-your-paper - let me know if you need any help!

An example of a recent tagged release on github is here: https://github.com/kgoldfeld/simstudy/ archived on Zenodo with the DOI: https://doi.org/10.5281/zenodo.4134675

kthyng commented 4 years ago

@kakiac Please wait to ping openjournals/joss-eics until a submission is ready to be taken over, if possible. Just saves a little bit of email traffic for us. Your last checkbox isn't needed since that happens with @whedon accept.

Yomguithereal commented 4 years ago

@kakiac the release is 1.1.3-zenodo and the attached DOI is 10.5281/zenodo.4247869 (but the https://doi.org/10.5281/zenodo.4247869 url does not yet work as I write this message. I guess it is just Zenodo/DOI lag?).

kakiac commented 4 years ago

Yes, no worries it worked for me just now, I think it takes some time to get registered in the doi database, I think. Thanks for doing that!

@Yomguithereal it seems that the authors list of your JOSS paper and the zenodo archive does not match. Can you please make sure they align?

kakiac commented 4 years ago

@kakiac Please wait to ping openjournals/joss-eics until a submission is ready to be taken over, if possible. Just saves a little bit of email traffic for us. Your last checkbox isn't needed since that happens with @whedon accept.

Oh apologies! I thought I needed to get that tagged if accepted by reviewers! Will bear that in mind in the future!

kthyng commented 4 years ago

@kakiac Your list looks correct now! Thanks!

kakiac commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

kakiac commented 4 years ago

@whedon set 1.1.3-zenodo as version

whedon commented 4 years ago

OK. 1.1.3-zenodo is the version.

kakiac commented 4 years ago

@Yomguithereal not sure if you spotted my message above - it seems that the authors list of your JOSS paper and the zenodo archive does not match. Can you please make sure they align?

kakiac commented 4 years ago

@whedon set 10.5281/zenodo.4247869 as archive

whedon commented 4 years ago

OK. 10.5281/zenodo.4247869 is the archive.

Yomguithereal commented 4 years ago

Is this ok now?

kakiac commented 4 years ago

Yes, thank you :)

kakiac commented 4 years ago

@whedon accept

openjournals / joss-reviews