openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal

[REVIEW]: WordTokenizers.jl: Basic tools for tokenizing natural language in Julia #1956

Closed whedon closed 4 years ago

whedon commented 4 years ago

Submitting author: @oxinabox (Lyndon White) Repository: https://github.com/JuliaText/WordTokenizers.jl/ Version: v0.5.4 Editor: @will-rowe Reviewer: @leios, @ninjin Archive: 10.5281/zenodo.3663390

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/883b18d3c1cfbf8a5b2b53acb1f5ab81"><img src="https://joss.theoj.org/papers/883b18d3c1cfbf8a5b2b53acb1f5ab81/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/883b18d3c1cfbf8a5b2b53acb1f5ab81/status.svg)](https://joss.theoj.org/papers/883b18d3c1cfbf8a5b2b53acb1f5ab81)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@leios & @ninjin, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @will-rowe know.

Please try to complete your review in the next two weeks.

Review checklist for @leios

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @ninjin

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @leios, @ninjin it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means, with GitHub's default behaviour, you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:


  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications


For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 4 years ago
Attempting to check references...
whedon commented 4 years ago
Attempting PDF compilation. Reticulating splines etc...
whedon commented 4 years ago

OK DOIs

- 10.1137/141000671 is OK
- 10.1609/aaai.v33i01.33016843 is OK

MISSING DOIs

- None

INVALID DOIs

- None
whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

oxinabox commented 4 years ago

cc @Ayushk4

will-rowe commented 4 years ago

Hi @leios and @ninjin. Just chasing up this review - don't worry if you've not started yet, I realise it was the holidays etc.

Please let me know if you need anything. Cheers.

will-rowe commented 4 years ago

Hi @leios, @ninjin - just checking in with the reviews here. Can you please let me know when you might expect to have them completed?

leios commented 4 years ago

Hey @will-rowe, I am really sorry. I had a week to do it, but I was finishing my thesis revisions during that time, and then I had to move to my new position. I haven't had time to set up in my new place, so I don't know when I can do the review - this weekend at the earliest. I am making it a priority to do it then, but I understand if you want to try to find a new reviewer. Sorry again for the trouble.

ghost commented 4 years ago

@will-rowe: I should be able to handle it within the week.

will-rowe commented 4 years ago

Thanks @leios and @ninjin. Those timelines work for me. Let me know @leios if you don't manage to find time in the next couple of weeks. I'll ping again in a week to chase up

leios commented 4 years ago

Hey, trying to start the review. Sorry I'm late. Also sorry if this is a stupid question, but how do I check the boxes? Last time I did a review, I could just click on them, but this time I do not seem to have permission.

leios commented 4 years ago

I will submit a PR fixing a number of typos in the README and paper. I have a general question on the scope of this software regarding Asian language processing, such as Japanese. I know that this is possible with TinySegmenter.jl, but is there a way to use that package with WordTokenizers?

Ayushk4 commented 4 years ago

WordTokenizers' default tokenizer doesn't work well on some Asian languages, such as Chinese and Japanese.

But we can set up TinySegmenter.jl in the same way as the RevTok.jl example given in the README. Here's an example with the Japanese sample text taken from TinySegmenter.jl.

julia> using WordTokenizers

julia> tokenize("आप कैसे हैं?") |> print # Default tokenizer on Hindi
["आप", "कैसे", "हैं", "?"]

julia> tokenize("私の名前は中野です") |> print # Default tokenizer on Japanese 
["私の名前は中野です"]

julia> import TinySegmenter

julia> set_tokenizer(TinySegmenter.tokenize)

julia> tokenize("私の名前は中野です") |> print # TinySegmenter's tokenizer on Japanese
SubString{String}["私", "の", "名前", "は", "中野", "です"]
leios commented 4 years ago

Ah. I was missing the .tokenize. I don't know why I didn't put it there; clearly the documentation does the same thing for RevTok. Thanks!

With that, I think I can check all the boxes on the review!

oxinabox commented 4 years ago

I have a general question on the scope of this software regarding asian language processing, such as Japanese.

For interest the default tokenizer TokTok has:

been tested on, and gives reasonably good results for English, Persian, Russian, Czech, French, German, Vietnamese, Tajik, and a few others.

Which is to say, a bunch of languages using the Latin (English, French, German, Vietnamese, Czech), Cyrillic (Russian, Tajik), or Persian (Farsi) alphabet. All of these use space as the primary token separator.

We don't yet have anything that supports the morpheme segmentation needed for languages like Japanese or Chinese.

I would speculate, just based on the people involved in JuliaText, that we are likely to get support for one or more Indian languages before anything else. Most of those, as I understand it, are also space-separated.

I know that this is possible with TinySegmenter.jl, but is there a way to use this package with WordTokenizers?

@Ayushk4 has updated the README now with that example, which I think is a much better one, as RevTok.jl is unmaintained, and we also have our own built-in RevTok these days.

leios commented 4 years ago

I don't see the updated README yet. If you are going to change it, could you use something like "ジュリアプログラミング言語が大好きです!"?

It translates to "I love the Julia programming language!" but also shows how it segments all three alphabets (it lumps Hiragana words together, which might be an issue for some people).

will-rowe commented 4 years ago

Hi @leios. You should be able to edit the top comment in this thread - click the 3 dots and then edit to update your checklist (via markdown). Let me know if you're stuck - and thanks for the review!

leios commented 4 years ago

@will-rowe I don't have that option, only "copy link" and "quote reply"

will-rowe commented 4 years ago

hmmm - did you accept the invite at the top of the thread? Link here. If you have already done this, let me know and I'll have to ask someone as I'm out of ideas!

leios commented 4 years ago

Sorry, I should have known that was a requirement. Filled in everything now. Thanks!

will-rowe commented 4 years ago

Hi @ninjin - just pinging you to see how your review is going? Thanks!

ghost commented 4 years ago

@will-rowe: Ouch, sorry. On it today. Sorry for being the slowest reviewer you have on the roster and thank you for the poke.

@leios: I would prefer: “日本語はひらがなとカタカナと漢字で表されます。” (“Japanese is expressed using hiragana, katakana, and kanji.”) over your example sentence. Sure, this one is way more formal, but yours somehow sounds a bit odd to me. Perhaps because it expresses something somewhat intimate in a formal way?

ghost commented 4 years ago

Repository:

  1. No installation instructions as far as I can see. Although a bit silly for Julia, a single reference to the necessary command is enough.

Paper:

  1. I do not believe I have ever seen normalisation as a part of tokenisation, thus I would remove: “Such word tokenization also often includes some normalizing, such as correcting unusual spellings or removing all punctuations.”.

  2. It may be worth mentioning that “sentence splitting” is another term often used for “sentence segmentation”. For the record, I have never seen “sentence tokenisation” outside of NLTK, but I am happy to be corrected.

Non-mandatory (just comments, really):

  1. It may be wise to consider including purely statistical tokenisation algorithms such as those used by BERT and friends. They are becoming more and more popular and I suspect they may be the standard very soon.

  2. For some real competition in regards to performance, see this. It went public post-submission though.

All in all, a nice solid package. Well done!

will-rowe commented 4 years ago

Thanks for some excellent reviews @leios and @ninjin.

@oxinabox - can you please address @ninjin 's comments and let me know when you are ready for me to take a final look. We should be close to acceptance!

oxinabox commented 4 years ago

Repository:

No installation instructions as far as I can see. Although a bit silly for Julia, a single reference to the necessary command is enough.

Done.
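For reference, installation is a one-liner via Julia's package manager (the README presumably now references something along these lines, assuming the package is registered under the name WordTokenizers):

```julia
# Install WordTokenizers from the Julia General registry, then load it.
using Pkg
Pkg.add("WordTokenizers")

using WordTokenizers
```

Equivalently, `] add WordTokenizers` in the REPL's Pkg mode.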

Paper:

I do not believe I have ever seen normalisation as a part of tokenisation, thus I would remove: “Such word tokenization also often includes some normalizing, such as correcting unusual spellings or removing all punctuations.”.

Loosely speaking, I disagree, but:
I have removed that sentence, as it was overstated. It might be better to say "occasionally" or "sometimes", but either way it doesn't belong right at the start of the paper, rather in the docstrings of particular tokenizers. Arguably it's undesirable behaviour. Examples: the Twitter tokenizer (both ours and the original from NLTK) normalizes waaaaayyyy into waaayyy.

Apparently the Stanford Tokenizer has an americanize flag which normalizes colour to color. (Though this really feels like it is stepping outside the bounds of normalizing during tokenization.)

On the more minor end, the original Penn tokenizer normalizes " (a double quote) to '' (a pair of single quotes).
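To illustrate the kind of thing meant here, a minimal sketch of Penn-style quote normalization layered on top of tokenization. The `normalize_quotes` helper and the naive whitespace tokenizer are illustrative only, not part of WordTokenizers.jl's API:

```julia
# Sketch: Penn-style quote normalization applied after tokenization.
# `normalize_quotes` is a hypothetical helper, not a WordTokenizers.jl function.
normalize_quotes(tok) = tok == "\"" ? "''" : tok

# Naive whitespace tokenizer, standing in for a real tokenizer.
basic_tokenize(s) = split(s)

tokens = map(normalize_quotes, basic_tokenize("he said \" hello \""))
print(tokens)
```

The point being that this sort of rewriting is a (debatable) normalization step folded into the tokenizer, rather than tokenization proper.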

It may be worth mentioning that “sentence splitting” is another term often used for “sentence segmentation”. For the record, I have never seen “sentence tokenisation” outside of NLTK, but I am happy to be corrected.

Done in https://github.com/JuliaText/WordTokenizers.jl/pull/45; my co-author will post here once it is merged.

Non-mandatory (just comments, really):

It may be wise to consider including purely statistical tokenisation algorithms such as those used by BERT and friends. They are becoming more and more popular and I suspect they may be the standard very soon.

Good idea, https://github.com/JuliaText/WordTokenizers.jl/issues/44

For some real competition in regards to performance, see this. It went public post-submission though.

I think that will indeed be significantly hard to beat, but we should see how we measure up: https://github.com/JuliaText/WordTokenizers.jl/issues/46. Beyond the scope of this submission, though.

All in all, a nice solid package. Well done!

Thanks

ghost commented 4 years ago

@oxinabox: Agreed - if it said "occasionally" I would agree. I think there are alternative explanations for your examples though. The first two are examples of canonicalisation that have most likely been integrated into the tokeniser because it is the first piece of code that touches the "raw" text. For PTB, these are necessary for the text to be consistent with the pre-processing that PTB did back in the 90s, as many parsers will choke if they see anything unexpected.

Thanks again for a solid package and I am 200% in support of https://github.com/JuliaText/WordTokenizers.jl/issues/44!

Ayushk4 commented 4 years ago

The changes have been made to the paper. Thanks, @ninjin for the suggestions.

Ayushk4 commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

will-rowe commented 4 years ago

Great! Thank you everyone. I'll give it a final read through now

will-rowe commented 4 years ago

@whedon check references

whedon commented 4 years ago
Reference check summary:

OK DOIs

- 10.1137/141000671 is OK
- 10.1609/aaai.v33i01.33016843 is OK

MISSING DOIs

- None

INVALID DOIs

- None
will-rowe commented 4 years ago

Okay - everything looks good to me. Please can you tag a final release and archive it - post the version number and DOI back here.

oxinabox commented 4 years ago

DOI: 10.5281/zenodo.3648713 Version 0.5.4

Thanks

will-rowe commented 4 years ago

Thanks @oxinabox . The author names on the archive should match the paper authors please

oxinabox commented 4 years ago

Thanks @oxinabox . The author names on the archive should match the paper authors please

I am not sure what is controlling the names in the archive. All the people who are listed in the archive but not on the paper have made only small corrections to spelling in the README or comments. (Bar one, who made a small fix to a corner case in the regex.) These changes are appreciated, but I assume they fall below the contribution threshold for co-authorship on a JOSS paper. Perhaps I am wrong though?

will-rowe commented 4 years ago

No - you can have the author list as you'd like it (i.e. as in your paper). I'm not sure why Zenodo has autofilled all of those authors. You can click edit on the Zenodo page and then manually remove the authors who are not on your paper.

danielskatz commented 4 years ago

You can use a .zenodo.json file to tell Zenodo what you want automatically, but it's not as easy as it should be - see https://github.com/zenodo/zenodo/issues/1606 for some discussion on this.

oxinabox commented 4 years ago

Ok, should be fixed

will-rowe commented 4 years ago

@whedon set 10.5281/zenodo.3648713 as archive

whedon commented 4 years ago

OK. 10.5281/zenodo.3648713 is the archive.

will-rowe commented 4 years ago

@whedon set v0.5.4 as version

whedon commented 4 years ago

OK. v0.5.4 is the version.

will-rowe commented 4 years ago

Thanks @oxinabox for a nice submission. And many thanks to @leios and @ninjin for your excellent reviews.

Pinging @openjournals/joss-eics to move this to acceptance

kthyng commented 4 years ago

@oxinabox I'm taking over for the rest of your submission. Can you edit the Zenodo data a little more, so that the title matches that of your paper?

kthyng commented 4 years ago

@oxinabox I've made some changes to your text and bib file in PRs. Please merge what you agree with: #47 and #48

oxinabox commented 4 years ago

Done, thanks - good changes.

kthyng commented 4 years ago

@oxinabox I'm taking over for the rest of your submission. Can you edit the Zenodo data a little more, so that the title matches that of your paper?

@oxinabox Ok also this part.

oxinabox commented 4 years ago

@oxinabox I'm taking over for the rest of your submission. Can you edit the Zenodo data a little more, so that the title matches that of your paper?

@oxinabox Ok also this part.

I thought I did. I don't know why Zenodo is not showing the new title. Am I editing the wrong place?

(screenshot of the Zenodo record attached)