Closed whedon closed 4 years ago
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @leios, @ninjin it looks like you're currently assigned to review this paper :tada:.
:star: Important :star:
If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means, with GitHub's default behaviour, you will receive notifications (emails) for all reviews 😿
To fix this, do the following two things:
For a list of things I can do to help you, just type:
@whedon commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@whedon generate pdf
Attempting to check references...
Attempting PDF compilation. Reticulating splines etc...
OK DOIs
- 10.1137/141000671 is OK
- 10.1609/aaai.v33i01.33016843 is OK
MISSING DOIs
- None
INVALID DOIs
- None
cc @Ayushk4
Hi @leios and @ninjin. Just chasing up this review - don't worry if you've not started yet, I realise it was the holidays etc.
Please let me know if you need anything. Cheers.
Hi @leios, @ninjin - just checking in with the reviews here. Can you please let me know when you might expect to have them completed?
Hey @will-rowe, I am really sorry. I had a week to do it, but I was finishing my thesis revisions during that time, and then I had to move to my new position. I haven't had time to set up my stuff in my new place, so I don't know when I can do the review. The earliest would be this weekend. I am making it a priority to do it then, but I understand if you want to try to find a new reviewer. Sorry again for the trouble.
@will-rowe: I should be able to handle it within the week.
Thanks @leios and @ninjin. Those timelines work for me. Let me know @leios if you don't manage to find time in the next couple of weeks. I'll ping again in a week to chase up
Hey, trying to start the review. Sorry I'm late. Also sorry if this is a stupid question, but how do I check the boxes? Last time I did a review, I could just click on them, but this time I do not seem to have permission.
I will submit a PR fixing a number of typos in the README and paper. I have a general question on the scope of this software regarding Asian language processing, such as Japanese. I know that this is possible with TinySegmenter.jl, but is there a way to use this package with WordTokenizers?
WordTokenizers' default tokeniser doesn't work well on some Asian languages like Chinese and Japanese.
But we can set up TinySegmenter.jl in the same way as the RevTok.jl example given in the README. Here's an example with the Japanese sample text taken from TinySegmenter.jl.
julia> using WordTokenizers
julia> tokenize("आप कैसे हैं?") |> print # Default tokenizer on Hindi ("How are you?")
["आप", "कैसे", "हैं", "?"]
julia> tokenize("私の名前は中野です") |> print # Default tokenizer on Japanese ("My name is Nakano")
["私の名前は中野です"]
julia> import TinySegmenter
julia> set_tokenizer(TinySegmenter.tokenize)
julia> tokenize("私の名前は中野です") |> print # TinySegmenter's tokenizer on Japanese
SubString{String}["私", "の", "名前", "は", "中野", "です"]
Ah, I was missing the `.tokenize`. I don't know why I didn't put it there; clearly the documentation does the same thing for RevTok. Thanks!
With that, I think I can check all the boxes on the review!
I have a general question on the scope of this software regarding asian language processing, such as Japanese.
For interest, the default tokenizer, TokTok, has:
been tested on, and gives reasonably good results for English, Persian, Russian, Czech, French, German, Vietnamese, Tajik, and a few others.
Which is to say, a bunch of languages using the Latin (English, Czech, French, German, Vietnamese), Cyrillic (Russian, Tajik), or Persian (Farsi) alphabets. All of these use a space as the primary token separator.
We don't yet have anything that supports the morpheme segmentation needed for languages like Japanese or Chinese.
I would speculate, just based on the people involved in JuliaText, that we are likely to get support for one or more Indian languages before anything else. Most of those, as I understand it, are also space-separated.
I know that this is possible with TinySegmenter.jl, but is there a way to use this package with WordTokenizers?
@Ayushk4 has updated the README now with that example, which I think is a much better one, as RevTok.jl is unmaintained, and we also have our own built-in RevTok these days.
I don't see the updated README yet. If you are going to change it, could you use something like "ジュリアプログラミング言語が大好きです!"?
It translates to "I love the Julia programming language!" but also shows how it segments all three writing systems (it lumps hiragana words together, which might be an issue for some people).
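For reference, a README example along those lines could look something like the sketch below. This assumes TinySegmenter.jl is installed; the exact segmentation TinySegmenter would produce for this particular sentence has not been verified here, so no expected output is shown.

```julia
using WordTokenizers
import TinySegmenter

# Swap in TinySegmenter's tokenizer, as shown earlier in this thread.
set_tokenizer(TinySegmenter.tokenize)

# "I love the Julia programming language!" -- exercises all three
# Japanese writing systems (katakana, kanji, and hiragana).
tokenize("ジュリアプログラミング言語が大好きです!") |> print
```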
Hi @leios. You should be able to edit the top comment in this thread - click the 3 dots and then edit to update your checklist (via markdown). Let me know if you're stuck - and thanks for the review!
@will-rowe I don't have that option, only "copy link" and "quote reply"
hmmm - did you accept the invite at the top of the thread? Link here. If you have already done this, let me know and I'll have to ask someone as I'm out of ideas!
Sorry, I should have known that was a requirement. Filled in everything now. Thanks!
Hi @ninjin - just pinging you to see how your review is going? Thanks!
@will-rowe: Ouch, sorry. On it today. Sorry for being the slowest reviewer you have on the roster and thank you for the poke.
@leios: I would prefer: “日本語はひらがなとカタカナと漢字で表されます。” (“Japanese is expressed using hiragana, katakana, and kanji.”) over your example sentence. Sure, this one is way more formal, but yours somehow sounds a bit odd to me. Perhaps because it expresses something somewhat intimate in a formal way?
Repository:
Paper:
I do not believe I have ever seen normalisation as a part of tokenisation, thus I would remove: “Such word tokenization also often includes some normalizing, such as correcting unusual spellings or removing all punctuations.”.
It may be worth mentioning that “sentence splitting” is another term often used for “sentence segmentation”. For the record, I have never seen “sentence tokenisation” outside of NLTK, but I am happy to be corrected.
Non-mandatory (just comments, really):
It may be wise to consider including purely statistical tokenisation algorithms such as those used by BERT and friends. They are becoming more and more popular and I suspect they may be the standard very soon.
For some real competition in regards to performance, see this. It went public post-submission though.
All in all, a nice solid package. Well done!
Thanks for some excellent reviews @leios and @ninjin.
@oxinabox - can you please address @ninjin 's comments and let me know when you are ready for me to take a final look. We should be close to acceptance!
Repository:
No installation instructions as far as I can see. Although a bit silly for Julia, a single reference to the necessary command is enough.
Done.
Paper:
I do not believe I have ever seen normalisation as a part of tokenisation, thus I would remove: “Such word tokenization also often includes some normalizing, such as correcting unusual spellings or removing all punctuations.”.
Loosely speaking, I disagree, but:
I have removed that sentence, as it was overstated; it might be better to say "occasionally" or "sometimes". It doesn't belong right at the start of the paper anyway, but rather in the docstrings of the particular tokenizers. Arguably it's undesirable behaviour.
Examples:
The Twitter tokenizer (both ours and the original from NLTK) normalizes `waaaaayyyy` into `waaayyy`.
Apparently the Stanford tokenizer has an `americanize` flag which normalizes `colour` to `color`. (Though this really feels like it is stepping outside the bounds of normalizing during tokenization.)
On the more minor end, the original Penn tokenizer normalizes `"` (a double quote) to `''` (a pair of single quotes).
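Purely as an illustration (this is not WordTokenizers' actual implementation, and `shorten_elongations` and `penn_quotes` are hypothetical names), these kinds of normalisations amount to simple string rewrites:

```julia
# Hypothetical sketches of the normalisations described above.

# Collapse runs of a repeated character longer than 3 down to 3,
# as the Twitter tokenizer does: "waaaaayyyy" -> "waaayyy".
shorten_elongations(s) = replace(s, r"(.)\1{2,}" => s"\1\1\1")

# Rewrite double quotes as paired single quotes, as the
# original Penn tokenizer does.
penn_quotes(s) = replace(s, "\"" => "''")

shorten_elongations("waaaaayyyy")  # "waaayyy"
penn_quotes("he said \"hi\"")      # "he said ''hi''"
```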
It may be worth mentioning that “sentence splitting” is another term often used for “sentence segmentation”. For the record, I have never seen “sentence tokenisation” outside of NLTK, but I am happy to be corrected.
Done in https://github.com/JuliaText/WordTokenizers.jl/pull/45; my co-author will post here once merged.
Non-mandatory (just comments, really):
It may be wise to consider including purely statistical tokenisation algorithms such as those used by BERT and friends. They are becoming more and more popular and I suspect they may be the standard very soon.
Good idea, https://github.com/JuliaText/WordTokenizers.jl/issues/44
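For context on what such an algorithm involves: the core of BERT's WordPiece tokenisation is a greedy, longest-match-first lookup of subwords against a learned vocabulary. A toy sketch (the vocabulary below is made up for illustration, not BERT's real vocabulary):

```julia
# Toy greedy longest-match-first WordPiece-style segmentation.
# Continuation pieces are marked with "##", as in BERT.
function wordpiece(word, vocab)
    pieces = String[]
    start = 1
    while start <= lastindex(word)
        stop = lastindex(word)
        piece = nothing
        while stop >= start
            cand = word[start:stop]
            # Non-initial pieces carry the "##" continuation marker.
            cand = isempty(pieces) ? cand : "##" * cand
            if cand in vocab
                piece = cand
                break
            end
            stop = prevind(word, stop)
        end
        # No piece matched: the whole word is unknown.
        piece === nothing && return ["[UNK]"]
        push!(pieces, piece)
        start = nextind(word, stop)
    end
    return pieces
end

vocab = Set(["token", "##izer", "##s"])
wordpiece("tokenizers", vocab)  # ["token", "##izer", "##s"]
```

Real implementations learn the vocabulary from data (e.g. by byte-pair-encoding-style merges); only the segmentation step is sketched here.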
For some real competition in regards to performance, see this. It went public post-submission though.
I think that will indeed be significantly hard to beat, but we should see how we measure up: https://github.com/JuliaText/WordTokenizers.jl/issues/46. Beyond the scope of this submission, though.
All in all, a nice solid package. Well done!
Thanks
@oxinabox: If it said "occasionally", I would agree. I think there are alternative explanations for your examples, though. The first two are examples of canonicalisation that most likely have been integrated into the tokeniser because it is the first piece of code that touches the "raw" text. For PTB, these are necessary for the text to be consistent with the pre-processing that PTB did back in the 90s, as many parsers will choke if they see anything unexpected.
Thanks again for a solid package and I am 200% in support of https://github.com/JuliaText/WordTokenizers.jl/issues/44!
The changes have been made to the paper. Thanks, @ninjin for the suggestions.
@whedon generate pdf
Great! Thank you everyone. I'll give it a final read through now
@whedon check references
Reference check summary:
OK DOIs
- 10.1137/141000671 is OK
- 10.1609/aaai.v33i01.33016843 is OK
MISSING DOIs
- None
INVALID DOIs
- None
Okay - everything looks good to me. Please can you tag a final release and archive it - post the version number and DOI back here.
DOI: 10.5281/zenodo.3648713 Version 0.5.4
Thanks
Thanks @oxinabox. The author names on the archive should match the paper authors, please.
I am not sure what is controlling the names in the archive. All the people who are listed in the archive but not on the paper have made only small corrections to the spelling in the README or comments (bar one, who made a small fix to a corner case in the regex). These changes are appreciated, but I assume they fall below the contribution threshold for co-authorship on a JOSS paper. Perhaps I am wrong, though?
No - you can have the author list as you'd like it (i.e. as in your paper). I'm not sure why Zenodo has autofilled all of those authors. You can click edit on the Zenodo page and then manually remove the authors who are not on your paper.
You can use a `.zenodo.json` file to tell Zenodo what you want automatically, but it's not as easy as it should be - see https://github.com/zenodo/zenodo/issues/1606 for some discussion on this.
Ok, should be fixed
@whedon set 10.5281/zenodo.3648713 as archive
OK. 10.5281/zenodo.3648713 is the archive.
@whedon set v0.5.4 as version
OK. v0.5.4 is the version.
Thanks @oxinabox for a nice submission. And many thanks to @leios and @ninjin for your excellent reviews.
Pinging @openjournals/joss-eics to move this to acceptance
@oxinabox I'm taking over for the rest of your submission. Can you edit the Zenodo data a little more, so that the title matches that of your paper?
Done, thanks - good changes.
@oxinabox Ok also this part.
I thought I did so. I don't know why Zenodo is not showing the new title. Am I editing in the wrong place?
Submitting author: @oxinabox (Lyndon White)
Repository: https://github.com/JuliaText/WordTokenizers.jl/
Version: v0.5.4
Editor: @will-rowe
Reviewers: @leios, @ninjin
Archive: 10.5281/zenodo.3663390
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@leios & @ninjin, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @will-rowe know.
✨ Please try and complete your review in the next two weeks ✨
Review checklist for @leios
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
Review checklist for @ninjin
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper