openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
703 stars 36 forks source link

[REVIEW]: LAS: an integrated language analysis tool for multiple languages #35

Closed whedon closed 7 years ago

whedon commented 8 years ago

Submitting author: @jiemakel (Eetu Mäkelä) Repository: https://github.com/jiemakel/las Version: v1.4.8 Editor: @arfon Reviewer: @desilinguist Archive: https://dx.doi.org/10.5281/zenodo.160256

Status

status

Status badge code:

HTML: <a href="http://joss.theoj.org/papers/87f0093e30a01b1078d8dea41476a08a"><img src="http://joss.theoj.org/papers/87f0093e30a01b1078d8dea41476a08a/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/87f0093e30a01b1078d8dea41476a08a/status.svg)](http://joss.theoj.org/papers/87f0093e30a01b1078d8dea41476a08a)

Reviewer questions

Conflict of interest

Paper PDF: 10.21105.joss.00035.pdf

arfon commented 8 years ago

/ cc @openjournals/joss-reviewers - would anyone be willing to review this submission?

If you would like to review this submission then please comment on this thread so that others know you're doing a review (so as not to duplicate effort). Something as simple as :hand: I am reviewing this will suffice.

Reviewer instructions

Any questions, please ask for help by commenting on this issue! 🚀

desilinguist commented 8 years ago

I'm happy to review this.

desilinguist commented 8 years ago

I'm teaching a class at a linguistics summer school all next week so I won't be able to get to this until the week after. I'm sorry for the delay and happy to yield to another reviewer if they want to review it in the meantime.

desilinguist commented 8 years ago

Installation

There's no specific installation section just a section called "Running". It would be nice if we could just rename that section to "Installation" even though there's nothing really to install since that's what people generally look for. And it would be nice to have that section contain the actual steps needed to run LAS on the example referenced later.

Another point is that the las binary that you download from the releases page needs to be chmod'ed 755 before it can run at least on my mac.

Functionality

I downloaded the file and tested on some Hindi text and it seems to work fine. However, as an aside, the web service inaccurately detects the same text as German for some reason (even though the underlying language detector says hi (denoting Hindi) but for some reason it gets over-ruled by the language recognizer?

Update: Actually, I just noticed that the command line tool created .language file which also misidentifies the Hindi text as German.

Another problem I had was that when I called las recognize test.txt on my Hindi source file instead of las identify test.txt, I get the following exception.

10:15:03.846 [main] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[hi:0.8542185083632905], DetectedLanguage[mr:0.14570001916332004]]
Exception in thread "main" java.lang.IllegalArgumentException: in must not be null!
    at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:179)
    at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:125)
    at fi.seco.lexical.combined.CombinedLexicalAnalysisService.getTokenizer(CombinedLexicalAnalysisService.java:125)
    at fi.seco.lexical.combined.CombinedLexicalAnalysisService.recognize(CombinedLexicalAnalysisService.java:617)
    at LASCommandLineTool$.recognize(LASCommandLineTool.scala:299)
    at LASCommandLineTool$$anonfun$main$9$$anonfun$apply$6.apply(LASCommandLineTool.scala:219)
    at LASCommandLineTool$$anonfun$main$9$$anonfun$apply$6.apply(LASCommandLineTool.scala:218)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at LASCommandLineTool$$anonfun$main$9.apply(LASCommandLineTool.scala:218)
    at LASCommandLineTool$$anonfun$main$9.apply(LASCommandLineTool.scala:209)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at LASCommandLineTool$.main(LASCommandLineTool.scala:209)
    at LASCommandLineTool.main(LASCommandLineTool.scala)

Is it simply because the LAS recognizer doesn't actually support Hindi (according to the help message)? If so, a more interpretable exception might be more useful here.

Performance

Although there are no specific claims made about the performance, the README does seem to imply that re-running after the first time should be much faster. I didn't really experience that. For example, running las identify on the same single sentence text file took about 25-30 seconds no matter how many times I re-ran it.

Documentation

This section seems lacking in general.

There isn't really a statement of need. It would be nice to have something that explains what the tool is useful for and also all of the modes of analysis it provides (recognition, identification etc.).

The package is self-contained so there's no need for dependencies, I guess. However, there's no information for people who are interested in contributing. There also don't seem to be any automated tests to run.

There's an example which is useful but it seems quite contrived. It would be nice to have actual examples for all of the analysis modes.

Software paper

The paper actually does have actual text motivating the need for this software. This text can be used to add the statement of need to the repository README.

Recommendation

Accept with minor changes. I think a more detailed README that explains the motivation, has better examples, and better installation/run instructions is all this package needs. It should also emphasize more prominently the fact that it's a wrapper around other existing projects.

pjotrp commented 8 years ago

This software is essentially a 300+ line script that wraps existing Java tools. I would like to ask the author to confirm here that he thinks this software is a valuable contribution to research. I want to point out that once published this work will be visible for a long time. Adding an example of having used this tool in research would be a good idea.

jiemakel commented 8 years ago

From @desilinguist:

Installation

I've reworded the section to be "Installation and running".

However, as an aside, the web service inaccurately detects the same text as German for some reason (even though the underlying language detector says hi (denoting Hindi) but for some reason it gets over-ruled by the language recognizer?

The tool employs a simple voting mechanism to tally scores between recognizers. In this case, probably because the other tools don't understand Hindi, it gets outvoted by the poorer best guesses of the other methods. (For such cases, the results of the individual recognizers are included in the results, so you can add your own post-processing selector that is better tuned to your usage scenario.)

Another problem I had was that when I called las recognize test.txt on my Hindi source file instead of las identify test.txt, I get the following exception.

This is because recognize is actually a separate (and previously undocumented) function that counts the number of words a particular language processor recognizes, and it doesn't support Hindi. However, you also encountered a bug that caused an exception to be thrown instead of handling this in a better manner, which I've now fixed.

The reason for this functionality is so I could estimate the number of OCR errors in automatically scanned historical newspapers. I've added a section to the README to document this.

Although there are no specific claims made about the performance, the README does seem to imply that re-running after the first time should be much faster. I didn't really experience that.

This was a miscommunication. What the README means to say is that at each run, initial load up takes a significant time, so one should configure their workflow so that as much data as possible is funneled to the tool in a single run (e.g. giving the tool multiple files to process on the command line instead of running the tool once for every file). I've tried to make this more clear in the README.

Documentation

I've copied into the README the statement of need as well as the note about it being first and foremost a wrapper from the paper.

Regarding the examples, while they are contrived, I do want to keep them short for the README, and a full worked-through example for all the different modes of operation would in my opinion be too much. What I've done is added a second simple example for each mode of operation that isn't the one Finnish example (the Finnish ones being both important and already understandable for the primary audience of the tool).

I've also added additional information on how to run each mode to their respective sections.

The above changes are visible in this commit.

From @pjotrp:

This software is essentially a 300+ line script that wraps existing Java tools.

While I do think the worth of the tool is precisely in packaging existing tools together and tuning them to work with each other, it actually isn't just 300 lines. I submitted the (repository containing the) command line version, because I think it is the one most usable by the intended audience. 5500 more lines of code are contained in the seco-lexicalanalysis repository, which contains the code common to this command line version and the web service version (seco-lexicalanalysis-play). They in turn refer to seco-hfst, which has 1500 additional LOC.

In addition, as the README and paper both point out, while I view the integration in general useful, the real core contribution of the work is in the in-depth work on integrating and expanding the Finnish pipeline included in the tool, where for example the diff of our omorfi fork is ~12000 lines against the base version.

Adding an example of having used this tool in research would be a good idea.

The second paragraph of the paper already deals with this, referring to three academic articles that use the tool. The reason I submitted the tool in the first place was also that I had a request from an outside user, who wanted to know how they could refer to the tool in their own article. So, to state it explicitly, I do think this is a valuable contribution to research.

desilinguist commented 8 years ago

@jiemakel thanks so much for addressing most of my comments! I think the changes you have made are quite useful. There are still no automated tests and still not information on how a new developer can contribute. I think some tests should be added but I am not going to make that issue a blocker for publication. It would be nice for the README to contain a little blurb about how one would go about contributing as a developer etc.

As far as the utility of the tool goes, as someone who works in the NLP (natural language processing) field, I think las is definitely going to be useful to people in the field.

pjotrp commented 8 years ago

@jiemakel thanks for the clarification. If you have not done so, could you add some pointers to the other work in the paper or the README, so it is clear that this is a major effort?

jiemakel commented 8 years ago

I've now added a "Contributing" section to the README, which additionally links to the other github repositories whose code is used here.

Regarding automated tests, I do agree they'd start to be useful now that the tool is seeing wider use. I'll start adding them as I do further work, but right now don't have the time to draft a comprehensive battery from the start.

jiemakel commented 8 years ago

@desilinguist Could this be moved forward?

arfon commented 7 years ago

@desilinguist - This looks like this is in reasonable shape to accept. Can you confirm?

desilinguist commented 7 years ago

I'm happy to accept this. Sorry for the delay.

arfon commented 7 years ago

I'm happy to accept this. Sorry for the delay.

✨ thanks for the update @desilinguist.

@jiemakel - at this point could you make an archive of the reviewed software in Zenodo/figshare/other service and update this thread with the DOI of the archive? I can then move forward with accepting the submission.

jiemakel commented 7 years ago

The DOI for the latest version is: 10.5281/zenodo.160256

arfon commented 7 years ago

Thanks again for the review @desilinguist!

@jiemakel your paper is now accepted into JOSS. Your DOI is http://dx.doi.org/10.21105/joss.00035 🎉 🚀 💥