
[REVIEW]: pyndl: Naïve Discriminative Learning in Python #4515

Closed · editorialbot closed this issue 1 year ago

editorialbot commented 2 years ago

Submitting author: @derNarr (Konstantin Sering)
Repository: https://github.com/quantling/pyndl
Branch with paper.md (empty if default branch): paper
Version: v1.1.1
Editor: @osorensen
Reviewers: @frankier, @jinhangjiang, @VenkteshV
Archive: 10.5281/zenodo.7410272

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/fc33a198af77e42e4f98ea80e94cb8d7"><img src="https://joss.theoj.org/papers/fc33a198af77e42e4f98ea80e94cb8d7/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/fc33a198af77e42e4f98ea80e94cb8d7/status.svg)](https://joss.theoj.org/papers/fc33a198af77e42e4f98ea80e94cb8d7)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@frankier & @jinhangjiang & @VenkteshV, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @osorensen know.

Please start on your review when you are able, and be sure to complete it within the next six weeks at the very latest.

Checklists

📝 Checklist for @frankier

📝 Checklist for @VenkteshV

📝 Checklist for @jinhangjiang

editorialbot commented 2 years ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 2 years ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.07 s (783.4 files/s, 126191.2 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          22           1125           1219           3321
reStructuredText                12            463            484            743
Cython                           5            129             81            568
TeX                              1             17              0            182
Markdown                         3             26              0            151
INI                              1              8              0             63
YAML                             5             10             11             60
JSON                             1              0              0             55
SQL                              1              0              0             51
make                             2              7              6             35
DOS Batch                        1              8              1             27
Bourne Shell                     1              2              1              6
-------------------------------------------------------------------------------
SUM:                            55           1795           1803           5262
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 2 years ago

Wordcount for paper.md is 999

editorialbot commented 2 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/s0022-2496(02)00016-0 is OK
- 10.1037/a0023851 is OK
- 10.1080/23273798.2021.1954207 is OK
- 10.1017/s0022226719000203 is OK
- 10.3389/fcomm.2020.00017 is OK
- 10.21437/interspeech.2018-2420 is OK
- 10.1111/stan.12134 is OK
- 10.1080/23273798.2020.1815813 is OK
- 10.1515/cog-2021-0006 is OK
- 10.1515/9783110292022-006 is OK
- 10.31234/osf.io/prvzq is OK

MISSING DOIs

- 10.21236/ad0241531 may be a valid DOI for title: Adaptive switching circuits
- 10.1037/0003-066x.43.3.151 may be a valid DOI for title: Pavlovian conditioning. It’s not what you think it is.

INVALID DOIs

- None
editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

derNarr commented 2 years ago

Added missing DOIs and checked that they lead to the right reference.

@editorialbot generate pdf

derNarr commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

derNarr commented 2 years ago

@editorialbot commands

editorialbot commented 2 years ago

Hello @derNarr, here are the things you can ask me to do:


# List all available commands
@editorialbot commands

# Get a list of all editors' GitHub handles
@editorialbot list editors

# Check the references of the paper for missing DOIs
@editorialbot check references

# Perform checks on the repository
@editorialbot check repository

# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist

# Set a value for branch
@editorialbot set joss-paper as branch

# Generates the pdf paper
@editorialbot generate pdf

# Get a link to the complete list of reviewers
@editorialbot list reviewers
derNarr commented 2 years ago

@editorialbot check references

editorialbot commented 2 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.21236/ad0241531 is OK
- 10.1016/s0022-2496(02)00016-0 is OK
- 10.1037/a0023851 is OK
- 10.1080/23273798.2021.1954207 is OK
- 10.1017/s0022226719000203 is OK
- 10.3389/fcomm.2020.00017 is OK
- 10.21437/interspeech.2018-2420 is OK
- 10.1111/stan.12134 is OK
- 10.1080/23273798.2020.1815813 is OK
- 10.1515/cog-2021-0006 is OK
- 10.1515/9783110292022-006 is OK
- 10.1037/0003-066x.43.3.151 is OK
- 10.31234/osf.io/prvzq is OK

MISSING DOIs

- None

INVALID DOIs

- None
VenkteshV commented 2 years ago

Review checklist for @VenkteshV

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

frankier commented 2 years ago

Review checklist for @frankier

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

osorensen commented 2 years ago

@jinhangjiang can you please update us on how it's going with your review?

jinhangjiang commented 2 years ago

Yeah, I will have it done by this Sunday night CST.


jinhangjiang commented 2 years ago

Review checklist for @jinhangjiang

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

jinhangjiang commented 2 years ago

Hi @osorensen, I have finished the first round of review. It looks like a mature, solid, and domain-specific Python package to me. The authors provide sufficient documentation to help users understand what it can achieve. The example code is reproducible on my end.

I only have two questions for the authors (@derNarr):

* First, based on what I read, the API only takes the data in a specific format. And it looks hard to scale up. Is it a normal convention in this field? Since one of the claimed contributions is to help study large corpora, it would be nice to explain a bit how well it can work with large text data.

* Second, since it can work with non-English text, it would be nice to give some showcases. Also, I was not able to locate a list of supported languages.

Please let me know what the next step is. Thanks!

osorensen commented 2 years ago

Thanks a lot for your review, @jinhangjiang. Since you have completed your checklist, you are done with your tasks now. However, I hope the author @derNarr can answer your questions here and, if possible, include your suggested improvements in the package.

osorensen commented 2 years ago

@VenkteshV, can you please update us on how it's going with your review?

derNarr commented 2 years ago
* First, based on what I read, the API only takes the data in a specific format. And it looks hard to scale up. Is it a normal convention in this field? Since one of the claimed contributions is to help study large corpora, it would be nice to explain a bit how well it can work with large text data.

The ndl.ndl learner expects the data in a tab-separated format: the first column defines the cues (inputs) and the second column defines the outcomes (outputs), with the underscore _ as the only special delimiting character within each column. This tab-separated file can (and should) be gzip-compressed (deflate). As the learning can be continued from previous weights, learning events can be split across multiple event files.
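To make this concrete, here is a minimal sketch of such an event file and of computing the weights from it (the exact keyword arguments of ndl.ndl may differ slightly between pyndl versions, and the file name and parameter values here are made up):

# Write a tiny gzipped event file in the format described above.
import gzip

from pyndl import ndl

with gzip.open("events.tab.gz", "wt", encoding="utf-8") as event_file:
    event_file.write("cues\toutcomes\n")            # header line
    event_file.write("#p_py_yn_nd_dl_l#\tpyndl\n")  # letter bigrams of "#pyndl#"
    event_file.write("#n_nd_dl_l#\tndl\n")          # letter bigrams of "#ndl#"

# alpha and betas are the Rescorla-Wagner learning-rate parameters.
weights = ndl.ndl(events="events.tab.gz", alpha=0.1, betas=(0.1, 0.1))
print(weights)  # an xarray.DataArray of outcome-by-cue association weights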

The advantage, and the origin, of this event file format is that it is extremely transparent to the researcher, easy to inspect, and that the time course of the learning events is defined by reading the file from top to bottom. The research in our group was always concerned with the ordering of learning events, and we found some evidence that a naturalistic learning order is superior to random learning in modeling human reaction times. In naturalistic learning events, the same cue (or outcome) usually reappears a lot within one phase of the learning, whereas it might not be present at all in the rest of the learning events. An example would be the cue/outcome "pyndl", which appears almost exclusively in this discussion and in research that uses this package, but you will not be exposed to the word "pyndl" randomly on the Internet.

To be honest, the data format was in use before I joined the group and wrote pyndl to compute the weight matrices from it. But I think it is extremely simple, allows for transparent use, should remain accessible and readable 20 to 50 years from now, and has been around for at least 10 years. Therefore, it has many good features and is only slightly bloated. With pyndl.corpus there exists some code to create event files from different sources like OpenSubtitles files or simple text files.

The data format works with 100,000,000 (100 million) learning events, and I see no reason why it should not scale up further. Compared to image, video, and audio data, text is small. @jinhangjiang If you still have this concern, can you specify why you believe it would not scale up? What data, and of what size, do you have in mind?
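Splitting the learning events across several files then looks roughly like this (a sketch: ndl.ndl accepts a previously computed weight matrix through its weights argument, although the exact signature may vary between versions, and the file names are hypothetical):

from pyndl import ndl

# Continue learning across multiple event files by feeding the weight
# matrix of the previous run back into the next one.
weights = None
for part in ("events_part1.tab.gz", "events_part2.tab.gz"):
    weights = ndl.ndl(events=part, alpha=0.1, betas=(0.1, 0.1),
                      weights=weights)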

* Second, since it can work with non-English text, it would be nice to give some showcases. Also, I was not able to locate a list of supported languages.

Everything that can be encoded as Unicode code points (UTF-8, UTF-16, ...) can be processed with pyndl. You are right that we should add a small example that uses Chinese, Greek, or Taiwanese symbols. I created an issue for that in pyndl: https://github.com/quantling/pyndl/issues/214

We stress this point so much in the paper because with the older implementations in the R programming language we regularly ran into problems with the Unicode processing, which was done with libicu, and therefore it was a big deal for us to be able to run the learning on all the languages we wanted. The only thing we still need to do is convert texts to UTF-8 or read them into Python 3.
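Converting a legacy-encoded corpus to UTF-8 before building event files needs nothing beyond the standard library; a sketch with made-up file names:

# Re-encode a Latin-1 corpus file as UTF-8.
with open("corpus_latin1.txt", encoding="latin-1") as src, \
     open("corpus_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(src.read())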

@jinhangjiang thank you for your questions and your review! Did I answer your questions? Do you suggest we change something, or do you have other ideas for improvement? Let me know as well if I should elaborate on any point.

jinhangjiang commented 2 years ago

@derNarr Thank you for answering my questions thoroughly. I really appreciate the details you shared. When I said I thought it was hard to scale up, I was thinking about how to prepare the inputs in something like a pandas/Spark dataframe, as I am more familiar with those frameworks. What you have explained makes sense to me. Since you have experimented with 100 million learning events, I would suggest sharing some examples of estimated execution time together with the corresponding data volume. If I were about to use the package in my research, I would want to know this information in advance.

Looking forward to the small examples in different languages such as Chinese. It would be very interesting to see how it works with characters instead of letters.

Thank you for all your work and clarifications. Best.

VenkteshV commented 2 years ago

> @VenkteshV, can you please update us on how it's going with your review?

Greetings. I am working on the same and will complete the first round of reviews by this week.

VenkteshV commented 2 years ago

> @VenkteshV, can you please update us on how it's going with your review?

> Greetings. I am working on the same and will complete the first round of reviews by this week.

Apologies. I got delayed. I am working on the same and will wrap it up in a few days.

frankier commented 2 years ago

I will start with a general review summary before prioritizing some of the GitHub issues (I have taken the liberty of including here some issues not originally filed by myself). I have expressed it in terms of what is needed for acceptance, but of course there is always room for discussion.

The library looks to be quite mature overall, and to have a clear potential target audience within the field of cognitive linguistics.

I think the statement of need could be a bit stronger. Some of the advantages mentioned such as maintainability are really advantages for the library author and not the library user.

A key advantage of using Python is interoperability with the Python ecosystem. However, the library does not exactly follow the conventions of any existing library, opting for its own custom file-based workflow. This may be fine for the purposes of this library, since it is in line with the expectations of the audience, but on the other hand much of the reusability and convenience in the Python ecosystem comes from people reusing certain idioms (e.g. those of sklearn, as mentioned in https://github.com/quantling/pyndl/issues/130). Perhaps you could also mention the usage of/cite xarray, since this seems to be a slightly higher-level package which is in use.

At a high level, I see there being 4 layers/strata in the project from the lowest:

  1. The NDL/WH rule, independent of the file format used for the events and the weights matrix (does not seem to be directly exposed/documented -- see the comments of @jinhangjiang and https://github.com/quantling/pyndl/issues/213)
    • Exposing/documenting this would be a "nice to have" and could create appeal for those outside the cognitive linguistics audience
  2. The NDL/WH rule as implemented in terms of the library's custom file formats
    • This is introduced in the documentation through a word form-semantics association (meaning recognition) example
  3. Event file generation routines which can convert "END OF DOCUMENT"-delimited corpora files into the event file format of Stratum 2
    • This is introduced in the documentation through a bigram-word form association (form recognition) example
  4. Convenience functions to convert from other corpora formats (currently only OpenSubtitles) to Stratum 3

I think these strata -- and therefore different ways of using the library/workflow -- should be outlined in the paper. At least the description of, and distinction between, Strata 2 and 3 should be made clear, since this is a general "what comes in"/"what comes out" description which should surely be included.

In the documentation too, the strata could be signposted earlier, so as to prevent confusion. Currently, Stratum 4 is only included in the API documentation and is therefore rather easy to miss. Strata 2 and 3 are rather easy to confuse at the moment: https://pyndl.readthedocs.io/en/latest/quickstart.html#correct-data-format deals with both. Here, the domain purpose of the example datasets should be explained at the beginning of each example. Additionally, Strata 2 and 3 should probably be introduced briefly at the beginning of the whole section before going into the examples, and "Correct Data Format" could be renamed to "Data Preparation". Exposing and documenting Stratum 1 may be a good longer-term goal.

The possibility of calling from other languages is mentioned in the paper; however, the bridge software is not mentioned by name. This should either be made clear or the sentence removed. I would favour the latter, since this is really true of any Python library (in my experience it turns out to be less convenient than switching languages or using multiple languages in the repository). It would only really be worth calling attention to if you had packaged or wrapped the library specifically for these other languages. There does seem to be an example at https://pyndl.readthedocs.io/en/latest/misc.html but it looks incomplete, and people might not read anything in a section entitled "misc".

In my attempts to test the functional claims of the paper, I came across https://github.com/quantling/pyndl/issues/220. This got me thinking about how well the package deals with Unicode in general. In particular, for things like extracting bigrams, the most reasonable approach will usually be to work with grapheme clusters. Note that these are different from code points, which is what you get if you work with Python 3/Unicode strings directly -- if you are working with code points, then you should probably switch to grapheme clusters. It might not really make sense to extract bigrams with e.g. Chinese characters as whole grapheme clusters, but rather to decompose them into radicals. I'm not sure what to do about this except to document what you are doing at the moment, or to add some kind of caveats section to the documentation and note in the paper that multilinguality is a bit of an open box, underlining that support for ideograms / pictograms / logograms is not really there. I do think this is somewhat important given the emphasis on multilinguality in the paper.
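To illustrate the distinction, here is a small sketch using the third-party regex module, whose \X pattern matches one grapheme cluster (this is not pyndl code, and the example word is arbitrary):

import regex

word = "na\u0304"                  # "nā" with a combining macron (NFD)
print(list(word))                  # 3 code points: 'n', 'a', '\u0304'
print(regex.findall(r"\X", word))  # 2 grapheme clusters: 'n', 'ā'

# Bigrams over raw code points strand the combining mark ...
print([word[i:i + 2] for i in range(len(word) - 1)])    # ['na', 'ā']
# ... whereas bigrams over grapheme clusters keep it attached.
clusters = regex.findall(r"\X", word)
print([a + b for a, b in zip(clusters, clusters[1:])])  # ['nā']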

The language of the paper is good but could still be revised a little more. If it is possible to get someone outside the work or even the field to go over it, this usually improves clarity. Here are some things I noticed:

Issues I believe should be addressed before publication

  1. At least one of https://github.com/quantling/pyndl/issues/211 and https://github.com/quantling/pyndl/issues/212
  2. https://github.com/quantling/pyndl/issues/208
  3. https://github.com/quantling/pyndl/issues/217
  4. https://github.com/quantling/pyndl/issues/219
  5. Putting some minimal benchmark in the paper (See: https://github.com/quantling/pyndl/issues/215 )
  6. Revising paper as mentioned above
  7. https://github.com/quantling/pyndl/issues/220
  8. Making sure you are working with grapheme clusters
  9. Noting lack of support for ideograms / pictograms / logograms
  10. Cite xarray

Issues which would ideally be addressed reasonably quickly but need not necessarily block publication

  1. Putting benchmarks in the documentation (See: https://github.com/quantling/pyndl/issues/215 )
  2. https://github.com/quantling/pyndl/issues/218

Long-term/roadmap issues

  1. https://github.com/quantling/pyndl/issues/130
  2. https://github.com/quantling/pyndl/issues/213
VenkteshV commented 2 years ago

> @VenkteshV, can you please update us on how it's going with your review?

> Greetings. I am working on the same and will complete the first round of reviews by this week.

> Apologies. I got delayed. I am working on the same and will wrap it up in a few days.

I am nearly done. I will submit my review by the 10th of August.

derNarr commented 2 years ago

@frankier thank you for your review and your good feedback. We are addressing some of your suggestions right now and might reach out to you for clarification on the others once we have finished implementing the first round of improvements. Especially on points 8 and 9, regarding the grapheme clusters and the support for ideograms / pictograms / logograms, I would be happy about some more suggestions and clarifications. I will reach out to you regarding these points after we have implemented the low-hanging improvements.

VenkteshV commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

VenkteshV commented 2 years ago

I am finishing up my reviews. I have some recommendations for authors (minor) on the writing part:

1) You could consider making the statement of need a bit more detailed. The summary stresses the non-English support component; however, I find this missing in the statement of need. More emphasis could also be put on the size of corpora the algorithm can handle.

2) You could also consider splitting the second paragraph into listing existing software and stating the clear differences between pyndl and the others. @derNarr

osorensen commented 2 years ago

Thanks a lot @VenkteshV! Could you please also update your checklist?

VenkteshV commented 2 years ago

> Thanks a lot @VenkteshV! Could you please also update your checklist?

Yes. Apologies for the delay. I will do that today. I have some reviews regarding the library too; I am just revising them and will update the checklist and my reviews accordingly. Thank you.

osorensen commented 2 years ago

@derNarr, could you please give us an approximate timeline for when the issues raised by the reviewers will be addressed? It is quite all right if some of this takes time, but if so, please keep us updated regularly in this thread.

derNarr commented 2 years ago

@osorensen, I am currently writing up my PhD thesis and will submit it in mid-September. Therefore, I will address the reviewers' issues at the end of September / beginning of October. I had missed that @VenkteshV's evaluation of the code is in issue https://github.com/quantling/pyndl/issues/226 and was still waiting for it.

Thanks to all of you for the good suggestions and the reviews. We have already started addressing some of the issues raised, and I will write some prose here when we think we have made progress on them.

osorensen commented 2 years ago

Thanks @derNarr. This sounds good. Best of luck with your dissertation.

derNarr commented 2 years ago

@osorensen short update: I have to make some (small) corrections to my dissertation and was sick for two weeks in October. Therefore, I will start working on the review at the beginning of November. I will then check which parts we have already addressed by then and give another update here.

osorensen commented 2 years ago

That's fine @derNarr. Thanks a lot for updating us.

derNarr commented 1 year ago

Dear @osorensen, dear @frankier, dear @VenkteshV , dear @jinhangjiang ,

we have now addressed all of the urgent issues and improved the paper and the package documentation according to your comments. We especially oriented ourselves along this comment.

The documentation of pyndl is substantially improved, we added a small benchmark and the code to run it, and we made clearer that pyndl is mainly about a learning algorithm, but that the Unicode support in Python 3/Cython is a major advantage over the difficulties of properly processing Unicode in R/C++. The claim of "multi-language" support comes mainly from this fact, as it was too painful for most quantitative linguists to debug their code in R/C++ with ndl2. (Sometimes it worked flawlessly even in R, though, which might have depended on the installed locales of the machine.)

We added a citation to xarray and sanitized the paper to use US-style English and proper idiomatic expressions. Furthermore, we clarified that the main focus is on grapheme clusters, but that other input formats, even speech audio recordings, can be processed with pyndl, though only with custom preprocessing pipelines. We believe the same is true for ideograms / pictograms / logograms.

In the acknowledgements, we added a sentence thanking the reviewers and acknowledging the good review process (@frankier, @VenkteshV, @jinhangjiang, are you okay with this?):

Finally, this paper and the associated package benefited from constructive and conscientious peer review. We would like to thank our three reviewers, Venktesh V, Jinhang Jiang, and especially Frankie Robertson, for their constructive and in-depth feedback and their suggestions on how to make the package more user-friendly.

During our clean-up, we renamed the master branch to main and renamed the folder containing the documentation from doc to docs. We published a new version of pyndl, version 1.0.0, which now supplies a wheel for amd64 Linux systems.

How should we proceed now?

Best regards,

David-Elias, Marc, Elnaz, and Tino

derNarr commented 1 year ago

I edited the post; there was an old version of the quote in the original message.

jinhangjiang commented 1 year ago

Dear @derNarr,

Thank you for completing the draft and fixing the issues accordingly. The acknowledgment looks good to me. It is greatly appreciated. I am glad you found my input useful.

Also, thank you again for creating this Python package to help the community.

Best, Jinhang

frankier commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

frankier commented 1 year ago

@editorialbot check references @editorialbot check repository

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.5334/jors.148 is OK
- 10.21236/ad0241531 is OK
- 10.1016/s0022-2496(02)00016-0 is OK
- 10.1037/a0023851 is OK
- 10.1080/23273798.2021.1954207 is OK
- 10.1017/s0022226719000203 is OK
- 10.3389/fcomm.2020.00017 is OK
- 10.21437/interspeech.2018-2420 is OK
- 10.1111/stan.12134 is OK
- 10.1080/23273798.2020.1815813 is OK
- 10.1515/cog-2021-0006 is OK
- 10.1515/9783110292022-006 is OK
- 10.1371/journal.pone.0174623 is OK
- 10.1037/0003-066x.43.3.151 is OK
- 10.31234/osf.io/prvzq is OK

MISSING DOIs

- None

INVALID DOIs

- None
danielskatz commented 1 year ago

FYI, @editorialbot only sees a command if it is the first thing in a comment; thus only one command can be issued per comment.

frankier commented 1 year ago

@editorialbot check repository

(Thanks for the heads-up)

editorialbot commented 1 year ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.14 s (440.1 files/s, 69176.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          25           1167           1226           3472
reStructuredText                15            541            589            921
Cython                           5            129             81            568
TeX                              1             19              0            214
Markdown                         3             33              0            193
YAML                             5             23             18            106
TOML                             1             13              0             60
JSON                             1              0              0             58
SQL                              1              0              0             51
R                                1              6              3             32
DOS Batch                        1              8              1             27
make                             1              4              6             10
Bourne Shell                     1              2              1              6
-------------------------------------------------------------------------------
SUM:                            61           1945           1925           5718
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 1 year ago

Wordcount for paper.md is 1423

frankier commented 1 year ago

Okay! I think the paper is probably a bit too long by JOSS standards now, which are 500-1000 words. (I think a little bit over 1000 words is okay, but it seems like it's probably too much now.) If it were possible to fit things on two pages, that might be ideal. I suppose this is an editorial issue, so it might be beyond my remit, and I am not quite sure how strictly this is enforced.

It's possible the plot could be made a bit smaller, which may help with reducing the number of pages (but not the number of words). It might be possible to fit it on a single pane, given that there need only be: ndl (1 job), ndl2 (1 job), ndl2 (2 jobs), pyndl (1 job), pyndl (2 jobs, OpenMP), and pyndl (2 jobs, threading). You may need to move the legend outside the plotting area, but it might still save a tiny bit of space.
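For illustration, something along these lines might produce a compact single pane (a sketch only; the timing numbers are placeholders, not real measurements):

import matplotlib.pyplot as plt

# Placeholder data: wall-clock seconds per number of learning events.
n_events = [1e5, 1e6, 1e7]
timings = {
    "ndl (1 job)": [4.0, 40.0, 400.0],
    "ndl2 (2 jobs)": [2.0, 21.0, 210.0],
    "pyndl (2 jobs, OpenMP)": [1.5, 16.0, 170.0],
}

fig, ax = plt.subplots(figsize=(4, 2.5))  # one compact pane
for label, seconds in timings.items():
    ax.plot(n_events, seconds, marker="o", label=label)
ax.set_xscale("log")
ax.set_xlabel("number of learning events")
ax.set_ylabel("wall-clock time (s)")
# Legend outside the plotting area, as suggested above.
ax.legend(loc="center left", bbox_to_anchor=(1.02, 0.5), frameon=False)
fig.savefig("benchmark.pdf", bbox_inches="tight")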

The acknowledgement is appreciated, but if shortening or removing it helps you get to JOSS editorial standards, that is fine by me.

I agree that everything in my review has been addressed, and so I recommend that it be accepted, possibly pending some edits to the paper for length.