Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.
For a list of things I can do to help you, just type:
@editorialbot commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@editorialbot generate pdf
Software report:
github.com/AlDanial/cloc v 1.88 T=0.07 s (783.4 files/s, 126191.2 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Python 22 1125 1219 3321
reStructuredText 12 463 484 743
Cython 5 129 81 568
TeX 1 17 0 182
Markdown 3 26 0 151
INI 1 8 0 63
YAML 5 10 11 60
JSON 1 0 0 55
SQL 1 0 0 51
make 2 7 6 35
DOS Batch 1 8 1 27
Bourne Shell 1 2 1 6
-------------------------------------------------------------------------------
SUM: 55 1795 1803 5262
-------------------------------------------------------------------------------
gitinspector failed to run statistical information for the repository
Wordcount for paper.md is 999
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1016/s0022-2496(02)00016-0 is OK
- 10.1037/a0023851 is OK
- 10.1080/23273798.2021.1954207 is OK
- 10.1017/s0022226719000203 is OK
- 10.3389/fcomm.2020.00017 is OK
- 10.21437/interspeech.2018-2420 is OK
- 10.1111/stan.12134 is OK
- 10.1080/23273798.2020.1815813 is OK
- 10.1515/cog-2021-0006 is OK
- 10.1515/9783110292022-006 is OK
- 10.31234/osf.io/prvzq is OK
MISSING DOIs
- 10.21236/ad0241531 may be a valid DOI for title: Adaptive switching circuits
- 10.1037/0003-066x.43.3.151 may be a valid DOI for title: Pavlovian conditioning. It’s not what you think it is.
INVALID DOIs
- None
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Added missing DOIs and checked that they lead to the right reference.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot commands
Hello @derNarr, here are the things you can ask me to do:
# List all available commands
@editorialbot commands
# Get a list of all editors' GitHub handles
@editorialbot list editors
# Check the references of the paper for missing DOIs
@editorialbot check references
# Perform checks on the repository
@editorialbot check repository
# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist
# Set a value for branch
@editorialbot set joss-paper as branch
# Generates the pdf paper
@editorialbot generate pdf
# Get a link to the complete list of reviewers
@editorialbot list reviewers
@editorialbot check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.21236/ad0241531 is OK
- 10.1016/s0022-2496(02)00016-0 is OK
- 10.1037/a0023851 is OK
- 10.1080/23273798.2021.1954207 is OK
- 10.1017/s0022226719000203 is OK
- 10.3389/fcomm.2020.00017 is OK
- 10.21437/interspeech.2018-2420 is OK
- 10.1111/stan.12134 is OK
- 10.1080/23273798.2020.1815813 is OK
- 10.1515/cog-2021-0006 is OK
- 10.1515/9783110292022-006 is OK
- 10.1037/0003-066x.43.3.151 is OK
- 10.31234/osf.io/prvzq is OK
MISSING DOIs
- None
INVALID DOIs
- None
@jinhangjiang can you please update us on how it's going with your review?
Yeah, I will have it done by this Sunday night CST.
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Hi @osorensen, I have finished the first round of reviews. It looks like a mature, solid, and domain-specific python package to me. The authors provide sufficient documentation to help users understand what it can achieve. The example code is reproducible on my end.
I only have two questions for the authors (@derNarr):
First, based on what I read, the API only takes the data in a specific format. And it looks hard to scale up. Is it a normal convention in this field? Since one of the claimed contributions is to help study large corpora, it would be nice to explain a bit how well it can work with large text data.
Second, since it can work with non-English text, it would be nice to give some showcases. Also, I was not able to locate a list of supported languages.
Please let me know what the next step is. Thanks!
Thanks a lot for your review, @jinhangjiang. Since you have completed your checklist, you are done with your tasks now. However, I hope the author @derNarr can answer your questions here and, if possible, include your suggested improvements in the package.
@VenkteshV, can you please update us on how it's going with your review?
* First, based on what I read, the API only takes the data in a specific format. And it looks hard to scale up. Is it a normal convention in this field? Since one of the claimed contributions is to help study large corpora, it would be nice to explain a bit how well it can work with large text data.
The `ndl.ndl` learner expects the data in a tab-separated format where the first column defines the cues (inputs) and the only special delimiting character is the underscore `_`. In the second column, the outcomes (outputs) are defined in the same way. This tab-separated file can/should be gzipped (deflate-compressed). Since learning can be continued, learning events can be split across multiple event files.
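To make this concrete, here is a minimal sketch of creating such an event file from Python. The file name, cues, and outcomes are made up for illustration, and the two-column header line follows the layout shown in the pyndl documentation:

```python
# Minimal sketch of the event file format described above. The file
# name, cues, and outcomes are hypothetical; the header line follows
# the two-column layout from the pyndl docs.
import gzip

lines = [
    "cues\toutcomes",          # header: first column cues, second outcomes
    "h_a_n_d\thand",           # event 1: letter cues -> outcome 'hand'
    "h_a_n_d_s\thand_plural",  # event 2: multiple outcomes joined by '_'
]

with gzip.open("events.tab.gz", "wt", encoding="utf-8") as event_file:
    event_file.write("\n".join(lines) + "\n")
```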
The advantage, and the origin, of this event file format is that it is extremely transparent to the researcher, easy to inspect, and that the time course of the learning events is defined by going from top to bottom in the file. The research in our group was always concerned with the ordering of learning events and found some evidence that a naturalistic learning order is superior to a random order in modeling human reaction times. In naturalistic learning events, the same cue (or outcome) usually reappears a lot in one phase of the learning, whereas it might not be present at all in the rest of the learning events. An example would be the cue/outcome "pyndl", which appears nearly exclusively in this discussion and in research that uses this package, but you will not be exposed to the word "pyndl" randomly on the Internet.
To be honest, the data format was in use before I joined the group and wrote pyndl to compute the weight matrices from it. But I think it is extremely simple, allows for transparent use, should still be accessible and readable 20 to 50 years from now, and has been around for at least 10 years. Therefore, it has many good features and is only slightly bloated. With `pyndl.corpus` there exists some code to create event files from different sources such as OpenSubtitles files or simple text files.
The data format works with 100,000,000 (100 million) learning events, and I don't see a reason why it should not scale up further. Compared to image, video, and audio data, text is small. @jinhangjiang, if you still have this concern, can you specify why you believe it would not scale up? What data do you have in mind, and of which sizes?
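For very long event streams, learning can be continued across multiple event files, roughly like this (a sketch only; it assumes the `weights` keyword of `pyndl.ndl.ndl`, and the file names are hypothetical):

```python
# Sketch: continue learning over multiple event files. Assumes the
# 'weights' keyword of pyndl.ndl.ndl; file names are hypothetical.
from pyndl import ndl

weights = None  # start with no prior weights
for part in ("events_part1.tab.gz", "events_part2.tab.gz"):
    weights = ndl.ndl(events=part, alpha=0.1, betas=(0.1, 0.1),
                      weights=weights)  # continue from previous weights
```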
* Second, since it can work with non-English text, it would be nice to give some showcases. Also, I was not able to locate a list of supported languages.
Everything that can be encoded into letters with Unicode code points (utf-8, utf-16, ...) can be processed with `pyndl`. You are right that we should add a small example that uses Chinese, Greek, or Taiwanese symbols. I created an issue for that in pyndl: https://github.com/quantling/pyndl/issues/214
We stress this point so much in the paper because with the older implementations in the R programming language we regularly ran into problems with the Unicode processing, which was done with libicu, and therefore it was a big deal for us to be able to run the learning on all the languages we wanted. The only thing we still needed to do was to convert texts to utf-8 or read them into Python 3.
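As a small illustration of this point (a sketch only: the event file and its Greek cues are made up, and the call assumes the documented `pyndl.ndl.ndl` interface):

```python
# Sketch: Greek grapheme cues work like any other Unicode text.
# The event file is hypothetical; the call assumes the documented
# pyndl.ndl.ndl interface.
import gzip
from pyndl import ndl

with gzip.open("greek_events.tab.gz", "wt", encoding="utf-8") as f:
    f.write("cues\toutcomes\n")
    f.write("χ_έ_ρ_ι\thand\n")  # Greek letter cues -> outcome 'hand'

weights = ndl.ndl(events="greek_events.tab.gz",
                  alpha=0.1, betas=(0.1, 0.1), method="threading")
print(weights)  # xarray.DataArray: outcomes x cues
```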
@jinhangjiang thank you for your questions and your review! Did I answer your questions? Do you suggest that we should change something, or do you have other ideas for improvements? Let me know as well if I should elaborate on one point or another.
@derNarr Thank you for answering my questions thoroughly. I really appreciate the details you shared. When I said I thought it was hard to scale up, I was thinking about how to prepare the inputs in something like a pandas/Spark dataframe, as I am more familiar with those frameworks. What you have explained makes sense to me. As you have experimented with 100 million learning events, I would suggest sharing some examples of estimated execution time with the corresponding data volume. If I were about to use the package in my research, I would be interested in knowing this information in advance.
Looking forward to the small examples in different languages such as Chinese. It would be very interesting to see how it works with characters instead of letters.
Thank you for all your work and clarifications. Best.
@VenkteshV, can you please update us on how it's going with your review?
Greetings. I am working on the same and will complete the first round of reviews by this week.
Apologies. I got delayed. I am working on the same and will wrap it up in a few days.
I will start with a general review summary before prioritizing some of the GitHub issues (I have taken the liberty of including some issues not originally filed by myself here). I have expressed it in terms of what is needed for acceptance, but of course there is always room for discussion.
The library looks to be quite mature overall and has a clear potential target audience within the field of cognitive linguistics.
I think the statement of need could be a bit stronger. Some of the advantages mentioned, such as maintainability, are really advantages for the library author and not the library user.
A key advantage of using Python is interoperability with the Python ecosystem; however, the library does not exactly follow the conventions of any existing library, opting for its own custom file-based workflow. This may be fine for the purposes of this library, since it is in line with the expectations of the audience, but on the other hand much of the reusability and convenience in the Python ecosystem comes from people reusing certain idioms (e.g. that of sklearn, as mentioned here: https://github.com/quantling/pyndl/issues/130 ). Perhaps you could also mention the usage of/cite xarray, since this seems to be a slightly higher-level package which is in use.
At a high level, I see there being 4 layers/strata in the project from the lowest:
I think these strata -- and therefore different ways of using the library/workflow -- should be outlined in the paper. At least the description of, and distinction between, Strata 2 and 3 should be made clear, since this is a general "what comes in"/"what comes out" description which should surely be included.
In the documentation too, the strata could be sign-posted earlier, so as to prevent confusion. Currently, Stratum 4 is only included in the API documentation, and therefore rather easy to miss. Strata 2 and 3 are rather easy to confuse at the moment. https://pyndl.readthedocs.io/en/latest/quickstart.html#correct-data-format deals with both. Here, the domain purpose of the example datasets should be explained at the beginning of each example. Additionally, Strata 2 and 3 should probably be introduced briefly at the beginning of the whole page before going into the examples. "Correct Data Format" => "Data Preparation". Exposing and documenting Stratum 1 may be a good longer-term goal.
The possibility of calling from other languages is mentioned in the paper, however the bridge software is not mentioned by name. This should either be made clear or the sentence removed. I would favour the latter since this is really true of any Python library (in my experience it turns out to be less convenient than switching languages or using multiple languages in the repository). It would only really be worth calling attention to if you had packaged or wrapped the library specifically for these other languages. There does seem to be an example here https://pyndl.readthedocs.io/en/latest/misc.html but it looks incomplete, and people might not read anything in a section entitled "misc".
In my attempts to test the functional claims of the paper, I came across this issue: https://github.com/quantling/pyndl/issues/220 . This got me thinking about how well the package deals with Unicode in general. In particular, with things like extracting bigrams, the most reasonable thing to do may usually be to work with grapheme clusters. Note this is different from code points, which is what you will get if you work with Python 3/unicode strings directly -- if you are working with code points then you should probably switch to grapheme clusters. It might not really make sense to extract bigrams with e.g. Chinese characters as whole grapheme clusters, but rather to decompose them into radicals. I'm not sure what to do about this except to document what you are doing at the moment, or add some kind of caveats section to the documentation and note that multilinguality is a bit of an open box in the paper, underlining that support for ideograms / pictograms / logograms is not really there. I do think this is somewhat important given the emphasis on multilinguality in the paper.
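To illustrate the distinction (a standalone sketch, not pyndl code; it assumes the third-party `regex` module, whose `\X` pattern matches extended grapheme clusters):

```python
# Code points vs. grapheme clusters: bigrams over code points can
# split a base letter from its combining accent. Uses the third-party
# 'regex' module (pip install regex); \X matches grapheme clusters.
import regex

word = "ne\u0301e"                     # 'née' with a combining acute accent
code_points = list(word)               # ['n', 'e', '́', 'e'] -> 4 units
clusters = regex.findall(r"\X", word)  # ['n', 'é', 'e']     -> 3 units

print([a + b for a, b in zip(code_points, code_points[1:])])  # accent split off
print([a + b for a, b in zip(clusters, clusters[1:])])        # accent kept intact
```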
The language of the paper is good but could still be revised a little more. If it is possible to get someone outside the work or even the field to go over it, this usually improves clarity. Here are some things I noticed:
@VenkteshV, can you please update us on how it's going with your review?
I am nearly done. Will submit my reviews by 10th of August.
@frankier thank you for your review and your good feedback. We are addressing some of your suggestions right now and might reach out to you for clarification on the others once we have finished implementing the first round of improvements. Especially on points 8 and 9, regarding the grapheme clusters and the support for ideograms / pictograms / logograms, I would be happy about some more suggestions and clarifications. I will reach out to you regarding these points after we have implemented the low-hanging improvements.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
I am finishing up my reviews. I have some minor recommendations for the authors on the writing part:
1) You could consider making the statement of need a bit more detailed. The summary stresses the non-English support component; however, I find this missing in the statement of need. More emphasis could also be placed on the size of corpora the algorithm can handle.
2) You could also consider splitting the second paragraph into listing existing software and stating the clear differences between pyndl and the others. @derNarr
Thanks a lot @VenkteshV! Could you please also update your checklist?
Yes. Apologies for the delay. Will do them by today. I have some reviews regarding the library too. Just revising them and will update the checklist and my reviews accordingly. Thank you.
@derNarr, could you please give us an approximate timeline for when the issues raised by the reviewers will be addressed? It is perfectly fine if some of it requires more time, but if so, please keep us updated regularly in this thread.
@osorensen, I am currently writing up my PhD thesis and will submit it in mid-September. Therefore, I will address the reviewers' issues at the end of September / beginning of October. I missed that @VenkteshV's evaluation of the code is in issue https://github.com/quantling/pyndl/issues/226 and was still waiting for it.
Thanks to all of you for the good suggestions and reviews. We have already started addressing some of the issues raised, and I will write some prose here when we think that we have made progress on them.
Thanks @derNarr. This sounds good. Best of luck with your dissertation.
@osorensen short update: I have to make some (small) corrections to my dissertation and was sick for two weeks in October. Therefore, I will start working on the review at the beginning of November. Then I will check which parts we have already addressed by then and give another update here.
That's fine @derNarr. Thanks a lot for updating us.
Dear @osorensen, dear @frankier, dear @VenkteshV, dear @jinhangjiang,
we have now addressed all of the urgent issues and improved the paper and the package documentation according to your comments. We especially oriented ourselves along this comment.
The documentation of pyndl is substantially improved; we added a small benchmark and the code to run it; and we made clearer that pyndl is mainly about a learning algorithm, but that the Unicode support in Python 3/Cython is a major advantage over the difficulties of properly processing Unicode in R/C++. The claim of "multi-language" support comes mainly from this fact, as it was too painful for most quantitative linguists to debug their code in R/C++ with ndl2. (Sometimes it worked flawlessly even in R, though, which might have depended on the installed locales of the machine.)
We added a citation to xarray and sanitized the paper to use US English and idiomatic expressions. Furthermore, we clarified that the main focus is on grapheme clusters, but that other input formats, even speech audio recordings, can be processed with pyndl, though only with custom preprocessing pipelines. We believe the same is true for ideograms / pictograms / logograms.
In the acknowledgements, we added a sentence thanking the reviewers and acknowledging the good review process (@frankier, @VenkteshV, @jinhangjiang, are you okay with this?):
Finally, this paper and the associated package benefited from constructive and conscientious peer review. We would like to thank our three reviewers, Venktesh V, Jinhang Jiang, and especially Frankie Robertson, for their constructive and in-depth feedback and their suggestions on how to make the package more user-friendly.
During our clean-up, we renamed the `master` branch to `main` and renamed the folder containing the documentation from `doc` to `docs`. We published pyndl version 1.0.0, which now supplies a wheel for amd64 Linux systems.
How should we proceed now?
Best regards,
David-Elias, Marc, Elnaz, and Tino
I edited the post; there was an old version of the quote in the original message.
Dear @derNarr,
Thank you for completing the draft and fixing the issues accordingly. The acknowledgment looks good to me. It is greatly appreciated. I am glad you found my input useful.
Also, thank you again for creating this python package to help the community.
Best, Jinhang
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot check references @editorialbot check repository
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.5334/jors.148 is OK
- 10.21236/ad0241531 is OK
- 10.1016/s0022-2496(02)00016-0 is OK
- 10.1037/a0023851 is OK
- 10.1080/23273798.2021.1954207 is OK
- 10.1017/s0022226719000203 is OK
- 10.3389/fcomm.2020.00017 is OK
- 10.21437/interspeech.2018-2420 is OK
- 10.1111/stan.12134 is OK
- 10.1080/23273798.2020.1815813 is OK
- 10.1515/cog-2021-0006 is OK
- 10.1515/9783110292022-006 is OK
- 10.1371/journal.pone.0174623 is OK
- 10.1037/0003-066x.43.3.151 is OK
- 10.31234/osf.io/prvzq is OK
MISSING DOIs
- None
INVALID DOIs
- None
FYI, @editorialbot only sees a command if it is the first thing in a comment, and thus only one command can be issued per comment
@editorialbot check repository
(Thanks for the heads-up)
Software report:
github.com/AlDanial/cloc v 1.88 T=0.14 s (440.1 files/s, 69176.0 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Python 25 1167 1226 3472
reStructuredText 15 541 589 921
Cython 5 129 81 568
TeX 1 19 0 214
Markdown 3 33 0 193
YAML 5 23 18 106
TOML 1 13 0 60
JSON 1 0 0 58
SQL 1 0 0 51
R 1 6 3 32
DOS Batch 1 8 1 27
make 1 4 6 10
Bourne Shell 1 2 1 6
-------------------------------------------------------------------------------
SUM: 61 1945 1925 5718
-------------------------------------------------------------------------------
gitinspector failed to run statistical information for the repository
Wordcount for paper.md is 1423
Okay! I think the paper is probably a bit too long by JOSS standards now, which are 500-1000 words. (I think a little bit over 1000 words is okay, but it seems like it's probably too much now.) If it were possible to fit things on two pages, that might be ideal. I suppose this is an editorial issue and so might be beyond my remit, and I'm not quite sure how much this is enforced.
It's possible the plot could be made a bit smaller, which may help with reducing the number of pages (but not the number of words). It might be possible to fit it in a single pane, given that there need only be: ndl (1 job), ndl2 (1 job), ndl2 (2 jobs), pyndl (1 job), pyndl (2 jobs, openMP), and pyndl (2 jobs, threading). You may need to move the legend outside the plotting area, but it might still save a tiny bit of space.
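For illustration, a rough matplotlib sketch of such a single-pane layout (the runtimes are placeholders, not the paper's measured values; a horizontal bar chart also sidesteps the legend-placement problem entirely):

```python
# Rough sketch of a single-pane benchmark plot. The runtimes are
# placeholders to be replaced with the measured values from the paper.
import matplotlib.pyplot as plt

configs = ["ndl (1 job)", "ndl2 (1 job)", "ndl2 (2 jobs)",
           "pyndl (1 job)", "pyndl (2 jobs, openMP)",
           "pyndl (2 jobs, threading)"]
runtimes = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # placeholder values

fig, ax = plt.subplots(figsize=(5, 2.5))
ax.barh(configs, runtimes)  # config names label each bar; no legend needed
ax.set_xlabel("wall-clock time (s)")
fig.tight_layout()
fig.savefig("benchmark_single_pane.png", dpi=300)
```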
The acknowledgement is appreciated, but if shortening/removing it helps you get to JOSS editorial standards, that's fine by me.
I agree that everything in my review has been addressed, and so I recommend that it be accepted, possibly pending some edits to the paper for length.
Submitting author: @derNarr (Konstantin Sering)
Repository: https://github.com/quantling/pyndl
Branch with paper.md (empty if default branch): paper
Version: v1.1.1
Editor: @osorensen
Reviewers: @frankier, @jinhangjiang, @VenkteshV
Archive: 10.5281/zenodo.7410272
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@frankier & @jinhangjiang & @VenkteshV, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:
@editorialbot generate my checklist
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @osorensen know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Checklists
📝 Checklist for @frankier
📝 Checklist for @VenkteshV
📝 Checklist for @jinhangjiang