[PRE REVIEW]: CleanX: A Python library for data cleaning of large sets of X-rays

whedon commented 3 years ago

Submitting author: @drcandacemakedamoore (Candace Makeda Moore) Repository: https://github.com/drcandacemakedamoore/cleanX Version: v0.0.7 Editor: Pending Reviewer: Pending Managing EiC: Kevin M. Moerman

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Author instructions

Thanks for submitting your paper to JOSS @drcandacemakedamoore . Currently, there isn't a JOSS editor assigned to your paper.

@drcandacemakedamoore if you have any suggestions for potential reviewers then please mention them here in this thread (without tagging them with an @). In addition, the people on this list have already agreed to review for JOSS and may be suitable for this submission (please start at the bottom of the list).

Editor instructions

The JOSS submission bot @whedon is here to help you find and assign reviewers and start the main review. To find out what @whedon can do for you type:

@whedon commands

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 3 years ago

Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.04 s (386.2 files/s, 54696.2 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           4            307            468            763
Markdown                         3            156              0            278
YAML                             3              9              5            124
TeX                              1              9              0             54
DOS Batch                        2              8              1             28
reStructuredText                 1             10              5             19
make                             1              4              7              9
Bourne Shell                     1              0              0              2
-------------------------------------------------------------------------------
SUM:                            16            503            486           1277
-------------------------------------------------------------------------------

Statistical information for the repository 'aaa9e70b3f4335912a81ce9f' was
gathered on 2021/04/30.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Candace Makeda Moore            71         15725          14330           99.22
Oleg Sivokon                     5           184             41            0.74
andrew-f-murphy                  2             5              5            0.03

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Candace Makeda Moore       1381            8.8          0.2                9.49
Oleg Sivokon                153           83.2          0.2               21.57
andrew-f-murphy               4           80.0          0.2              100.00

whedon commented 3 years ago

PDF failed to compile for issue #3232 with the following error:

 /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon.rb:204:in `block in parse_authors': Author (Oleg Sivokon) is missing affiliation (RuntimeError)
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon.rb:202:in `each'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon.rb:202:in `parse_authors'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon.rb:93:in `initialize'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon/processor.rb:38:in `new'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon/processor.rb:38:in `set_paper'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/bin/whedon:58:in `prepare'
    from /app/vendor/bundle/ruby/2.6.0/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
    from /app/vendor/bundle/ruby/2.6.0/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
    from /app/vendor/bundle/ruby/2.6.0/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
    from /app/vendor/bundle/ruby/2.6.0/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/bin/whedon:131:in `<top (required)>'
    from /app/vendor/bundle/ruby/2.6.0/bin/whedon:23:in `load'
    from /app/vendor/bundle/ruby/2.6.0/bin/whedon:23:in `<main>'

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1145/2207243.2207253 is OK

MISSING DOIs

- 10.1201/9780429200717-7 may be a valid DOI for title: Tidy data
- 10.1007/978-3-319-94878-2_6 may be a valid DOI for title: A Standardised Approach for Preparing Imaging Data for Machine Learning Tasks in Radiology.
- 10.1007/s00330-020-07453-w may be a valid DOI for title: COVID-19, AI enthusiasts, and toy datasets: radiology without radiologists

INVALID DOIs

- None

Kevin-Mattheus-Moerman commented 3 years ago

@drcandacemakedamoore thanks for this submission. Can you check the above :point_up: error in relation to your paper compilation? It seems an affiliation field has been left empty.
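The kind of check that failed here can be sketched locally before submitting. This is a minimal illustration in Python, not whedon's actual Ruby code: it assumes the paper's front matter has already been parsed into a dictionary with an `authors` list, and the function name `authors_missing_affiliation` and the placeholder author names are hypothetical.

```python
def authors_missing_affiliation(front_matter):
    """Return the names of authors whose affiliation is absent or empty."""
    missing = []
    for author in front_matter.get("authors", []):
        affiliation = author.get("affiliation")
        # Treat a missing key, None, and a blank string the same way.
        if affiliation is None or str(affiliation).strip() == "":
            missing.append(author.get("name", "<unnamed>"))
    return missing


# Placeholder metadata for illustration only (not the real paper's values).
front_matter = {
    "authors": [
        {"name": "Author A", "affiliation": 1},
        {"name": "Author B", "affiliation": ""},
    ],
    "affiliations": [{"name": "Example Institute", "index": 1}],
}

missing = authors_missing_affiliation(front_matter)
print(missing)
```

Running a check like this before requesting a PDF build would surface an empty affiliation field without waiting for the bot.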

Kevin-Mattheus-Moerman commented 3 years ago

@drcandacemakedamoore unfortunately none of our editors in the domains of image processing or machine learning are currently available to handle this work. Hence I've labelled this issue as waitlisted, which means we will resume the handling/reviewing of this work once one of our related editors becomes available. Thanks for your patience.

Kevin-Mattheus-Moerman commented 3 years ago

@jgostick I see you just finished editing https://github.com/openjournals/joss-reviews/issues/3045, which sounds related to this submission (although this one is imaging rather than spectroscopy). Is this work something you could help with too? Thanks

jgostick commented 3 years ago

Ok...but first I think we need to do a "scope query".

Kevin-Mattheus-Moerman commented 3 years ago

@whedon query scope

whedon commented 3 years ago

Submission flagged for editorial review.

drcandacemakedamoore commented 3 years ago

I will check the MISSING DOIs and try to update them today. There is no missing affiliation; Oleg Sivokon's affiliation was left blank, as he requested.

drcandacemakedamoore commented 3 years ago

I see that leaving an affiliation blank causes problems. I have added one and fixed the other problems. I have edited the DOIs in the bib file and recompiled the paper without problems on the Whedon paper preview service.

drcandacemakedamoore commented 3 years ago

This is just a message to let you know that I think I have fixed the problems with the submission. I see that leaving an affiliation blank caused problems, so we have added one for Oleg Sivokon. I have edited the DOIs in the bib file and recompiled the paper without problems on the Whedon paper preview service. Please let me know if there is a human who sees this email. Dr. Candace Makeda Moore, MD

jgostick commented 3 years ago

Hi @drcandacemakedamoore...GitHub puts your email into the issue thread, so it shows up here.

FYI, we have 'flagged' your submission for editorial review, which means the editors are deciding if it meets the threshold in terms of size and effort. (We need to screen out small packages that are just a few functions.) It seems that your package is a bit borderline in terms of size. 700 lines of Python is not usually considered enough. I have looked through your code base and I'm on the fence. There are certainly some useful functions in there, but I'm not sure they represent a "significant scholarly effort". Some of your functions are just one-line wrappers.

I also noticed that your package is lacking in a few areas. For instance, the docs are not rendering well, the readme is pretty cluttered, and you have all the code in a single file at the top level, rather than broken up into submodules. This is not strictly necessary, but typically code would be in a folder like cleanx, and there'd be several files in there with functions broken up by category. Then you'd tell Python to import them as different modules, like cleanx.tools.proportions_ht_wt_to_histo and cleanx.dataframes.dataframe_up_my_pics and cleanx.find.suspect_text_by_legnth. This sort of categorization makes the package more sensible to the user. Also note that I actually cut and pasted that last one and noticed that there is a typo in "length". This all just generally gives me the feeling that the package is not yet 100% ready.
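The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the package name `cleanx_demo`, the submodule names, and the two toy functions are hypothetical and are not taken from the cleanX code base. The script writes a small package into a temporary directory and imports it, so the submodule access pattern can be tried directly.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout: one folder, one file per category of function,
# and an __init__.py that exposes the submodules.
LAYOUT = {
    "cleanx_demo/__init__.py": "from . import find, tools\n",
    "cleanx_demo/tools.py": (
        "def aspect_ratio(height, width):\n"
        "    # Toy helper: ratio of image height to width.\n"
        "    return height / width\n"
    ),
    "cleanx_demo/find.py": (
        "def suspect_text_by_length(texts, max_length):\n"
        "    # Toy helper: keep strings longer than max_length.\n"
        "    return [t for t in texts if len(t) > max_length]\n"
    ),
}


def build_package(root):
    """Write the sketch package under root and return the imported package."""
    for relative_path, source in LAYOUT.items():
        target = Path(root) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(source)
    sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    return importlib.import_module("cleanx_demo")


root = tempfile.mkdtemp()
pkg = build_package(root)

# Functions are now reached through their category submodule.
ratio = pkg.tools.aspect_ratio(200, 100)
flagged = pkg.find.suspect_text_by_length(["ok", "far too long"], 5)
print(ratio, flagged)
```

In a real repository the files would simply live in the source tree; the temporary directory is used here only to keep the sketch self-contained.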

So, perhaps you can convince me/us that this package is fully mature and does represent a significant effort? What are your plans for this in the future? Are you finished developing it, or is this a budding project that will continue to grow?

drcandacemakedamoore commented 3 years ago

I'll address all your questions in my next correspondence, but can you first let me know whether by "the docs" you mean the documentation at https://drcandacemakedamoore.github.io/cleanX/ , and what you mean by not rendering well? I put the documentation there because some technical issues with readthedocs kept it from rendering on readthedocs.

drcandacemakedamoore commented 3 years ago

We have fixed the rendering of the documentation and tried to correct not only all typos but also variable names that contained problematic spelling. Our interaction with you has certainly raised the quality of what we are making.

You may have noticed we had an open issue around templates. We are adding functions related to template finding that are much more specific and complex than simple wrappers. Due to these additions, the code is now longer than 700 lines and includes a new class. We do try to code as efficiently as possible, so we think the package should not necessarily need thousands of lines to accomplish its goals.

To my knowledge, we are tackling novel issues for open source software in medical imaging AI; an example from our newer code is facilitating rotationally invariant template matching based on contours. We also tackle open issues in current scholarship on AI for imaging, such as bias against specific protected groups (an interesting paper on this is Larrazabal, A. J., Nieto, N., Peterson, V., Milone, D. H., and Ferrante, E., "Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis", Proceedings of the National Academy of Sciences, 117(23), 12592-12594, June 2020, DOI: 10.1073/pnas.1919012117).

We believe our code will already be extremely helpful in advancing imaging AI by several mechanisms. One is that it can help people outside major academic institutions in the first world work with existing open datasets, which are known to contain improper data.

Please let us know if putting the code into a folder with several files would improve our chances of publication. In terms of the lines of code, does the count update automatically? Dr. Candace Makeda Moore, MD

wvxvw commented 3 years ago

Hi jgostick,

I'm Oleg. I've contributed to the technical aspects of the cleanX project (CI, packaging and documentation for the most part). I think I understand why you would be on the fence about the submission. Recently, when talking to Dr. Moore about the project, I tried to draw an analogy with the famous paper by E. Dijkstra where he first argued the benefits of structured programming. He opened the discussion by mentioning how it would be practically impossible to test the newly designed ALU by verifying manually that all the sums of two-byte integers are correct.

In a very similar way, the task of data cleaning seems to be too vast to have any kind of algorithmic solution. Thus, any library that tries to serve this purpose will be inevitably lacking. While cleanX can be more polished (and we'll definitely get there), it is more important to establish a starting point. It will require a lot more time and experience to establish a methodology, a systematic approach to data cleaning, but, to the best of my knowledge, there isn't really even an attempt to make a library or tool towards that end.

While Dijkstra, in his paper, hadn't written any actual code, he suggested the basic building blocks of control flow that are the bread and butter of modern-day programming. Even though this paper doesn't have an equivalent of control flow for programming, its purpose is to open a discussion and, hopefully, eventually find a more systematic way to approach data cleaning.

I think the paper can also be edited to highlight this aspect: it is not "the library to do data cleaning"; it is "what are some problems and some solutions we've discovered when actually doing data cleaning". It should be looked at more as a question with part of an answer than as a complete answer.

jgostick commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

PDF failed to compile for issue #3232 with the following error:

 /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon/author.rb:72:in `block in build_affiliation_string': Problem with affiliations for Candace Moore,, perhaps the affiliations index need quoting? (RuntimeError)
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon/author.rb:71:in `each'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon/author.rb:71:in `build_affiliation_string'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon/author.rb:17:in `initialize'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon.rb:205:in `new'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon.rb:205:in `block in parse_authors'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon.rb:202:in `each'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon.rb:202:in `parse_authors'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon.rb:93:in `initialize'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon/processor.rb:38:in `new'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/lib/whedon/processor.rb:38:in `set_paper'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/bin/whedon:58:in `prepare'
    from /app/vendor/bundle/ruby/2.6.0/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
    from /app/vendor/bundle/ruby/2.6.0/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
    from /app/vendor/bundle/ruby/2.6.0/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
    from /app/vendor/bundle/ruby/2.6.0/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
    from /app/vendor/bundle/ruby/2.6.0/bundler/gems/whedon-92346a0773a4/bin/whedon:131:in `<top (required)>'
    from /app/vendor/bundle/ruby/2.6.0/bin/whedon:23:in `load'
    from /app/vendor/bundle/ruby/2.6.0/bin/whedon:23:in `<main>'

wvxvw commented 3 years ago

Hello.

Can you please tell me what Ruby code is responsible for the output above? (My understanding is that whedon is responsible for running some sort of cron job, but what's in the job?) The instructions on how to generate the PDF that I found here: https://joss.readthedocs.io/en/latest/submitting.html#docker don't seem to use any Ruby code at all. All I see is the typical xelatex output.

There are some problems with the template (overfull box etc.) which I need to investigate, but it does produce a PDF, and it doesn't complain about affiliation.

wvxvw commented 3 years ago

Also, as described in the linked documentation, I added a GitHub action to generate the paper. It seems to be doing fine (+ I fixed the overfull box issues, which were related to malformed bibliography). You can see the generated PDF here: https://github.com/drcandacemakedamoore/cleanX/actions/runs/829147498 (in the artifacts section).

danielskatz commented 3 years ago

👋 @drcandacemakedamoore - I'm sorry to say that after discussion amongst the JOSS editors, we have decided that this submission does not meet the substantial scholarly effort criterion for review by JOSS. Please see https://joss.readthedocs.io/en/latest/submitting.html#other-venues-for-reviewing-and-publishing-software-packages for other suggestions for how you might receive credit for your work. In addition, please consider the comments from @jgostick. We would look forward to a resubmission of this work in the future, assuming that it adds functionality and uses more standard methods internally, as suggested by @jgostick.

danielskatz commented 3 years ago

@whedon reject

whedon commented 3 years ago

Paper rejected.
