[REVIEW]: AuDoLab: Automatic document labelling and classification for extremely unbalanced data

whedon commented 3 years ago

Submitting author: @ArneTillmann (Arne Matthias Tillmann)
Repository: https://github.com/ArneTillmann/AuDoLab
Version: v1.0.7
Editor: @arfon
Reviewers: @linuxscout, @pps121
Archive: 10.5281/zenodo.5575835

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/ac8cf139139dbe55e00e7cc820459cee"><img src="https://joss.theoj.org/papers/ac8cf139139dbe55e00e7cc820459cee/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/ac8cf139139dbe55e00e7cc820459cee/status.svg)](https://joss.theoj.org/papers/ac8cf139139dbe55e00e7cc820459cee)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@linuxscout, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @arfon know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Review checklist for @linuxscout

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@ArneTillmann) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

@pps121, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @arfon know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Review checklist for @pps121

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@ArneTillmann) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @linuxscout it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 3 years ago

Wordcount for paper.md is 824

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1162/jmlr.2003.3.4-5.993 is OK
- 10.1080/02664763.2021.1919063 is OK
- 10.1162/089976601750264965 is OK
- 10.3115/v1/W14-3110 is OK
- 10.1162/15324430260185574 is OK
- 10.21105/joss.02507 is OK
- 10.17875/gup2020-1338 is OK
- 10.13140/2.1.2393.1847 is OK
- 10.17875/gup2020-1338 is OK

MISSING DOIs

- 10.5260/cca.199178 may be a valid DOI for title: IEEE Xplore Digital Library

INVALID DOIs

- 10.5555/1953048.2078195 is INVALID

whedon commented 3 years ago

Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.08 s (676.8 files/s, 52936.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          15            462            658           1030
XML                              6              0              0            213
reStructuredText                17            123             77            211
diff                             5             42             49            141
HTML                             2             15              0            134
TeX                              1             14              0            114
PowerShell                       1             49            245            104
make                             2             24              6             75
DOS Batch                        3             23              2             65
Markdown                         1             14              0             56
Jupyter Notebook                 1              0            430             40
INI                              1              4              3             16
YAML                             2              1              2             16
-------------------------------------------------------------------------------
SUM:                            57            771           1472           2215
-------------------------------------------------------------------------------

Statistical information for the repository '702fdb33409d86c5e788a30d' was
gathered on 2021/09/11.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
AFThielmann                     13           444            295            7.51
Anton Thielmann                 21          1256            449           17.33
ArneTillmann                   172          4166           2990           72.72
kantg                            3           198             40            2.42
tkneib                           1             1              1            0.02

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
AFThielmann                  81           18.2          4.4               16.05
Anton Thielmann             823           65.5          2.1                6.20
ArneTillmann               1218           29.2          2.5               17.08
kantg                        28           14.1          0.7                3.57

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

arfon commented 3 years ago

@linuxscout – This is the review thread for the paper. All of our communications will happen here from now on.

Please read the "Reviewer instructions & questions" in the first comment above.

Both reviewers have checklists at the top of this thread (in that first comment) with the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. There are also links to the JOSS reviewer guidelines.

The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. When doing so, please mention https://github.com/openjournals/joss-reviews/issues/3719 so that a link is created to this thread (and I can keep an eye on what is happening). Please also feel free to comment and ask questions on this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.

We aim for the review process to be completed within about 4-6 weeks but please make a start well ahead of this as JOSS reviews are by their nature iterative and any early feedback you may be able to provide to the author will be very helpful in meeting this schedule.

ChrisW09 commented 3 years ago

Hi @linuxscout, we are very much looking forward to the review and your comments. thanks.

arfon commented 3 years ago

@whedon add @pps121 as reviewer

whedon commented 3 years ago

OK, @pps121 is now a reviewer

arfon commented 3 years ago

@pps121 - thanks for agreeing to review this submission for us! Please take a look at my instructions above and complete your checklist as you work through your review.

ChrisW09 commented 3 years ago

@arfon fantastic, thank you! @linuxscout @pps121 should you have any early ideas or comments, on how to improve the paper or make things more clear, please just let us know, and we will implement them as soon as we can. thanks a lot.

linuxscout commented 3 years ago

I'm not assigned to this issue. To perform the review, please assign me.

ChrisW09 commented 3 years ago

@linuxscout thanks for the update! @arfon could you please help with this request: "I'm not assigned to this issue. To perform the review, please assign me." Thank you!

arfon commented 3 years ago

I'm not assigned to this issue. To perform the review, please assign me.

@linuxscout – did you accept the invitation at https://github.com/openjournals/joss-reviews/invitations ?

arfon commented 3 years ago

@whedon re-invite @linuxscout as reviewer

whedon commented 3 years ago

OK, the reviewer has been re-invited.

@linuxscout please accept the invite by clicking this link: https://github.com/openjournals/joss-reviews/invitations

linuxscout commented 3 years ago

ok, thanks.

linuxscout commented 3 years ago

Hi, I finished the review.

ChrisW09 commented 3 years ago

@linuxscout thanks a lot for being that quick with the review. We really appreciated your comments and the fast review process!

ChrisW09 commented 3 years ago

@pps121 we are already looking forward to your comments. Please let us know if anything is unclear :) thanks!

ChrisW09 commented 3 years ago

Hi, @pps121 how are things going with the review? Please let us know if anything is unclear. thanks!

arfon commented 3 years ago

FYI I just emailed @pps121 to see when they might be able to complete their review by.

ChrisW09 commented 3 years ago

Thank you @arfon! We are looking forward to your feedback @pps121.

arfon commented 3 years ago

I just heard back from @pps121 and they are committed to completing their review soon, but are currently busy with school/university commitments.

ChrisW09 commented 3 years ago

Great, thank you both @arfon and @pps121!

pps121 commented 3 years ago

When I ran the code:

from nltk.corpus import reuters

it gave me a LookupError as below:

Resource reuters not found. Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('reuters')

pps121 commented 3 years ago

scraped_documents = audo.get_ieee("https://ieeexplore.ieee.org/search /searchresult.jsp?newsearch=true& queryText=cotton&highlight=true& returnFacets=ALL&returnType=SEARCH& matchPubs=true&rowsPerPage=100& pageNumber=1\", pages=1)

Here, there should not be a backslash (\) after pageNumber=1; otherwise it throws a syntax error.
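
A minimal sketch of one way to avoid this class of error: assemble the search URL from its parts with the standard library, so that no stray escape character can end up in front of the closing quote. The query parameters are copied from the call quoted above. The helper `build_ieee_search_url` is hypothetical and not part of AuDoLab, and the final `audo.get_ieee` call is left commented out because it needs the package and network access.

```python
from urllib.parse import urlencode

BASE_URL = "https://ieeexplore.ieee.org/search/searchresult.jsp"


def build_ieee_search_url(query_text, rows_per_page=100, page_number=1):
    """Assemble the IEEE Xplore search URL used in the call quoted above.

    Hypothetical helper for illustration; not part of AuDoLab.
    """
    params = {
        "newsearch": "true",
        "queryText": query_text,
        "highlight": "true",
        "returnFacets": "ALL",
        "returnType": "SEARCH",
        "matchPubs": "true",
        "rowsPerPage": rows_per_page,
        "pageNumber": page_number,
    }
    # urlencode joins the pairs with "&" and percent-escapes the values.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_ieee_search_url("cotton")
print(url)

# With AuDoLab installed, the call from the comment above would then read:
# scraped_documents = audo.get_ieee(url, pages=1)
```

Building the string this way also removes the spaces that appear when the long URL is copied from a wrapped line.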

pps121 commented 3 years ago

For the above LookupError, the code below gives a fix:

import nltk
nltk.download('reuters')

It downloads a reuters zip into /nltk_data/corpora; re-running the cell then proceeds to the next step.

pps121 commented 3 years ago

Before executing the line

preprocessed_target = audo.text_cleaning(data=data, column="text")

we need to download wordnet as below; otherwise there will be a module-not-found runtime error:

import nltk
nltk.download('wordnet')
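
The two download steps reported in this thread could be collected at the top of the notebook. The following is only a sketch: `ensure_nltk_resources` is a hypothetical helper, not part of AuDoLab or NLTK, and the `download` parameter exists solely so that the helper can be exercised without network access. By default it calls `nltk.download`, which skips packages that are already installed and up to date.

```python
def ensure_nltk_resources(names, download=None):
    """Fetch each named NLTK resource and return {name: success flag}.

    Hypothetical helper for illustration. `download` defaults to
    nltk.download and is only a parameter so the helper can be tested
    with a stand-in function.
    """
    if download is None:
        import nltk  # imported lazily; only needed for the real download

        download = nltk.download
    # nltk.download returns True on success and False on failure.
    return {name: bool(download(name, quiet=True)) for name in names}


# Usage in the notebook, before importing the corpus or cleaning text:
# ensure_nltk_resources(["reuters", "wordnet"])
```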

ChrisW09 commented 3 years ago

Dear @pps121, thank you very much for your helpful and valuable comments. We will fix these issues as quickly as possible and get back to you soon!

pps121 commented 3 years ago

The code-level steps should be explained in more detail, so that each step states its high-level purpose before it is executed.
In .travis.yml, the YAML exposes a severe security hole: the username and password are given in plain text, as below:
username: ArneTillmann
password: 2AQeUe5iHHEe0MrEv7

Gitignore should take care of it before merging.

pps121 commented 3 years ago

The naming convention of ipython notebook should be more meaningful rather than example.ipynb

pps121 commented 3 years ago

@ChrisW09 Can you explain why the number of topics is set to 5? The methods lda_modeling() and lda_visualize_topics() take a long time to execute. Can you please share your thoughts?

ChrisW09 commented 3 years ago

@pps121 thank you very much for the comments! We will get back to you as soon as possible.

ArneTillmann commented 3 years ago

Hi @pps121, first of all, thank you for taking the time to review our work. I will reply to all your comments now!

ArneTillmann commented 3 years ago

The naming convention of ipython notebook should be more meaningful rather than example.ipynb

I changed the name to usage_example.ipynb. Is that better, or do you have something else in mind?

ArneTillmann commented 3 years ago

@ChrisW09 Can you explain why the number of topics is set to 5? The methods lda_modeling() and lda_visualize_topics() take a long time to execute. Can you please share your thoughts?

The number of topics in the LDA model needs to be fixed before fitting; this is the user's choice. We chose five here because it yields results that are easy to interpret. Regarding the slow execution time, I could not reproduce that issue. On my machine it took less than 10 seconds to fit the model and produce the visualization. Can you try again, or try in a different environment?
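
To compare timings between environments, a small wall-clock helper may be enough. This is a sketch using only the standard library; `timed` is a hypothetical helper, and the `sum` call is a stand-in workload rather than an AuDoLab call, since the arguments used in the notebook are not shown in this thread.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload. In the notebook, wrap the call being measured instead,
# e.g. timed(audo.lda_modeling, ...) with the arguments used there.
result, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.3f} s")
```

Reporting the measured seconds together with the Python version and operating system would make the two observations easier to compare.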

ArneTillmann commented 3 years ago

When I ran the code:

from nltk.corpus import reuters

it gave me a LookupError as below:

Resource reuters not found. Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('reuters')

Thank you for this advice. I added those lines to usage_example.ipynb.

ArneTillmann commented 3 years ago

Before executing the line

preprocessed_target = audo.text_cleaning(data=data, column="text")

we need to download wordnet as below; otherwise there will be a module-not-found runtime error:

import nltk
nltk.download('wordnet')

I added those as well.

ArneTillmann commented 3 years ago

* The code-level steps should be explained in more detail, so that each step states its high-level purpose before it is executed.

* In .travis.yml, the YAML exposes a severe security hole: the username and password are given in plain text, as below:
  **username**: ArneTillmann
  **password**: 2AQeUe5iHHEe0MrEv7

Gitignore should take care of it before merging.

Thank you especially for this hint. I hadn't noticed this; I have now changed the password and encrypted it. However, I don't really know which files or lines of code you think should be explained in more detail. Can you specify what you would prefer here?
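
A check along the following lines could catch a plain-text credential before it is committed. It is a rough sketch, not a substitute for a real secret scanner: `find_plaintext_secrets` and the list of key names are hypothetical, the sample values are placeholders, and the "encrypted" sample assumes the nested `secure:` layout that Travis CI uses for encrypted values.

```python
import re

# Key names treated as secret-like; an assumption made for this sketch.
SECRET_KEYS = ("password", "token", "api_key")
_PATTERN = re.compile(
    r"^\s*(?:-\s*)?(?P<key>%s)\s*:\s*(?P<value>\S.*?)\s*$" % "|".join(SECRET_KEYS),
    re.IGNORECASE,
)


def find_plaintext_secrets(yaml_text):
    """Return (line_number, key) for secret-like keys with an inline value."""
    hits = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        match = _PATTERN.match(line)
        if not match:
            continue  # other key, or the value sits on nested lines
        value = match.group("value").strip("\"'")
        # Ignore comments and environment-variable references.
        if value.startswith("#") or value.startswith("$"):
            continue
        hits.append((number, match.group("key")))
    return hits


plain = """deploy:
  provider: pypi
  username: example-user
  password: not-a-real-password
"""
encrypted = """deploy:
  provider: pypi
  username: example-user
  password:
    secure: "ciphertext-placeholder"
"""
print(find_plaintext_secrets(plain))      # [(4, 'password')]
print(find_plaintext_secrets(encrypted))  # []
```

Note that a credential that has already been pushed remains in the repository history, which is why changing the password, as done here, is the important step.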

pps121 commented 3 years ago

The naming convention of ipython notebook should be more meaningful rather than example.ipynb

I changed the name to usage_example.ipynb. Is that better, or do you have something else in mind?

It is better now. Thanks.

pps121 commented 3 years ago

The naming convention of ipython notebook should be more meaningful rather than example.ipynb

I changed the name to usage_example.ipynb. Is that better, or do you have something else in mind?

* The code-level steps should be explained in more detail, so that each step states its high-level purpose before it is executed.

* In .travis.yml, the YAML exposes a severe security hole: the username and password are given in plain text, as below:
  **username**: ArneTillmann
  **password**: 2AQeUe5iHHEe0MrEv7

Gitignore should take care of it before merging.

Thank you especially for this hint. I hadn't noticed this; I have now changed the password and encrypted it. However, I don't really know which files or lines of code you think should be explained in more detail. Can you specify what you would prefer here?

Thanks for the encryption step. The current state of the code modules is okay now.

pps121 commented 3 years ago

Below are my thoughts and suggestions:

The paper should include more examples of how to use the software across different domains (to solve real-world problems).
For different OS variants, can you include steps for both bash and PowerShell so that general users with different OS can find this more useful?
The paper should clearly state how this package is better than other baselines or commonly used packages, to establish the state of the field.

Thank you.

ChrisW09 commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

ChrisW09 commented 3 years ago

@pps121 thank you again for your comments and advice.

I updated the paper to address your first comment, "The paper should include more examples of how to use the software across different domains (to solve real-world problems)", by including the following explanation:

Hence, AuDoLab has a broad range of real-world applications in scientific research or business. In the following, a few potential use cases are briefly discussed to illustrate this range across various domains. For example, AuDoLab could be used to identify emails on very specific topics, such as fraud or money laundering, that might have an extremely low prevalence. Similarly, AuDoLab could be used in the medical field to classify medical documents concerned with very specific topics such as heart attacks or dental problems. Furthermore, AuDoLab may be used to identify legal documents on very specific topics such as machine learning. Note that the only factor limiting the range of use cases is the availability of out-of-domain training data, which can be generated via web scraping from IEEEXplore [@IEEE], arXiv or PubMed. Given that a broad range of training documents can be obtained from these websites, AuDoLab has a correspondingly broad range of applications.

ChrisW09 commented 3 years ago

@pps121 Regarding your second comment "For different OS variants, can you include steps for both bash and PowerShell so that general users with different OS can find this more useful?".

We tested our package on various operating systems, namely macOS, Unix and Windows.

We will add explanations for the installation for both bash and PowerShell. This will make the package more useful for the general user as you suggested. Thank you for these suggestions!
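
One shell-independent option is to invoke pip through the running Python interpreter, which behaves the same from bash and PowerShell. The sketch below only builds and prints the command. The helper names are hypothetical, and `AuDoLab` is assumed here to be the name under which the package is published.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command tied to the interpreter that is currently running."""
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run the command; raises CalledProcessError if pip exits with an error."""
    subprocess.run(pip_install_command(package), check=True)


# "AuDoLab" is an assumed package name; nothing is installed by this line.
print(" ".join(pip_install_command("AuDoLab")))
```

Using `sys.executable` rather than a bare `pip` avoids installing into a different Python than the one running the notebook.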

ChrisW09 commented 3 years ago

@pps121 Thank you for your comment: "The paper should clearly state how this package is better than other baselines or commonly used packages, to establish the state of the field."

In the "Comparison with existing tools" section in our paper we point out that there are no other baselines / commonly-used packages, because AuDoLab is based on the statistical methodology that was recently developed by us in this publication:

Thielmann, A., Weisser, C., Krenz, A., & Säfken, B. (2021). Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling. Journal of Applied Statistics, 1–18. https://doi.org/10.1080/02664763.2021.1919063

We have now updated the "Comparison with existing tools" section to elaborate more on this point:

"At the moment, no Python package with functionality comparable to AuDoLab is available, since AuDoLab is based on a novel and recently published classification procedure [@Thielmann]. AuDoLab combines Web Scraping, Topic Modelling and One-class Classification, for each of which individual packages are available. Details on the statistical methodology can be found in [@Thielmann]. An application of the methodology to a data set of patent data can be found in [@Thielmann2021]. For Topic Modelling, available packages include the LDA algorithm as implemented in Gensim [@rehurek_lrec] and the package TTLocVis [@Kant2020] for short and sparse text. Visual representations of the topics can be produced with LDAvis [@ldavis]. One-class SVM classification is available in Scikit-learn [@scikit-learn]. Further research could explore Deep Learning algorithms as well [@Saefken2020; @Saefken2021]."

Please let us know if this meets your expectations or what else we should change. Thank you very much for your advice and help!

ChrisW09 commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
