openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal

[REVIEW]: seesus: a social, environmental, and economic sustainability classifier for Python #6244

Closed editorialbot closed 5 months ago

editorialbot commented 8 months ago

Submitting author: @caimeng2 (Meng Cai)
Repository: https://github.com/caimeng2/seesus
Branch with paper.md (empty if default branch):
Version: v1.2.1
Editor: @oliviaguest
Reviewers: @varsha2509, @luyuhao0326
Archive: 10.5281/zenodo.10854083

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/6bfbe71ac4a3f4799c6cbbfb15a07ff6"><img src="https://joss.theoj.org/papers/6bfbe71ac4a3f4799c6cbbfb15a07ff6/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/6bfbe71ac4a3f4799c6cbbfb15a07ff6/status.svg)](https://joss.theoj.org/papers/6bfbe71ac4a3f4799c6cbbfb15a07ff6)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@varsha2509, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @oliviaguest know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest.

Checklists

📝 Checklist for @luyuhao0326

📝 Checklist for @varsha2509

editorialbot commented 8 months ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 8 months ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.07 s (291.7 files/s, 69714.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
HTML                             7             95              0           2244
Python                           6             60            153           1056
TeX                              1             12              0            143
Markdown                         2             33              0             97
Jupyter Notebook                 1              0            563             29
TOML                             1              2              0             26
YAML                             1              1              9             18
-------------------------------------------------------------------------------
SUM:                            19            203            725           3613
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 8 months ago

Wordcount for paper.md is 938

editorialbot commented 8 months ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1002/bse.2195 is OK
- 10.21105/joss.05124 is OK
- 10.1016/j.enpol.2008.02.039 is OK
- 10.1007/s10668-016-9801-z is OK
- 10.3390/ECP2023-14728 is OK
- 10.5040/9781509934058.0025 is OK
- 10.1007/978-981-10-3521-0_31 is OK
- 10.3390/su14053095 is OK

MISSING DOIs

- None

INVALID DOIs

- None
editorialbot commented 8 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

oliviaguest commented 8 months ago

@editorialbot add @luyuhao0326 to reviewers

editorialbot commented 8 months ago

@luyuhao0326 added to the reviewers list!

oliviaguest commented 8 months ago

:wave: Hi @varsha2509, @luyuhao0326, thank you so much for helping out at JOSS. If you need any pointers, please feel free to look at previous reviews (which can be found by looking at published papers) and the documentation. If you need to comment on the code itself, opening an issue at the repo and then linking to it from here (to help me/others keep track) is the way to go. For comments on the paper, you can also open issues or PRs (say for typos), but those can be directly posted as replies in this issue. Thanks, and feel free to reach out if you need me. :relaxed:

luyuhao0326 commented 8 months ago

Review checklist for @luyuhao0326

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

luyuhao0326 commented 8 months ago

Thanks for the invitation. Below are my review comments on installation and the software paper (most of my comments relate to the paper, since I am not a proficient Python user).

Installation

Software paper

caimeng2 commented 8 months ago

Hi @luyuhao0326,

Thank you very much for taking the time to review seesus. We appreciate your helpful feedback. Please find our point-by-point responses below.

Installation

It would be great to include more detailed installation instructions for users who are not so familiar with Python (e.g., me and many others who will be potentially benefiting from this work) and/or GitHub.

Thank you for your suggestion. seesus is indeed a Python-based software that requires basic knowledge of Python programming. To simplify the installation process, we chose to publish seesus to PyPI and use pip install. In this way, users can easily install the package with one line of command, without the need to manually manage dependencies and configure the package. We have made the installation instructions clearer as suggested (6f02afd).

That being said, I am unable to install and run this package on my machine. I will be happy to do so if one of the authors can help me install the package.

I am more than happy to help. Do you already have Python, pip, and Jupyter (for running the example.ipynb) installed? If yes, typing pip install seesus in your terminal should do the job. If not, I would recommend installing Anaconda first. Please go to Anaconda's website and install it for your specific operating system (instructions). Then you should be able to install seesus by inputting pip install seesus in your terminal. Please let me know if you encounter any problems.

Software paper

The idea is novel and I can see this work being useful in many domains. I have one question regarding the example use cases listed in the paper: the paper claims that seesus can be used to "label academic publications" and for "large-scale scans of planning documents". However, the example in README.md only shows how seesus evaluates individual sentences, which can cause potential misinterpretation and biased results, as the context of "academic publications" and "planning documents" will likely be missing when they are evaluated sentence by sentence. Example 3 provided here, for instance, is not really a paragraph.

Glad to hear that you find our package novel and potentially useful in many domains. To achieve the best results, we recommend splitting a paragraph or a whole document into individual sentences (i.e., using individual sentences as the basic unit for seesus to analyze). This was the reason why we only showed how seesus evaluates individual sentences in README.md at the beginning. Thank you for pointing out that this might cause misinterpretation. To address this concern, first, we have copied the paragraph example (example 3) in example.ipynb to README.md (66862f8). Here you can tell this example is a paragraph (i.e., a chunk of text with several sentences) by scrolling to the right. The display of a Jupyter Notebook in GitHub is a bit confusing because the text is truncated. We’ve added a print statement to prevent this confusion (c77e33e). Second, we have added another example in example.ipynb to demonstrate the package’s usage in the context of academic publications (c77e33e). For both the examples of an academic publication and a planning document, we split the paragraphs into sentences and printed out the results for each sentence. Users can organize the results according to their needs.
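As a rough illustration of how per-sentence results might be organized into a document-level summary, here is a minimal sketch. The `classify` argument is a hypothetical stand-in for a call to the package; the toy classifier and its keyword-to-label mapping are assumptions made for this example and are not the seesus matching syntax. The splitting pattern is also a simple heuristic chosen for the sketch.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split after '.', '?' or '!' followed by whitespace (a simple heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def summarize(text, classify):
    """Count how many sentences receive each label.

    `classify` is any callable that maps one sentence to a list of labels.
    """
    counts = Counter()
    for sentence in split_sentences(text):
        counts.update(classify(sentence))
    return counts


def toy_classify(sentence):
    """Hypothetical stand-in classifier; NOT the seesus matching syntax."""
    labels = []
    if re.search(r"flood", sentence, re.IGNORECASE):
        labels.append("SDG13")
    if re.search(r"communit", sentence, re.IGNORECASE):
        labels.append("SDG11")
    return labels


document = (
    "The city is reducing its risk of coastal flooding. "
    "Communities in the floodplain are involved. "
    "The report was published in spring."
)
summary = summarize(document, toy_classify)
print(summary)
```

With the package installed, something like `summarize(document, lambda s: SeeSus(s).sdg)` could replace the toy classifier, assuming `result.sdg` is an iterable of labels.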

The statement of need is clear but a bit thin. Although I can appreciate that JOSS is a more software-focused journal, it would still be great to provide some context on the current status of text mining/classification on UN-SDG and why it is important to, for example, "quantify which dimension of sustainability receives the most attention".

Thank you for your suggestion. We have incorporated additional context of text mining on SDGs in our paper as suggested (c8162fe). Given that JOSS requires papers to be between 250-1000 words (source), we hope the edits are sufficient to provide the necessary improvement to our statement of need.

Accuracy of 75.5% is decent but is not particularly high. Considering the evaluation method is from a different package, it would be great if the authors could provide a statement (or, even better, specific development plans) on how to improve the accuracy and/or usability of future text mining on SDGs.

This is a great idea. We’ve added a statement on maintenance in README.md to address this (1eacaaf). Following the best practices of open-source software, we welcome and encourage users to report issues if they find that a matching syntax is not accurate or can be improved.

Thanks again for your time and suggestions!

caimeng2 commented 8 months ago

@editorialbot generate pdf

editorialbot commented 8 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

luyuhao0326 commented 8 months ago

@caimeng2 Thanks for your response. I will get back to you within a week or so. I will also try to install the package and give feedback if there is any.

Yuhao

luyuhao0326 commented 8 months ago

I am more than happy to help. Do you already have Python, pip, and Jupyter (for running the example.ipynb) installed? If yes, typing pip install seesus in your terminal should do the job. If not, I would recommend installing Anaconda first. Please go to Anaconda's website and install it for your specific operating system (instructions). Then you should be able to install seesus by inputting pip install seesus in your terminal. Please let me know if you encounter any problems.

Hello, I managed to install the package and while testing one of the provided examples, I encountered a LookupError. Please see the code here.

caimeng2 commented 7 months ago

Hello, I managed to install the package and while testing one of the provided examples, I encountered a LookupError. Please see the code here.

Hi @luyuhao0326, I'm so glad that you got the installation working 🎉 The link to the LookupError is pointing to your localhost so I can't see the traceback. But I suspect it's a bug with nltk (see this). Feel free to try some of the solutions proposed there. An alternative is to use re instead of nltk. Please see if the following code works.

from seesus import SeeSus
import re

text2 = "By working with communities in the floodplain and facilitating flood-resistant building design, DCP is reducing the city’s risks to sea level rise and coastal flooding. Hurricane Sandy was a stark reminder of these risks. The City, led by the Mayor’s Office of Recovery and Resiliency (ORR), has developed a multifaceted plan for recovering from Sandy and improving the city’s resiliency–the ability of its neighborhoods, buildings and infrastructure to withstand and recover quickly from flooding and climate events. As part of this effort, DCP has initiated a series of projects to identify and implement land use and zoning changes as well as other actions needed to support the short-term recovery and long-term vitality of communities affected by Hurricane Sandy and other areas at risk of coastal flooding."

# Split into sentences at whitespace that follows "." or "?", skipping
# abbreviation-like patterns such as "e.g." or "Mr."
for sent in re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text2):
    result = SeeSus(sent)  # classify one sentence at a time
    print('"', sent, '"', sep="")
    print("Is the sentence related to achieving sustainability?", result.sus)
    print("Which SDGs?", result.sdg)
    print("Which SDG targets specifically?", result.target)
    print("Which dimensions of sustainability?", result.see)
    print("----------------")

Thank you for letting me know about this issue. I'll update the examples.
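For anyone who wants to keep nltk where it works, a minimal sketch of a splitter that falls back to re is shown below. It assumes the LookupError comes from missing nltk tokenizer data, which is a common cause but is not confirmed in this thread. The helper names are hypothetical, and the fallback pattern is a simpler heuristic than the one in the example above.

```python
import re


def regex_split(text):
    """Fallback splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def split_sentences(text):
    """Use nltk's sentence tokenizer when it is usable, else fall back to regex."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # ImportError: nltk is not installed.
        # LookupError: nltk is installed but its tokenizer data is missing.
        return regex_split(text)


sentences = split_sentences(
    "Hurricane Sandy was a stark reminder. The City has a plan."
)
print(sentences)
```

If the tokenizer data is indeed what is missing, running nltk's downloader for it is the usual remedy; the fallback simply keeps the examples runnable either way.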

luyuhao0326 commented 7 months ago

Indeed, this is the bug. It is now working with re.

luyuhao0326 commented 7 months ago

The authors have addressed all my comments and made appropriate revisions. I recommend this submission to be accepted by JOSS.

caimeng2 commented 7 months ago

Thanks again for your suggestions, which helped to make our paper and software better!

oliviaguest commented 7 months ago

@varsha2509 is everything going OK with your review? 😊

varsha2509 commented 7 months ago

Review checklist for @varsha2509

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

varsha2509 commented 7 months ago

Hello. Thank you for giving me the opportunity to review this work. The authors have done a great job documenting the software, installation instructions and the functionality. Below are my comments based on the review checklist, as well as additional notes to help improve readability and adoption of this work:

  1. Functionality - while the functional claims of the software have been verified, could the authors provide more details on how the indirect search keys in SDG_keys.py specifically from this line onwards were determined? Can the authors confirm that all targets were included in the keywords?
  2. Automated tests - while the existing tests cover the explained functionality, the authors should consider including more examples in tests, especially relevant sentences that have a negative connotation to clarify the performance of this tool.
  3. Statement of need -
    • The existing statement of need isn't particularly strong. It's not very clear to me what the benefits of seesus are over existing tools, other than the functionality which classifies the expression as environmental, social or economic sustainability. Making the statement of need stronger will help improve adoption of this tool.
    • Could the authors provide an example of what they mean by "also the attainment of SDGs" as specified in the statement of need?
  4. State of the field -
    • OSDG (https://arxiv.org/abs/2211.11252, https://github.com/osdg-ai/osdg-tool) is another open source tool for text based classification of SDG goals and these use NLP/ML based methods. This may be worth highlighting as one of the other classifiers in the statement of need. Along with this, could the authors also include why users would consider seesus over existing open source tools?

Other notes:

caimeng2 commented 7 months ago

Hi @varsha2509,

Thank you for taking the time to review our software and for your valuable feedback. Please find our point-by-point responses below.

Functionality - while the functional claims of the software have been verified, could the authors provide more details on how the indirect search keys in SDG_keys.py specifically from this line onwards were determined? Can the authors confirm that all targets were included in the keywords?

Yes, we can confirm that all targets are included in the keywords. We created the search keys at the levels of both the 17 SDGs and the 169 SDG targets. The indirect keys were first based on Thesaurus, and we (four researchers specialized in SDGs) manually assessed and improved the accuracy of the matching syntax by using thousands of randomly-selected statements from corporate reports. We conducted three rounds of fine-tuning and finalized these keys.

Automated tests - while the existing tests cover the explained functionality, the authors should consider including more examples in tests, especially relevant sentences that have a negative connotation to clarify the performance of this tool.

Thank you for pointing this out. Indeed, matching with negative connotation is seesus’s limitation. seesus can identify the terms related to SDGs but cannot distinguish between achieving SDGs and failing to do so. This limitation lies in regular expression’s limited logic capability and lack of context awareness. We have added another test of direct matching (4968c35) and edited the paper, deleting expressions regarding “attainment of SDGs,” to make it clear that seesus is designed to classify based on relevance.

Statement of need - The existing statement of need isn't particularly strong. It's not very clear to me what the benefits of seesus are over existing tools, other than the functionality which classifies the expression as environmental, social or economic sustainability. Making the statement of need stronger will help improve adoption of this tool.

The biggest benefit of seesus is its finer scale: it captures not only the SDGs but also the 169 SDG targets. To the best of our knowledge, no other Python tool does this. In addition, compared to tools based on machine learning, seesus allows users to examine and modify the matching syntax, so users can always understand and have control over the results. We’ve edited the statement of need to make it stronger as suggested (3f35864).

Could the authors provide an example of what they mean by "also the attainment of SDGs" as specified in the statement of need?

What we meant was seesus specifically looks for terms that are related to achieving the SDGs, and not just SDG-related topics themselves. For example, it is not to find words solely related to emissions (e.g., "emissions", "carbon"), but it looks for terms such as "lowering emissions" and "reducing carbon." However, we realized that this sentence is rather confusing as seesus cannot identify negative expressions, so we have deleted it to avoid further confusion.

State of the field - OSDG (https://arxiv.org/abs/2211.11252, https://github.com/osdg-ai/osdg-tool) is another open source tool for text based classification of SDG goals and these use NLP/ML based methods. This may be worth highlighting as one of the other classifiers in the statement of need. Along with this, could the authors also include why users would consider seesus over existing open source tools?

Thank you for this reference. We have added it to the existing classifiers. We tested OSDG and noticed that it is not able to capture negative expressions either, and the results are only the 17 SDGs, not the targets. We have revised our paper to highlight seesus’s benefit.

Other notes: Running through the code and script as examples, the current tool is not able to capture negative expressions, as regex lacks semantic understanding of text. For instance, the input sentence "One should not resolve climate change for environmental sustainability." is classified as relating to achieving SDG13 and SDG15, but the output should be 'None' or 'Does not match SDG goals'. This seems to be a limitation of this tool and it would be worth highlighting this in a separate section and including some ideas on how the authors plan to address these limitations in future releases of this tool. This will help users be fully aware of the benefits and limitations of this software. Related to the above, could the authors talk briefly about limitations of regex for pattern matching over existing semantic text search/language models?

Thank you for your suggestion. This is a very good point. Compared to language models, regex lacks the ability to understand the semantic meaning or context of text, as it operates based on character patterns. As suggested, we have added a paragraph at the end of the paper to make the limitation of seesus more clear.
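To make the limitation concrete, here is a minimal sketch of a character-pattern match that cannot tell an affirmative sentence from its negation. The pattern is hypothetical and written only for this illustration; it is not taken from SDG_keys.py.

```python
import re

# Hypothetical pattern for illustration only; NOT the seesus matching syntax.
CLIMATE_PATTERN = re.compile(
    r"\b(resolv\w*|tackl\w*|address\w*)\b.*\bclimate change\b",
    re.IGNORECASE,
)


def is_match(sentence):
    """Return True when the pattern occurs anywhere in the sentence."""
    return CLIMATE_PATTERN.search(sentence) is not None


positive = "One should resolve climate change for environmental sustainability."
negated = "One should not resolve climate change for environmental sustainability."

# Both sentences contain the same character sequence, so both match.
print(is_match(positive), is_match(negated))
```

The word "not" sits outside the matched pattern, so the match result is the same for both sentences; distinguishing them would need explicit negation rules or a model that uses context.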

The authors mention that seesus achieves an accuracy rate of 75.5%, as determined by alignment with manual coding. Can the authors comment on how they plan to improve the performance of this tool in future releases as 75% accuracy currently seems low for usability.

Yes, we have included it in the revision of the paper. We devoted hundreds of hours to fine-tune the matching syntax. 75.5% seems low but it is quite reasonable for traditional qualitative analysis. The human intercoder agreement on the same text was only at 83%.
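For readers unfamiliar with this kind of comparison, a minimal sketch of simple percent agreement between two sets of labels follows. The exact evaluation procedure used for seesus is not described in this thread, so plain percent agreement is an assumption here, and the labels are toy data, not our evaluation data.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items (in percent) on which two label sequences agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("label sequences must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Toy data for illustration only.
manual_codes = [True, False, True, True]
tool_codes = [True, False, False, True]

agreement = percent_agreement(manual_codes, tool_codes)
print(agreement)
```

The same function can compare two human coders, which gives a ceiling against which a tool's agreement with manual coding can be read.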

Thanks again for all your comments and suggestions! We feel that our paper is much clearer and stronger than the previous version. Thank you!!! Please let us know if there's anything else.

caimeng2 commented 7 months ago

@editorialbot generate pdf

editorialbot commented 7 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

varsha2509 commented 6 months ago

Hello @caimeng2 - thank you for addressing my comments. Please see my responses below:

Yes, we can confirm that all targets are included in the keywords. We created the search keys at the levels of both the 17 SDGs and the 169 SDG targets. The indirect keys were first based on Thesaurus, and we (four researchers specialized in SDGs) manually assessed and improved the accuracy of the matching syntax by using thousands of randomly-selected statements from corporate reports. We conducted three rounds of fine-tuning and finalized these keys.

Thanks for confirming this. Depending on word limit, I'd recommend including a line or two about this in the paper, or in the Github Readme, under a methodology section.

Besides the suggestion above, the authors have fully addressed all of my comments and made revisions where necessary. I recommend this submission to be accepted by JOSS.

caimeng2 commented 6 months ago

Hi @varsha2509,

Thanks for confirming this. Depending on word limit, I'd recommend including a line or two about this in the paper, or in the Github Readme, under a methodology section.

This is a great suggestion! We revised the paper accordingly (018e7e1) and added a methodology section to README.

We are glad to hear that you found our revisions satisfactory, and appreciate your recommendation for acceptance. Thank you again for your thorough review of our submission!

oliviaguest commented 6 months ago

Post-Review Checklist for Editor and Authors

Additional Author Tasks After Review is Complete

Editor Tasks Prior to Acceptance

oliviaguest commented 6 months ago

@caimeng2 what is left to do (other than the above)? ☺️

caimeng2 commented 6 months ago

@caimeng2 what is left to do (other than the above)? ☺️

I believe only the above. I will have the author tasks done by the end of this week.

caimeng2 commented 6 months ago

Hi @oliviaguest, I finished the author tasks listed above. Please let me know if there's anything else :nerd_face:

  • Double check authors and affiliations (including ORCIDs)

Checked

  • Make a release of the software with the latest changes from the review and post the version number here. This is the version that will be used in the JOSS paper.

v1.0

  • Archive the release on Zenodo/figshare/etc and post the DOI here.

DOI

  • Make sure that the title and author list (including ORCIDs) in the archive match those in the JOSS paper.

Checked

  • Make sure that the license listed for the archive is the same as the software license.

Checked

caimeng2 commented 6 months ago

@editorialbot generate pdf

caimeng2 commented 6 months ago

@editorialbot check references

editorialbot commented 6 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

editorialbot commented 6 months ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1002/bse.2195 is OK
- 10.21105/joss.05124 is OK
- 10.1016/j.enpol.2008.02.039 is OK
- 10.48550/arXiv.2211.11252 is OK
- 10.1007/s10668-016-9801-z is OK
- 10.3390/ECP2023-14728 is OK
- 10.5040/9781509934058.0025 is OK
- 10.1007/978-981-10-3521-0_31 is OK
- 10.3390/su14053095 is OK

MISSING DOIs

- No DOI given, and none found for title: SDG Auto Labeller
- No DOI given, and none found for title: EUR-SDG-Mapper
- No DOI given, and none found for title: UN-SDG-Classifier
- No DOI given, and none found for title: SDG-Classifier

INVALID DOIs

- None
oliviaguest commented 6 months ago

@editorialbot set 10.5281/zenodo.10854083 as archive

editorialbot commented 6 months ago

Done! archive is now 10.5281/zenodo.10854083

oliviaguest commented 6 months ago

@caimeng2 why is it Version: v1.2.0 above?

oliviaguest commented 6 months ago

@caimeng2 see: https://github.com/caimeng2/seesus/pull/3 ☺️

caimeng2 commented 5 months ago

@caimeng2 why is it Version: v1.2.0 above?

Ah, my bad. That was the version number for PyPI, which I totally forgot. I should have made them consistent.

caimeng2 commented 5 months ago

Hi @oliviaguest, I made a new release and redid the tasks above. Sorry about the inconvenience.

Double check authors and affiliations (including ORCIDs)

Checked

Make a release of the software with the latest changes from the review and post the version number here. This is the version that will be used in the JOSS paper.

v1.2.1

Archive the release on Zenodo/figshare/etc and post the DOI here.

DOI

Make sure that the title and author list (including ORCIDs) in the archive match those in the JOSS paper.

Checked

Make sure that the license listed for the archive is the same as the software license.

Checked

oliviaguest commented 5 months ago

@caimeng2 thank you!

oliviaguest commented 5 months ago

@editorialbot generate pdf

editorialbot commented 5 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

oliviaguest commented 5 months ago

@editorialbot set v1.2.1 as version

editorialbot commented 5 months ago

Done! version is now v1.2.1

oliviaguest commented 5 months ago

@caimeng2 is that the right version?

caimeng2 commented 5 months ago

@caimeng2 is that the right version?

Yes!

oliviaguest commented 5 months ago

@editorialbot recommend-accept

editorialbot commented 5 months ago
Attempting dry run of processing paper acceptance...