scipy-conference / scipy_proceedings

Tools used to generate the SciPy conference proceedings

Paper: NLP-in-the-real-world: Tool for building NLP solutions #921

Closed jsingh811 closed 2 months ago

jsingh811 commented 4 months ago

If you are creating this PR in order to submit a draft of your paper, please name your PR with Paper: <title>. An editor will then add a paper label and GitHub Actions will be run to check and build your paper.

See the project readme for more information.

Editor: Hongsup Shin @hongsupshin

Reviewers: @janeadams, @aparoha

github-actions[bot] commented 4 months ago

Curvenote Preview

| Directory | Preview | Checks | Updated (UTC) |
| --- | --- | --- | --- |
| papers/jyotika_singh | 🔍 Inspect | ✅ 45 checks passed (10 optional) | Jun 30, 2024, 7:58 AM |
janeadams commented 3 months ago

Hi! I'm excited to be reviewing in this new format, but I admit that I am a little lost... Where can I find the code that accompanies the paper? When I click the link on this page to open Jupyter notebooks I get the error "error - Error: repo is required for github provider - Failed to connect to binderhub at https://xhrtcvh6l53u.curvenote.dev/services/binder/".

Due to the reliance of the paper on the content of the notebooks, it is important that I be able to read them together to effectively assess the contribution. Thanks!

Jane

jsingh811 commented 3 months ago

Hi @janeadams, it is very nice to meet you. Thank you very much for agreeing to review this paper! I look forward to your review. I clicked on the URL you shared yesterday from my phone, and I could see a Jupyter notebooks link at the top showing me the same error. I was a bit confused, as I hadn't seen that link in my local environment and didn't remember setting it up either. As I got ready to dig into this today, I noticed I no longer see the Jupyter notebooks link on that page from my laptop or my phone. Could you kindly confirm this is the case at your end as well? Any code this paper references is linked in the contents of the paper itself (https://github.com/jsingh811/NLP-in-the-real-world). Thanks!

rowanc1 commented 3 months ago

Hi @janeadams and @jsingh811 - just a quick comment on the technical side about the "open Jupyter notebooks" link. This was briefly enabled last week, and we have disabled it; it is something we are excited to pilot, and hopefully it will be available more generally for the 2025 proceedings.

janeadams commented 3 months ago

After reviewing the notebooks, I have some concerns about this submission. My primary concern is that the submission here is only a small supplement to the primary contribution, which is the code, which is itself a supplement to an existing published work. I am not sure that this submission meets the basic criteria for SciPy proceedings, as it is not a novel contribution. Evaluating only the paper which has been submitted here:

  • There is no explanation of how / why certain libraries were chosen over others. Why use word clouds for visualization when the visualization research community has roundly recommended against word clouds for visual analytics and far more robust tools exist, especially in the context of NLP?
  • The paper appears heavily reliant on the book; this isn't in the spirit of SciPy which aims to make proceedings accessible in full
  • Given the rapid rate of change in this field, "several functionality and software updates will aid in keeping pace with the rapid advancements in the field" feels like an insufficient way to address whether/when code will become stale. See my comments below on the code -- I think that a wiki referencing documentation would be at far lower risk of this problem.

Additionally, I have several concerns about the code itself. While the code is not included in this submission, and therefore I can't comment on it directly in this repo, I noticed several major problems:

  • The README does not follow standard README formats (e.g. "Installation", "Getting Started", etc.)
  • There are no documents that I would expect for code of this nature, such as a requirements.txt
  • The notebooks contain numerous in-line pip installs. These should all be in a requirements.txt instead, and cell outputs shouldn't be included in the notebooks. In general, it is not standard practice to encourage users to execute shell prompts from within Jupyter Notebooks
  • Additionally, cell outputs with information about the contributors' computer should be cleared (see the sketch after this list), e.g. "Downloading package wordnet to /Users/jsingh/nltk_data..."
  • Some in-line code comments encourage non-universal or potentially dangerous practices for new users, e.g. "brew install [x]"
  • Many of the examples included in the notebooks appear to be taken almost identically from package documentation. Why is it necessary to copy this information over? Why not just build a wiki with links to documentation, so that things can stay up to date?
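As a concrete illustration of the cell-output point, here is a minimal sketch of how stored outputs could be cleared in bulk with nbformat (illustrative only, not part of the submission; the glob pattern is an assumption about the repo layout):

```python
import glob

import nbformat

# Clear stored outputs and execution counts so the notebooks do not ship
# machine-specific paths (e.g. /Users/jsingh/nltk_data) or stale results.
# The path pattern below is illustrative; adjust it to the repo layout.
for path in glob.glob("**/*.ipynb", recursive=True):
    nb = nbformat.read(path, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == "code":
            cell.outputs = []
            cell.execution_count = None
    nbformat.write(nb, path)
```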

I am not sure that these changes are within the scope of the review cycle, and would recommend foregoing inclusion of this submission in this year's SciPy proceedings. I think the notebooks might make a helpful foundation for a workshop (which is far more ephemeral and therefore not beholden to longevity concerns the way archival proceedings are) and would encourage the authors to consider building a wiki to reference package documentation, so that the maintainers of the packages referenced are always the go-to source for the latest implementation. I would have liked to see more content about why a user should choose one method over another, as I believe this guidance would have been a more valuable and novel contribution.

jsingh811 commented 3 months ago

> After reviewing the notebooks, I have some concerns about this submission. My primary concern is that the submission here is only a small supplement to the primary contribution, which is the code, which is itself a supplement to an existing published work. I am not sure that this submission meets the basic criteria for SciPy proceedings, as it is not a novel contribution. Evaluating only the paper which has been submitted here:
>
>   • There is no explanation of how / why certain libraries were chosen over others. Why use word clouds for visualization when the visualization research community has roundly recommended against word clouds for visual analytics and far more robust tools exist, especially in the context of NLP?
>   • The paper appears heavily reliant on the book; this isn't in the spirit of SciPy which aims to make proceedings accessible in full
>   • Given the rapid rate of change in this field, "several functionality and software updates will aid in keeping pace with the rapid advancements in the field" feels like an insufficient way to address whether/when code will become stale. See my comments below on the code -- I think that a wiki referencing documentation would be at far lower risk of this problem.
>
> Additionally, I have several concerns about the code itself. While the code is not included in this submission, and therefore I can't comment on it directly in this repo, I noticed several major problems:
>
>   • The README does not follow standard README formats (e.g. "Installation", "Getting Started", etc.)
>   • There are no documents that I would expect for code of this nature, such as a requirements.txt
>   • The notebooks contain numerous in-line pip installs. These should all be in a requirements.txt instead, and cell outputs shouldn't be included in the notebooks. In general, it is not standard practice to encourage users to execute shell prompts from within Jupyter Notebooks
>   • Additionally, cell outputs with information about the contributors' computer should be cleared, e.g. "Downloading package wordnet to /Users/jsingh/nltk_data..."
>   • Some in-line code comments encourage non-universal or potentially dangerous practices for new users, e.g. "brew install [x]"
>   • Many of the examples included in the notebooks appear to be taken almost identically from package documentation. Why is it necessary to copy this information over? Why not just build a wiki with links to documentation, so that things can stay up to date?
>
> I am not sure that these changes are within the scope of the review cycle, and would recommend foregoing inclusion of this submission in this year's SciPy proceedings. I think the notebooks might make a helpful foundation for a workshop (which is far more ephemeral and therefore not beholden to longevity concerns the way archival proceedings are) and would encourage the authors to consider building a wiki to reference package documentation, so that the maintainers of the packages referenced are always the go-to source for the latest implementation. I would have liked to see more content about why a user should choose one method over another, as I believe this guidance would have been a more valuable and novel contribution.

Hi

Thank you for your comments.

I observe that most of the feedback is heavily centered on the notebooks, and I want to highlight that the notebooks are not the main contribution of this paper. I feel the tool selection (toolkit) part of the submission might have been missed.

The repo contains 1) notebooks and 2) a toolkit component. The notebooks are an augmentation to a book; the toolkit is not, though it is certainly inspired by those previous contributions. The toolkit also contains a README with instructions and a requirements.txt.

The notebooks are not meant to be the star of the submission; I mention them only because this submission describes the repo as a whole. The notebooks are designed to serve as standalone guides for users, independent of other files, which is typically how people start coding in this field and use them for experimentation.

Please find my responses to some of your other comments below.

> There is no explanation of how / why certain libraries were chosen over others.

This information is shared in the paper. Kindly see the Tool Selection section. Example: There are many open-source pre-trained models that can be used for a majority of data types for sentiment analysis. The different models are trained using various sources of data. Choosing the model that is likely to be better on your type of data has the best chances to give you desirable results. For instance, VADER may be preferable if your data contains informal language, whereas TextBlob may be better if the language in your text is more formal. Many factors play into this choice, including the following.....

If the text for which you want to find similar pieces in the corpus does not have representation in the corpus, then you need to opt for models with a more general understanding, thus pre-trained models would be useful.

If you recommend adding more such details, I can do that.

> The paper appears heavily reliant on the book; this isn't in the spirit of SciPy which aims to make proceedings accessible in full

The toolkit component is not reliant on the book; any reader without access to other materials will be able to follow it. The notebooks go hand-in-hand with the book, which is why I have described what they contain in the paper, to provide context so a user can refer to them without access to any other materials.

> word clouds

There is a mix of opinions on word clouds. Word clouds can be engaging and easy to digest, though it is best not to expect in-depth analysis or comparisons from them. Word clouds are still quite heavily used and in some cases even preferred. That said, nothing is a universal solution, and the user has to apply it with caution and understanding. There are plans to add more visualization tools in a future version of the toolkit.

I've not addressed some of the other comments, since they are focused on the notebooks rather than the tool selection toolkit.

Here is what I plan to do to address your comments:

My intent is to showcase the tool, with a focus on the toolkit. I gave the Tool Selection section the most content in the paper as well, in an attempt to make it the bulk of the submission.

The toolkit, in my opinion, is the start of a very useful decision-making assistant for developers in this space, and represents novelty. When getting started on a new NLP task, data scientists need to try various tools, experiment with solutions, and learn from manual data observations, all to gauge a suitable approach for solving the problem. The toolkit brings together experience and knowledge about the underlying data in different models to aid in this process. The alternative is to spend time experimenting and reading a lot of material to short-list tools depending on the data and problem. According to this report (https://businessoverbroadway.com/2019/02/19/how-do-data-professionals-spend-their-time-on-data-science-projects), data scientists spend 20% of their time building and selecting models, which leaves a lot of room for reduction and optimization. This time is even larger for individuals working on a new problem that they haven't worked on before.

I hope the above clarification helps address your comments. Either way, I appreciate your review; thank you for taking the time.

janeadams commented 3 months ago

My apologies for not fully understanding what is and isn't part of the submission... This method of reviewing is new to me.

I think a diagram would be a great addition to the paper. I see that the contributions are buried in the toolkit, and for the paper to be a contribution on its own, I think the best approach would be to pull this information out into text and tables or figures.

For example, these two excerpts from the toolkit give examples of conditions under which one model would be better suited than another:

if data["corpus_domain_specific"] and data["sample_likely_represented_in_corpus"]: model = "tfidf" elif ( data["corpus_domain_specific"] is False or data["sample_likely_represented_in_corpus"] is False ): model = "spacy" else: model = "spacy"

and

STYLE = {
    "formal": "textblob",
    "informal": "vader",
    "mixed": "textblob",
}
TYPOS = {
    "many_nondict_terms": "vader",
    "some_nondict_terms": "textblob",
    "mostly_clean": "textblob",
}
SOURCE = {
    "social_media": "vader",
    "review_comments": "textblob",
    "articles": "textblob",
}

I see that these choices are briefly alluded to in the paper, but I think for the paper to stand on its own, those choices should be pulled out and explained. Why is VADER better for social media sources than review comments? Why is spacy better in cases where the corpus is not domain specific? As a reader, I would like to be able to come with my own data set and think about these criteria in my use case context. Importantly, a reader should come away feeling more confident that there is a statistical reason for choosing one model or method over another. This way, readers can apply their own reasoning too -- perhaps they are more concerned about false positives than false negatives, or care less about style and more about typos. By allowing a user to read the affordances and trade-offs of various models side by side in a table, or even just by reading through the paper, they can make informed decisions without having to fill in function calls as if they were a form.
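To make that concrete, here is a minimal sketch of what I mean (the function and parameter names are hypothetical, not the toolkit's actual API), with the reason for each choice stated alongside it rather than hidden in a lookup:

```python
# Hypothetical illustration only: the names and heuristics are assumptions
# drawn from the excerpts above, not the toolkit's actual API.

def recommend_sentiment_model(style, source):
    """Return (model, reason) for a sentiment-analysis choice."""
    if style == "informal" or source == "social_media":
        # VADER is a lexicon- and rule-based analyzer tuned for social-media
        # language (slang, emoticons, capitalization for emphasis).
        return "vader", "informal, social-media-style language"
    # TextBlob's lexicon-based polarity tends to do well on grammatical,
    # well-formed prose such as articles and review comments.
    return "textblob", "formal or mixed, mostly well-formed text"


def recommend_similarity_model(corpus_domain_specific, sample_likely_in_corpus):
    """Return (model, reason) for a text-similarity choice."""
    if corpus_domain_specific and sample_likely_in_corpus:
        # TF-IDF only "knows" the corpus vocabulary, so it works best when
        # the query text is well represented in that corpus.
        return "tfidf", "query vocabulary is covered by the corpus"
    # Pretrained spaCy vectors carry a more general notion of meaning and
    # degrade more gracefully on out-of-corpus text.
    return "spacy", "query may fall outside the corpus vocabulary"


print(recommend_sentiment_model("informal", "social_media"))
print(recommend_similarity_model(False, True))
```

A table in the paper with the same three columns (criterion, recommended model, reason) would serve the same purpose for readers.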

In its current state, the paper relies too heavily on code outside of the submission. Rather than referencing that code (about which I still have concerns that I think are out of scope here), the paper would be most beneficial as an archival submission if the focus were more on the heuristics and reasoning about which criteria make a model or method more or less suited to a task, rather than reading, as it currently does, as supplemental to code outside this repo.

jsingh811 commented 3 months ago

Thank you for your revised comments, @janeadams. I have now made revisions to address them. I have cleaned up the paper quite a bit, and removed content pointing to any portions of the tool other than the toolkit, to avoid confusion and make the content more pointed and useful standalone. Additionally, I have added a lot more detail on the tool selection portion itself based on your comment. Thanks.

@aparoha thanks for your comment. I am not certain at this point whether there are two reviewers on this. I also received your comment while I was in the process of making updates based on the previous review. Your comment overlaps with @janeadams's earlier comments, which I have now addressed. Thanks.

hongsupshin commented 3 months ago

Hi @jsingh811, my name is Hongsup Shin, and I am a Proceedings co-chair and your paper's editor. Thank you for submitting the manuscript and making revisions based on the reviewers' comments.

First of all, to be clear, we assign two reviewers per SciPy proceedings PR. And your reviewers are @janeadams and @aparoha as noted in the first comment of the PR.

While I appreciate your revisions, I am afraid that the main criticisms from both reviewers are still not sufficiently addressed. I very much agree with the comments from both reviewers, @janeadams and @aparoha. The repo is still supporting material for the book you authored, and it lacks the elements needed to be considered a decision-making tool.

I have raised my and the reviewers' concerns with the other chairs of the Proceedings and Program committees, and members of both committees already favor rejection. At this point, we think the paper requires rewriting rather than mere revision in order to proceed. You have until Aug 7 to address the reviews.

Sincerely, Hongsup Shin

jsingh811 commented 3 months ago

Hi @hongsupshin Thanks for sharing the details. I appreciate the feedback.

It looks like the major point of contention is the toolkit repo's association with the book and its associated software. To address it, I can move the toolkit into its own repo and create a Python library installable via pip, thereby isolating it completely from the book, and I can ensure the paper reflects this in the same way. However, it would be very helpful to get confirmation on whether this is sufficient to address the mentioned concerns and clear the path for the submission; if so, I would be more than happy to get it done within the stated timeline. If not, then I believe I don't have any other options at my disposal at this time to address the concerns, and would have to forgo the submission.

Either way, I would like to thank you and the reviewers for taking the time to provide feedback.

hongsupshin commented 3 months ago

@jsingh811 Hi, unfortunately I think you are still missing the main criticism. For instance, below are some of the major comments from @janeadams, and I don't think you have addressed them yet.

> After reviewing the notebooks, I have some concerns about this submission. My primary concern is that the submission here is only a small supplement to the primary contribution, which is the code, which is itself a supplement to an existing published work. I am not sure that this submission meets the basic criteria for SciPy proceedings, as it is not a novel contribution. Evaluating only the paper which has been submitted here:

> There is no explanation of how / why certain libraries were chosen over others. Why use word clouds for visualization when the visualization research community has roundly recommended against word clouds for visual analytics and far more robust tools exist, especially in the context of NLP?

> The paper appears heavily reliant on the book; this isn't in the spirit of SciPy which aims to make proceedings accessible in full

> Given the rapid rate of change in this field, "several functionality and software updates will aid in keeping pace with the rapid advancements in the field" feels like an insufficient way to address whether/when code will become stale. See my comments below on the code -- I think that a wiki referencing documentation would be at far lower risk of this problem.

hongsupshin commented 2 months ago

@jsingh811 Hi, we noticed that your last commit message said "remove paper" and we just wanted to confirm whether this was your final decision. Would you kindly verify this?

jsingh811 commented 1 month ago

> @jsingh811 Hi, we noticed that your last commit message said "remove paper" and we just wanted to confirm whether this was your final decision. Would you kindly verify this?

I confirm. Thanks.

hongsupshin commented 1 week ago

Thanks @janeadams for reviewing the paper!