sophieball commented 4 years ago

I've put the top 20 fighting words in this form, one list for each way of preprocessing the text.
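For anyone new to the term, here is a minimal sketch of how a fighting-words score can be computed. It assumes the log-odds-ratio z-score with a uniform Dirichlet prior; the function name, the `prior` value, and the toy sentences are made up for illustration, and this is not the code in `src/convo_word_freq_diff.py`.

```python
import math
from collections import Counter


def fighting_words_zscores(class1_tokens, class2_tokens, prior=0.1):
    """Return {word: z-score}; positive scores lean toward class 1."""
    counts1 = Counter(class1_tokens)
    counts2 = Counter(class2_tokens)
    vocab = set(counts1) | set(counts2)
    n1 = sum(counts1.values())
    n2 = sum(counts2.values())
    a0 = prior * len(vocab)  # total mass of the uniform prior (assumption)

    zscores = {}
    for word in vocab:
        y1 = counts1[word] + prior
        y2 = counts2[word] + prior
        # Difference of the smoothed log-odds of the word in each class.
        delta = math.log(y1 / (n1 + a0 - y1)) - math.log(y2 / (n2 + a0 - y2))
        # Approximate variance of that difference.
        variance = 1.0 / y1 + 1.0 / y2
        zscores[word] = delta / math.sqrt(variance)
    return zscores


# Toy data, not taken from the real corpus.
toxic = "you are wrong and you know it".split()
non_toxic = "thanks for the patch it looks good".split()
scores = fighting_words_zscores(toxic, non_toxic)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

Words with large positive scores are over-represented in the first class and words with large negative scores in the second; a word used equally often in both ends up near zero.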

Some interesting points:

"You" is more often used in toxic comments.
But after removing stop words, "you" is gone.
We probably want to keep "you" because it's one of the politeness strategies.
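One way to keep "you" would be to subtract a small allow-list from the stop-word list before filtering. A rough sketch, assuming a hand-written stop-word set (`STOP_WORDS`, `KEEP`, and `remove_stop_words` are hypothetical names, not what the preprocessing currently uses):

```python
# Hypothetical stop-word list for illustration; a real run would use the
# list from whichever NLP library does the preprocessing.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "you", "your", "it"}
# Second-person pronouns we want to survive stop-word removal.
KEEP = {"you", "your"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, keep=KEEP):
    """Drop stop words, except the ones explicitly listed in `keep`."""
    blocked = stop_words - keep
    return [t for t in tokens if t.lower() not in blocked]


print(remove_stop_words("You are the problem".split()))
```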

sophieball commented 4 years ago

@CaptainEmerson Can you report the results from the following 4 files produced by src/convo_word_freq_diff.py in PR #41?

(1 csv) A list of fighting words will be saved in bazel-bin/main/feed_data.runfiles/__main__/fighting_words_freq.csv.
(2 plots) The same code also generates histograms of politeness strategies. Right now you need to save the plots manually; the first one is for label==1 and the second for label==0.
(1 txt) It also writes the words marked as different politeness strategies to bazel-bin/main/feed_data.runfiles/__main__/politeness_wrds_marked_sorted.txt, sorted by the number of times they are marked in the corpus. The list may be very long, but maybe it can give us some ideas about which domain-specific stopwords to remove?
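To make the sorting in the txt file concrete, here is a small sketch of counting marked words and listing them most frequent first. The `(strategy, word)` pairs and strategy names below are invented for illustration; the real script takes them from the politeness-strategy annotations.

```python
from collections import Counter

# Hypothetical (strategy, word) pairs, for illustration only.
marked = [
    ("Please", "please"),
    ("2nd_person", "you"),
    ("2nd_person", "you"),
    ("Gratitude", "thanks"),
    ("2nd_person", "your"),
    ("Gratitude", "thanks"),
    ("2nd_person", "you"),
]

counts = Counter(word for _strategy, word in marked)
# most_common() sorts by count, highest first.
lines = [f"{word}\t{count}" for word, count in counts.most_common()]
print("\n".join(lines))
```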

CaptainEmerson commented 4 years ago

Traceback (most recent call last):
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/__main__/src/convo_word_freq_diff.py", line 8, in <module>
    import download_data
  File "/usr/local/google/home/emersonm/toxicity-detector/src/download_data.py", line 6, in <module>
    import spacy
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__spacy_2_2_4/spacy/__init__.py", line 10, in <module>
    from thinc.neural.util import prefer_gpu, require_gpu
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__thinc_7_4_1/thinc/__init__.py", line 5, in <module>
    import numpy  # noqa: F401
ModuleNotFoundError: No module named 'numpy'

sophieball commented 4 years ago

Seems to be an indirect dependency issue; see #43.

sophieball commented 4 years ago

src/convo_word_freq_diff.py reads data from stdin. I use the same R script to call the bazel py_binary
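For reference, reading the input from stdin could look roughly like this. It assumes the data arrives as CSV with `text` and `label` columns, which is a guess at the format; `read_comments` is a hypothetical helper, not a function in the repo.

```python
import csv
import io
import sys


def read_comments(stream=None):
    """Read CSV rows from `stream`, defaulting to stdin."""
    stream = sys.stdin if stream is None else stream
    return [
        {"text": row["text"], "label": int(row["label"])}
        for row in csv.DictReader(stream)
    ]


# Demonstration with an in-memory stream standing in for piped input.
sample = io.StringIO("text,label\nyou broke the build,1\nthanks for the fix,0\n")
rows = read_comments(sample)
print(rows)
```

Calling `read_comments()` with no argument reads whatever the caller pipes in, which is how a wrapper script would feed it.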

CaptainEmerson commented 4 years ago

Traceback (most recent call last):
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/__main__/src/convo_word_freq_diff.py", line 12, in <module>
    from convokit import Corpus, Speaker, Utterance
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_3/convokit/__init__.py", line 1, in <module>
    from .model import *
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_3/convokit/model/__init__.py", line 1, in <module>
    from .conversation import Conversation
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_3/convokit/model/conversation.py", line 2, in <module>
    from .utterance import Utterance
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_3/convokit/model/utterance.py", line 4, in <module>
    from .speaker import Speaker
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_3/convokit/model/speaker.py", line 5, in <module>
    import pandas as pd
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__pandas_1_0_5/pandas/__init__.py", line 17, in <module>
    "Unable to import required dependencies:\n" + "\n".join(missing_dependencies)
ImportError: Unable to import required dependencies:
dateutil: No module named 'dateutil'

CaptainEmerson commented 4 years ago

(FYI, you can reassign back to me once you're done on your end.)

sophieball commented 4 years ago

Oh... this is definitely an indirect dependency problem =( I wonder if it's because I've set something somewhere so that I don't have to specify them?

CaptainEmerson commented 4 years ago

Looks like that figure issue is coming back to bite us:

Initializing default CountVectorizer with ngram_range (1, 5)... Done.
class1_func returned 504 valid utterances. class2_func returned 759 valid utterances.
Vocab size is 8478
Comparing language...
ngram zscores computed.
/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_4/convokit/politenessStrategies/politenessStrategies.py:94: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
  plt.show()

sophieball commented 4 years ago

#45: I copied their code for plotting the histograms and changed `plt.show()` to `plt.savefig()`.
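The idea, as a minimal sketch: with the non-GUI Agg backend, write the figure to a file instead of trying to display it. The strategy counts and the output file name are placeholders, and this is not the copied ConvoKit plotting code itself.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-GUI backend, as in the warning above
import matplotlib.pyplot as plt

# Hypothetical strategy counts, for illustration only.
strategies = {"2nd_person": 12, "Please": 3, "Gratitude": 5}

fig, ax = plt.subplots()
ax.bar(list(strategies), list(strategies.values()))
ax.set_ylabel("count")
fig.tight_layout()

out_path = os.path.join(tempfile.mkdtemp(), "politeness_label_1.png")
fig.savefig(out_path)  # writes a file instead of opening a window
plt.close(fig)
print(out_path)
```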

CaptainEmerson commented 4 years ago

https://drive.google.com/corp/drive/folders/1I3SBfTqNHFJM1nS9kufmGdJvlsKYMzUB

CaptainEmerson commented 4 years ago

Updated.

sophieball commented 4 years ago

@CaptainEmerson After you clean up the text a bit and approve my new PR #49 on removing unigrams from the analysis, can you run convo_word_freq_diff again and look at the fighting words?

CaptainEmerson commented 4 years ago

Text cleaned up. Looks a little more interesting.

sophieball / toxicity-detector

Top fighting words between toxic/non-toxic GH comments #28
