Closed: sophieball closed this issue 4 years ago.
@CaptainEmerson
Can you report the results in the following 4 files produced by src/convo_word_freq_diff.py
in PR #41:

- bazel-bin/main/feed_data.runfiles/__main__/fighting_words_freq.csv
- bazel-bin/main/feed_data.runfiles/__main__/politeness_wrds_marked_sorted.txt, sorted by the number of times each word is marked in the corpus. The list may be very long, but maybe it can give us some ideas about which domain-specific stopwords to remove?

Traceback (most recent call last):
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/__main__/src/convo_word_freq_diff.py", line 8, in <module>
import download_data
File "/usr/local/google/home/emersonm/toxicity-detector/src/download_data.py", line 6, in <module>
import spacy
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__spacy_2_2_4/spacy/__init__.py", line 10, in <module>
from thinc.neural.util import prefer_gpu, require_gpu
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__thinc_7_4_1/thinc/__init__.py", line 5, in <module>
import numpy # noqa: F401
ModuleNotFoundError: No module named 'numpy'
Seems to be an indirect dependency issue (#43).
src/convo_word_freq_diff.py reads data from stdin. I use the same R script to call the Bazel py_binary.
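For context, a minimal sketch of the stdin-reading pattern (illustrative only; the actual input format the script expects may differ):

```python
import io
import sys

def read_lines(stream=None):
    # Read newline-delimited records from the given stream; defaults to
    # stdin so the script can be fed by any caller, e.g. an R script
    # piping its output into the bazel-built py_binary.
    stream = stream if stream is not None else sys.stdin
    return [line.rstrip("\n") for line in stream]

# Demo with an in-memory stream standing in for stdin.
demo = io.StringIO("utt_1,hello\nutt_2,thanks\n")
print(read_lines(demo))  # → ['utt_1,hello', 'utt_2,thanks']
```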
Traceback (most recent call last):
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/__main__/src/convo_word_freq_diff.py", line 12, in <module>
from convokit import Corpus, Speaker, Utterance
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_3/convokit/__init__.py", line 1, in <module>
from .model import *
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_3/convokit/model/__init__.py", line 1, in <module>
from .conversation import Conversation
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_3/convokit/model/conversation.py", line 2, in <module>
from .utterance import Utterance
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_3/convokit/model/utterance.py", line 4, in <module>
from .speaker import Speaker
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_3/convokit/model/speaker.py", line 5, in <module>
import pandas as pd
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__pandas_1_0_5/pandas/__init__.py", line 17, in <module>
"Unable to import required dependencies:\n" + "\n".join(missing_dependencies)
ImportError: Unable to import required dependencies:
dateutil: No module named 'dateutil'
(FYI, you can reassign back to me once you're done on your end.)
Oh... this is definitely an indirect dependency problem =( I wonder if it's because I've set something somewhere so that I don't have to specify them?
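If it helps, one common workaround (a sketch only; the target and requirement names here are guesses, not the repo's actual BUILD contents) is to list the missing transitive packages explicitly, both in requirements.txt and in the binary's deps, so Bazel stages them into the runfiles:

```
# requirements.txt: pin the transitive deps the resolver missed
numpy
python-dateutil

# src/BUILD: list them on the binary as well (names are illustrative)
py_binary(
    name = "convo_word_freq_diff",
    srcs = ["convo_word_freq_diff.py"],
    deps = [
        requirement("convokit"),
        requirement("spacy"),
        requirement("numpy"),
        requirement("python-dateutil"),
    ],
)
```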
Looks like that figure issue is coming back to bite us:
Initializing default CountVectorizer with ngram_range (1, 5)... Done.
class1_func returned 504 valid utterances. class2_func returned 759 valid utterances.
Vocab size is 8478
Comparing language...
ngram zscores computed.
/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps_pypi__convokit_2_3_2_4/convokit/politenessStrategies/politenessStrategies.py:94: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
plt.show()
I'll change plt.show() to plt.savefig().
Updated.
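For the record, the pattern that avoids the Agg warning (a generic sketch, not the actual politenessStrategies code) is to write the figure to disk instead of calling plt.show():

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend: show() cannot open a window here
import matplotlib.pyplot as plt

def save_bar_plot(labels, counts, path):
    # Render a simple bar chart and write it to `path` instead of
    # displaying it, which is all a non-GUI backend can do.
    fig, ax = plt.subplots()
    ax.bar(labels, counts)
    fig.savefig(path)
    plt.close(fig)

out = os.path.join(tempfile.gettempdir(), "fighting_words_demo.png")
save_bar_plot(["word_a", "word_b"], [3, 5], out)
```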
@CaptainEmerson After you clean up the text a bit and approve my new PR #49 on removing unigrams from the analysis, can you run convo_word_freq_diff again and look at the fighting words?
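Side note for readers: the "fighting words" z-scores in the log above come from the weighted log-odds-ratio-with-Dirichlet-prior statistic of Monroe et al. (2008). A self-contained from-scratch sketch for unigrams (not ConvoKit's actual code, and using a simple symmetric prior):

```python
import math
from collections import Counter

def fighting_words_zscores(class1_tokens, class2_tokens, prior=0.01):
    # Weighted log-odds ratio with a symmetric Dirichlet prior; a positive
    # z-score means the word "fights" for class 1, negative for class 2.
    c1, c2 = Counter(class1_tokens), Counter(class2_tokens)
    vocab = set(c1) | set(c2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    a0 = prior * len(vocab)  # total pseudocount mass over the vocabulary
    z = {}
    for w in vocab:
        y1, y2 = c1[w], c2[w]
        delta = (math.log((y1 + prior) / (n1 + a0 - y1 - prior))
                 - math.log((y2 + prior) / (n2 + a0 - y2 - prior)))
        var = 1.0 / (y1 + prior) + 1.0 / (y2 + prior)  # approximate variance
        z[w] = delta / math.sqrt(var)
    return z

z = fighting_words_zscores(["bad", "bad", "wrong"], ["good", "good", "fine"])
print(sorted(z, key=z.get, reverse=True))  # → ['bad', 'wrong', 'fine', 'good']
```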
Text cleaned up. Looks a little more interesting.
I've put the top 20 fighting words, under different ways of preprocessing the text, in this form.
Some interesting points: