Closed sophieball closed 3 years ago
Traceback (most recent call last):
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/__main__/src/find_SE_words.py", line 6, in <module>
import fighting_words_py3 as fighting
File "/usr/local/google/home/emersonm/toxicity-detector/src/fighting_words_py3.py", line 3, in <module>
from sklearn.feature_extraction.text import CountVectorizer as CV
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/deps_pypi__scikit_learn_0_23_2/sklearn/__init__.py", line 80, in <module>
from .base import clone
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/deps_pypi__scikit_learn_0_23_2/sklearn/base.py", line 21, in <module>
from .utils import _IS_32BIT
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/deps_pypi__scikit_learn_0_23_2/sklearn/utils/__init__.py", line 20, in <module>
from scipy.sparse import issparse
ModuleNotFoundError: No module named 'scipy'
This seems to be an indirect import issue: sklearn imports scipy, which is missing from the runfiles. I've pushed new code in #79.
@CaptainEmerson You asked about the comment "move xx and xx to src/data". I talked about it here:
To run it, call `src/find_SE_words` from `feed_data.R`. The input should be the comments from the random 10K code reviews. The output will be saved in bazel-bin. There are 2 output files: `SE_words_G_zscores.csv` contains the ngrams and their z-scores, and `SE_words_G.list` contains the ngrams only. I kept only ngrams with z-score >= 1.96.
You can close this issue after you save these two files. They are needed by Naveen's file to remove SE words that might mess up the Perspective API.
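For anyone reading along, here is a rough sketch of how z-scored log odds ratios and the 1.96 cutoff could be computed. It assumes the common "fighting words" formulation with a uniform prior, unigrams only, and whitespace tokenization; the function names `log_odds_zscores` and `keep_significant` and the default `prior=0.01` are made up for illustration, and the actual `fighting_words_py3.py` may differ.

```python
import math
from collections import Counter


def log_odds_zscores(target_docs, other_docs, prior=0.01):
    """Return {token: z-score} of the log odds ratio, target vs. other corpus.

    Assumption: uniform prior pseudo-count per token; unigrams only.
    """
    # Count tokens in each corpus (naive lowercase + whitespace split).
    target = Counter(tok for doc in target_docs for tok in doc.lower().split())
    other = Counter(tok for doc in other_docs for tok in doc.lower().split())
    vocab = set(target) | set(other)

    a0 = prior * len(vocab)       # total prior mass over the vocabulary
    n_t = sum(target.values())    # total tokens in the target corpus
    n_o = sum(other.values())     # total tokens in the other corpus

    scores = {}
    for w in vocab:
        y_t = target[w] + prior   # smoothed count in target
        y_o = other[w] + prior    # smoothed count in other
        # Difference of smoothed log odds between the two corpora.
        delta = math.log(y_t / (n_t + a0 - y_t)) - math.log(y_o / (n_o + a0 - y_o))
        # Approximate variance of that difference.
        variance = 1.0 / y_t + 1.0 / y_o
        scores[w] = delta / math.sqrt(variance)
    return scores


def keep_significant(scores, threshold=1.96):
    """Keep tokens whose z-score is at or above the threshold, sorted."""
    return sorted(w for w, z in scores.items() if z >= threshold)
```

With a small prior, a token that never appears in the comparison corpus gets a large variance, so it can fall below the cutoff even though it looks distinctive.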
Thanks for the explicit instructions. I have now saved those files.
`src/find_SE_words.py` uses log odds ratio to find words that distinguish GH comments from other comments. I'm comparing GH comments against another toxicity dataset.
To run it, call `src/find_SE_words` from `feed_data.R`. The input should be the comments from the random 10K code reviews. The output will be saved in bazel-bin. There are 2 output files: `SE_words_G_zscores.csv` contains the ngrams and their z-scores, and `SE_words_G.list` contains the ngrams only. I kept only ngrams with z-score >= 1.96.
This list is used as the final step in Naveen's classifier: if a comment is predicted as toxic, replace these words with `POTATO` and rerun the Perspective API. If the result changes, it implies that the initial toxic prediction is due to some GH-specific terms.
@bvasiles One thing I've noticed: `kill` is not in the list, nor was it in Naveen's list. So I don't know how necessary/helpful this step is. Any other examples of potentially harmful SE words?
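To make the masking step concrete, here is a minimal sketch of the "replace and rerun" check. It is not Naveen's actual code: `mask_se_words`, `toxic_due_to_se_words`, the `score_fn` callable (a stand-in for whatever calls the Perspective API and returns a toxicity score), and the 0.5 threshold are all assumptions for illustration.

```python
import re


def mask_se_words(comment, se_words, placeholder="POTATO"):
    """Replace whole-word, case-insensitive matches of SE terms with a placeholder."""
    if not se_words:
        return comment
    # Try longer terms first so multi-word ngrams win over their sub-words.
    terms = sorted(se_words, key=len, reverse=True)
    # \b assumes each term starts and ends with a word character.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, comment)


def toxic_due_to_se_words(comment, se_words, score_fn, threshold=0.5):
    """True if a toxic prediction becomes non-toxic once SE terms are masked.

    score_fn is a hypothetical callable: text -> toxicity score in [0, 1].
    """
    if score_fn(comment) < threshold:
        return False  # not predicted toxic to begin with
    return score_fn(mask_se_words(comment, se_words)) < threshold
```

If `kill` were added to the list, a comment such as "kill the process" would be rescored as "POTATO the process", and a flipped prediction would point to the SE term as the cause.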