sophieball / toxicity-detector

MIT License

Get SE words #76

Closed sophieball closed 3 years ago

sophieball commented 4 years ago

src/find_SE_words.py uses the log odds ratio to find words that distinguish GH comments from other comments. I'm comparing GH comments against another toxicity dataset.
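
Roughly, the computation looks like this (a simplified sketch assuming two lists of raw comment strings; the real src/fighting_words_py3.py, presumably based on Monroe et al.'s weighted log-odds "fighting words" method, may differ in details):

```python
# Simplified sketch of the weighted log odds ratio (z-score) computation.
# Assumes `gh_comments` and `other_comments` are lists of raw comment strings;
# the actual src/fighting_words_py3.py may differ in details.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def log_odds_zscores(gh_comments, other_comments, prior=0.01, ngram_range=(1, 2)):
    cv = CountVectorizer(ngram_range=ngram_range)
    counts = cv.fit_transform(gh_comments + other_comments).toarray()
    gh = counts[:len(gh_comments)].sum(axis=0)
    other = counts[len(gh_comments):].sum(axis=0)

    # Log odds with a symmetric Dirichlet prior, normalized by an estimated
    # standard deviation to get z-scores.
    a0 = prior * counts.shape[1]
    delta = (np.log((gh + prior) / (gh.sum() + a0 - gh - prior))
             - np.log((other + prior) / (other.sum() + a0 - other - prior)))
    z = delta / np.sqrt(1.0 / (gh + prior) + 1.0 / (other + prior))

    # get_feature_names_out() on recent scikit-learn; on the 0.23.x pinned in
    # this repo it would be get_feature_names().
    return dict(zip(cv.get_feature_names_out(), z))
```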

To run it, call src/find_SE_words from feed_data.R. The input should be the comments from the random 10K code reviews. The output will be saved in bazel-bin. There are two output files: SE_words_G_zscores.csv contains ngrams and their z-scores, and SE_words_G.list contains ngrams only. I kept only ngrams with z-score >= 1.96.
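
The filtering/writing step is roughly the following (file names as above; the exact column layout is an assumption, not necessarily the script's output):

```python
# Hypothetical sketch of the filtering step: keep ngrams with z-score >= 1.96
# and write the two output files described above.
import pandas as pd

zscores = log_odds_zscores(gh_comments, other_comments)  # from the sketch above
df = pd.DataFrame(sorted(zscores.items()), columns=["ngram", "zscore"])
kept = df[df["zscore"] >= 1.96]

kept.to_csv("SE_words_G_zscores.csv", index=False)                   # ngram + z-score
kept["ngram"].to_csv("SE_words_G.list", index=False, header=False)   # ngram only
```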

This list is used as the final step in Naveen's classifier: if a comment is predicted as toxic, replace these words with POTATO and rerun the Perspective API. If the result changes, it implies that the initial toxic prediction was due to GH-specific terms.
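
In rough form, the check is something like this; `score_with_perspective` is a placeholder for the Perspective API call, and the 0.5 threshold is an assumption:

```python
# Sketch of the masking check: if a comment is predicted toxic, replace the
# SE-specific ngrams with POTATO and re-score it.
import re

def toxicity_from_se_terms(comment, se_words, score_with_perspective, threshold=0.5):
    """Return True if masking SE-specific terms flips a toxic prediction."""
    if score_with_perspective(comment) < threshold:
        return False  # not predicted toxic in the first place
    masked = comment
    for w in se_words:
        masked = re.sub(r"\b" + re.escape(w) + r"\b", "POTATO", masked,
                        flags=re.IGNORECASE)
    # If the masked comment no longer scores as toxic, the original prediction
    # was likely driven by GH-specific vocabulary rather than toxic language.
    return score_with_perspective(masked) < threshold
```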

@bvasiles One thing I've noticed: kill is not in the list, nor was it in Naveen's list, so I don't know how necessary/helpful this step is. Any other examples of potentially harmful SE words?

CaptainEmerson commented 4 years ago
Traceback (most recent call last):
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/__main__/src/find_SE_words.py", line 6, in <module>
    import fighting_words_py3 as fighting
  File "/usr/local/google/home/emersonm/toxicity-detector/src/fighting_words_py3.py", line 3, in <module>
    from sklearn.feature_extraction.text import CountVectorizer as CV
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/deps_pypi__scikit_learn_0_23_2/sklearn/__init__.py", line 80, in <module>
    from .base import clone
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/deps_pypi__scikit_learn_0_23_2/sklearn/base.py", line 21, in <module>
    from .utils import _IS_32BIT
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/deps_pypi__scikit_learn_0_23_2/sklearn/utils/__init__.py", line 20, in <module>
    from scipy.sparse import issparse
ModuleNotFoundError: No module named 'scipy'
sophieball commented 4 years ago
Traceback (most recent call last):
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/__main__/src/find_SE_words.py", line 6, in <module>
    import fighting_words_py3 as fighting
  File "/usr/local/google/home/emersonm/toxicity-detector/src/fighting_words_py3.py", line 3, in <module>
    from sklearn.feature_extraction.text import CountVectorizer as CV
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/deps_pypi__scikit_learn_0_23_2/sklearn/__init__.py", line 80, in <module>
    from .base import clone
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/deps_pypi__scikit_learn_0_23_2/sklearn/base.py", line 21, in <module>
    from .utils import _IS_32BIT
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/find_SE_words.runfiles/deps_pypi__scikit_learn_0_23_2/sklearn/utils/__init__.py", line 20, in <module>
    from scipy.sparse import issparse
ModuleNotFoundError: No module named 'scipy'

This seems to be an indirect import issue. I've pushed new code in #79.
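
A typical fix for this kind of indirect import in a Bazel/rules_python setup is to declare scipy as an explicit dependency of the target so it lands in the runfiles. A rough sketch of the BUILD change (the load label and target names here are guesses; see #79 for the actual change):

```python
# BUILD sketch (guessed labels): make scipy available in the runfiles so
# sklearn's transitive `from scipy.sparse import issparse` resolves.
load("@deps_pypi//:requirements.bzl", "requirement")

py_binary(
    name = "find_SE_words",
    srcs = ["find_SE_words.py"],
    deps = [
        ":fighting_words_py3",
        requirement("scikit-learn"),
        requirement("scipy"),  # explicit, because sklearn imports it at runtime
    ],
)
```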

sophieball commented 3 years ago

@CaptainEmerson You asked about the comment "move xx and xx to src/data". I talked about it here:

To run it, call src/find_SE_words from feed_data.R. The input should be the comments from the random 10K code reviews. The output will be saved in bazel-bin. There are two output files: SE_words_G_zscores.csv contains ngrams and their z-scores, and SE_words_G.list contains ngrams only. I kept only ngrams with z-score >= 1.96.

You can close this issue after you save these two files. They are needed by Naveen's classifier to remove SE words that might mess up the Perspective API.

CaptainEmerson commented 3 years ago

Thanks for the explicit instructions. I have now saved those files.