Closed: sophieball closed this 2 years ago
Sophie, do you want me to run this as well? If so, what target do you want me to run, and what output do you need?
Right! Should've mentioned.
Please run it on both pushback and linguistic datasets. The target is `train_classifier_g` and the outputs should be the 2 `.log` files, the 3 `roc_curve_*.png` files, and `features_xxxx.csv`. Thanks~
Also, @CaptainEmerson, when I was plotting feature importance, I noticed that the Google results are missing some politeness strategies, such as Apologizing. I used to drop them in `convo_politeness.py`, but not anymore. When you run the current code, can you check if `Apologizing` is among the features? Just a search in the `.log` file is sufficient.
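For example, something like this would be enough (a minimal sketch; `train_classifier.log` is an assumed path, adjust to wherever the run writes its log):

```python
# A minimal sketch of the check above. The log path is an assumed location;
# point it at whichever .log file the run actually produces.
from pathlib import Path

log_path = Path("train_classifier.log")  # hypothetical path
hits = [line.rstrip() for line in log_path.read_text().splitlines()
        if "Apologizing" in line]

print(f"'Apologizing' appears on {len(hits)} line(s)")
for line in hits[:5]:  # show a few matches for context
    print(line)
```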
Traceback (most recent call last):
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/main/train_classifier_g.py", line 16, in <module>
    from src import suite
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/src/suite.py", line 21, in <module>
    from src import convo_politeness
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/src/convo_politeness.py", line 37, in <module>
    f = open("src/data/speakers_bots_full.list")
FileNotFoundError: [Errno 2] No such file or directory: 'src/data/speakers_bots_full.list'
Traceback (most recent call last):
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/__main__/src/convo_word_freq_diff.py", line 9, in <module>
    from convokit import Corpus, Speaker, Utterance
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/__init__.py", line 4, in <module>
    from .politenessStrategies import *
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/politenessStrategies/__init__.py", line 1, in <module>
    from .politenessStrategies import *
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/politenessStrategies/politenessStrategies.py", line 5, in <module>
    from convokit.text_processing.textParser import process_text
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/text_processing/__init__.py", line 2, in <module>
    from .textParser import *
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/text_processing/textParser.py", line 2, in <module>
    import spacy
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__spacy/spacy/__init__.py", line 10, in <module>
    from thinc.api import prefer_gpu, require_gpu, require_cpu # noqa: F401
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/api.py", line 6, in <module>
    from .model import Model, serialize_attr, deserialize_attr
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/model.py", line 13, in <module>
    from .shims import Shim
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/shims/__init__.py", line 2, in <module>
    from .pytorch import PyTorchShim
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/shims/pytorch.py", line 18, in <module>
    from .pytorch_grad_scaler import PyTorchGradScaler
ModuleNotFoundError: No module named 'thinc.shims.pytorch_grad_scaler'
There were 22 warnings (use warnings() to see them)
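About the first traceback: `convo_politeness.py` opens `src/data/speakers_bots_full.list` with a path relative to the current working directory, so it only resolves when the binary is launched from the workspace root. As an illustration only (not the project's actual fix), one way to make the lookup independent of the working directory, assuming the list file ships next to the module as a data dependency:

```python
# Hedged sketch: resolve the bot-speaker list relative to this module instead
# of the current working directory. Assumes the file is available next to the
# module (e.g. declared as a Bazel data dependency of convo_politeness).
import os

_DATA_PATH = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),  # .../src
    "data",
    "speakers_bots_full.list",
)

with open(_DATA_PATH) as f:
    bot_speakers = {line.strip() for line in f if line.strip()}
```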
I tried the current commit in a new directory.. It should work.. No rush, though. This new output won't be too different from the previous one
> Right! Should've mentioned. Please run it on both pushback and linguistic datasets. The target is `train_classifier_g` and the outputs should be the 2 `.log` files, the 3 `roc_curve_*.png` files, and `features_xxxx.csv`. Thanks~
Added a ROC curve per Bogdan's suggestion. There will be 3 `.png` files.
Traceback (most recent call last):
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/__main__/src/convo_word_freq_diff.py", line 9, in <module>
    from convokit import Corpus, Speaker, Utterance
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/__init__.py", line 4, in <module>
    from .politenessStrategies import *
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/politenessStrategies/__init__.py", line 1, in <module>
    from .politenessStrategies import *
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/politenessStrategies/politenessStrategies.py", line 5, in <module>
    from convokit.text_processing.textParser import process_text
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/text_processing/__init__.py", line 2, in <module>
    from .textParser import *
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/text_processing/textParser.py", line 2, in <module>
    import spacy
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__spacy/spacy/__init__.py", line 10, in <module>
    from thinc.api import prefer_gpu, require_gpu, require_cpu # noqa: F401
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/api.py", line 6, in <module>
    from .model import Model, serialize_attr, deserialize_attr
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/model.py", line 13, in <module>
    from .shims import Shim
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/shims/__init__.py", line 2, in <module>
    from .pytorch import PyTorchShim
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/shims/pytorch.py", line 18, in <module>
    from .pytorch_grad_scaler import PyTorchGradScaler
ModuleNotFoundError: No module named 'thinc.shims.pytorch_grad_scaler'
Probably because we previously limited thinc's version. The missing module is in the newest version: https://github.com/explosion/thinc/blob/master/thinc/shims/pytorch_grad_scaler.py
I removed the constraint
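A quick way to sanity-check which thinc actually gets resolved after dropping the pin (just a sketch, not part of the repo):

```python
# Print the resolved thinc version and confirm the shim that the traceback
# complains about can be imported.
import thinc

print("thinc version:", thinc.__version__)
try:
    import thinc.shims.pytorch_grad_scaler  # noqa: F401
    print("thinc.shims.pytorch_grad_scaler: found")
except ModuleNotFoundError as err:
    print("thinc.shims.pytorch_grad_scaler: missing ->", err)
```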
Traceback (most recent call last):
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/__main__/src/convo_word_freq_diff.py", line 9, in <module>
    from convokit import Corpus, Speaker, Utterance
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/__init__.py", line 4, in <module>
    from .politenessStrategies import *
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/politenessStrategies/__init__.py", line 1, in <module>
    from .politenessStrategies import *
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/politenessStrategies/politenessStrategies.py", line 5, in <module>
    from convokit.text_processing.textParser import process_text
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/text_processing/__init__.py", line 2, in <module>
    from .textParser import *
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__convokit/convokit/text_processing/textParser.py", line 2, in <module>
    import spacy
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__spacy/spacy/__init__.py", line 10, in <module>
    from thinc.api import prefer_gpu, require_gpu, require_cpu # noqa: F401
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/api.py", line 6, in <module>
    from .model import Model, serialize_attr, deserialize_attr
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/model.py", line 13, in <module>
    from .shims import Shim
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/shims/__init__.py", line 2, in <module>
    from .pytorch import PyTorchShim
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/src/convo_word_freq_diff.runfiles/deps/pypi__thinc/thinc/shims/pytorch.py", line 18, in <module>
    from .pytorch_grad_scaler import PyTorchGradScaler
ModuleNotFoundError: No module named 'thinc.shims.pytorch_grad_scaler'
:/ I forced it to be the latest version..
I still have that error. It does look like train_classifier.log is created.
Did you try `bazel clean` then rebuild?
@CaptainEmerson When you finish the current run, can you pull the newest commit and run it again? I need all four results (previous commit with G and G-ling, new commit with G and G-ling). Thanks!
The dependency issue is fixed, it looks like. But I get:
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
Log saved in `/usr/local/google/home/emersonm/toxicity-detector/train_classifier.log`
sh: line 1: bazel-bin/src/convo_word_freq_diff: No such file or directory
I'm running from here:
setwd("~/toxicity-detector")
Can you see if `src/convo_word_freq_diff.py` is there? It's in the repo.
Also, no need to run the old code if you haven't done so. Just run the most up-to-date version.
The difference is that t-tests show that adjusting for SE words degrades the results, so I'm reporting results before adjustment; otherwise I don't know how to defend why we do the adjustment.
> Can you see if `src/convo_word_freq_diff.py` is there? It's in the repo.
Yes, but it's not in bazel-bin. Does main/train_classifier_g depend on src/convo_word_freq_diff.py?
Yes, it's under feed_data
r_binary(
    name = "feed_data",
    src = "feed_data.R",
    data = [
        ":train_classifier_g",
        ":train_polite_score",
        "//src:convo_word_freq_diff",
        "//src:find_SE_words",
        "//main:train_prompt_types",
        "//main:train_polite_prompt_classifier",
    ],
    deps = [
        ":politeness_logi",
        "@R_plyr",
        "@R_readr",
    ],
)
But train_classifier_g doesn't depend on feed_data or convo_word_freq_diff, right?
Right! `feed_data` is my own thing. You have something else. Nothing you need to run depends on `convo_word_freq_diff`.
Ah, you are right, I think that the dependencies are fine. Do you need new results from convo_word_freq_diff?
No. I only need the `.log` and the `.png`s.
The pngs are getting overwritten on each run, right? I did a run yesterday/last night, and it ran both the regular version and then the linguistic version. So while I have the log files from both runs, I guess the PNGs I have are only the linguistic ones.
I can reconstruct graphs from logs
I've uploaded the png files and the logs. I've started a run again with the new build.
Some notes:
No, just the most recent version on both datasets.
I'm changing the logging.
Removed `to_csv`.
I've uploaded the newest logs and pngs. I think that's all you need from me right now?
Yes! Thanks
@CaptainEmerson Can I merge this?
yep, sg
Perform t-test on test data rather than training data. The t-stats are qualitatively the same (same direction) but some of them are not significant on the held-out test data.
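For reference, the general shape of that kind of comparison (illustrative only; the column names here are hypothetical, and the actual features and test live in the repo):

```python
# Sketch: Welch's two-sample t-test for one feature, run on the held-out test
# split rather than the training split. Column names are hypothetical.
import pandas as pd
from scipy import stats


def feature_ttest(df: pd.DataFrame, feature: str, label: str = "label"):
    """Compare a feature between the two label groups with Welch's t-test."""
    pos = df.loc[df[label] == 1, feature]
    neg = df.loc[df[label] == 0, feature]
    return stats.ttest_ind(pos, neg, equal_var=False)


# e.g. on the test split only:
# t_stat, p_value = feature_ttest(test_df, "Apologizing")
```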