sophieball / toxicity-detector


A classifier using prompt types and politeness strategies #75

Closed sophieball closed 4 years ago

sophieball commented 4 years ago
```
Traceback (most recent call last):
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_polite_prompt_classifier.runfiles/__main__/main/train_polite_prompt_classifier.py", line 61, in <module>
    comments_10K = pd.read_csv("src/data/random_sample_10000_prs_body_comments.csv")
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_polite_prompt_classifier.runfiles/deps_pypi__pandas_1_1_0/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_polite_prompt_classifier.runfiles/deps_pypi__pandas_1_1_0/pandas/io/parsers.py", line 452, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_polite_prompt_classifier.runfiles/deps_pypi__pandas_1_1_0/pandas/io/parsers.py", line 936, in __init__
    self._make_engine(self.engine)
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_polite_prompt_classifier.runfiles/deps_pypi__pandas_1_1_0/pandas/io/parsers.py", line 1168, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_polite_prompt_classifier.runfiles/deps_pypi__pandas_1_1_0/pandas/io/parsers.py", line 1998, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 361, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 653, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: 'src/data/random_sample_10000_prs_body_comments.csv'
```

> That file's not in my repo.

Oh, sorry - I forgot to clarify: this should be your sampled 10K CLs.

CaptainEmerson commented 4 years ago

Can I do that without writing the comments to disk?

sophieball commented 4 years ago

> Can I do that without writing the comments to disk?

I removed dumping the model, and I checked - I'm not writing the prompt type summary to disk.

But do you still have the output from main/train_prompt_types? There are some comments in it - you can totally remove those. I'm just wondering whether we can see the arcs (those things like don't* -> you_). Those may help with labeling prompt types. They don't seem to be super helpful in our task, but we might still be able to answer why they are not as helpful here as they are in predicting conversations gone awry.
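
(If it helps, here's a rough sketch of how arcs of that shape can be read off a dependency parse - spaCy here is just my stand-in, not necessarily what the pipeline actually uses:)

```python
import spacy

# Illustrative only: print head -> child dependency arcs, the raw material
# behind motifs like don't* -> you_. spaCy and the model name are stand-ins,
# not the repo's actual extraction code.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Don't you think this should be tested first?")
for tok in doc:
    if tok.head is not tok:  # the root points at itself; skip it
        print(f"{tok.head.text.lower()} -> {tok.text.lower()}")
```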

CaptainEmerson commented 4 years ago

I still have that output, so we can inspect those arcs.

But this line is problematic: `comments_10K = pd.read_csv("src/data/random_sample_10000_prs_body_comments.csv")`

Because we shouldn't store the comments in a CSV; we'd rather read them from standard input.
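
Something like this would do it - a minimal sketch, assuming the comments arrive as CSV-formatted text on standard input:

```python
import sys

import pandas as pd

# Sketch: read the sampled comments from stdin instead of a checked-in CSV,
# so the comment text never has to live on disk.
comments_10K = pd.read_csv(sys.stdin)
```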

sophieball commented 4 years ago

> I still have that output, so we can inspect those arcs.
>
> But this line is problematic: `comments_10K = pd.read_csv("src/data/random_sample_10000_prs_body_comments.csv")`
>
> Because we shouldn't store the comments in a CSV; we'd rather read them from standard input.

OH!! Right... lemme see... how do I pass two data frames from R to Python? (That's why I was doing this prediction in two steps...)

CaptainEmerson commented 4 years ago

You probably can't pass in two, but I could combine the two. The formats of the two are pretty similar, right? Wherever they're not, we could just fill in nulls.
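
Combining them would look roughly like this (a sketch only - the frame and column names here are hypothetical, not from the repo):

```python
import pandas as pd

# Hypothetical sketch: two similarly-shaped frames, e.g. PR comments and
# CL comments, stacked into one stream with a tag recording the origin.
prs_df = pd.DataFrame({"body": ["looks good"], "pr_id": [1]})
cls_df = pd.DataFrame({"body": ["please fix"], "cl_id": [7]})
prs_df["source"] = "prs"
cls_df["source"] = "cls"
# Columns present in only one frame come through as NaN (the "nulls").
combined = pd.concat([prs_df, cls_df], ignore_index=True)
```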

sophieball commented 4 years ago

@CaptainEmerson I added the conversation you gave me as one of the tests. I hope it works now.

sophieball commented 4 years ago

> Before you merge, do you mean to check in all these files? main/pt_model_10K.files/* are all output files, right? Otherwise, the results can go in our shared directory.

I was thinking that if we save those pt_models after running it the first time on the 10K pulled directly from the server, we won't need to run the 10K code again and again in the future. But now I think I can remove them: running the 10K doesn't take too long; the API might change again in the future; and the dump files may contain comments.
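
(For the record, the caching idea would have been something like this - purely a sketch; the path and variable names are hypothetical, and the caveats above apply:)

```python
import pickle

prompt_type_model = {"arcs": ["don't* -> you_"]}  # placeholder for the fitted model

# Persist the fitted prompt-type model once...
with open("main/pt_model_10K.pkl", "wb") as f:
    pickle.dump(prompt_type_model, f)

# ...then reload it in later runs instead of re-running the 10K training.
# Caveats from the thread: an upstream API change can break unpickling, and
# the dump may embed the raw comments themselves.
with open("main/pt_model_10K.pkl", "rb") as f:
    prompt_type_model = pickle.load(f)
```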

I'll remove them before I merge.