sophieball / toxicity-detector

MIT License

Politeness strategies #29

Closed sophieball closed 4 years ago

sophieball commented 4 years ago

The politeness analysis provided by convokit can break down the text into 21 different politeness strategies. In the histograms below, the x-axis shows the politeness strategies (HASNEGATIVE may be the impoliteness strategy) and the y-axis shows, on average within each comment, how many words are labeled with that strategy.

Here's the histogram of politeness strategies among toxic comments: [image: polite_toxic_with_stopwords]

Here's the histogram of politeness strategies among non-toxic comments: [image: polite_non_toxic_with_stopwords]

I've put the words that are marked for each strategy in this doc.
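
For reference, the per-strategy averages on the y-axis could be computed along these lines. This is only a minimal sketch, not the code behind the histograms: it assumes each comment has already been reduced to a dict mapping strategy name to the number of words marked with that strategy, and `mean_strategy_counts` and the sample counts are made up for illustration.

```python
from collections import defaultdict


def mean_strategy_counts(comments):
    """Average number of marked words per comment, for each strategy.

    `comments` is a list of dicts {strategy_name: word_count}; a strategy
    missing from a comment counts as zero for that comment.
    """
    if not comments:
        return {}
    totals = defaultdict(int)
    for counts in comments:
        for strategy, n in counts.items():
            totals[strategy] += n
    return {s: total / len(comments) for s, total in totals.items()}


# Hypothetical per-comment counts (not taken from the real data set).
toxic = [
    {"HASNEGATIVE": 3, "2nd_person": 2},
    {"HASNEGATIVE": 1, "Please": 1},
]
non_toxic = [
    {"Gratitude": 1, "HASPOSITIVE": 2},
    {"Please": 1, "HASPOSITIVE": 1, "2nd_person": 1},
]

toxic_means = mean_strategy_counts(toxic)
non_toxic_means = mean_strategy_counts(non_toxic)
for strategy in sorted(set(toxic_means) | set(non_toxic_means)):
    print(strategy, toxic_means.get(strategy, 0.0), non_toxic_means.get(strategy, 0.0))
```

Dividing by the number of comments (rather than by the number of comments that use a strategy) keeps the two groups comparable even though they differ a lot in size.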

sophieball commented 4 years ago

Comparing these two histograms, I would again question whether we want to remove stop words at all; "you" shouldn't be removed.

bvasiles commented 4 years ago

Do you have the distribution of words per comment across both toxic and non-toxic?

sophieball commented 4 years ago

Do you have the distribution of words per comment across both toxic and non-toxic?

Do you mean the number of words or the number of these words?

bvasiles commented 4 years ago

Num words

sophieball commented 4 years ago

@bvasiles

toxic["words"].describe()
count     111.000000
mean       97.864865
std       194.339229
min         1.000000
25%        23.500000
50%        49.000000
75%        95.000000
max      1649.000000

non-toxic

non_toxic["words"].describe()
count    3420.000000
mean       67.002339
std       251.652176
min         1.000000
25%        16.000000
50%        31.000000
75%        66.250000
max      9489.000000
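
As an illustration of what these summaries measure, here is a small standard-library sketch that produces the same fields as `describe()` for a list of comment strings. It assumes words are counted by splitting on whitespace, which may differ from how the `words` column was actually built; `word_count_summary` and the sample comments are hypothetical.

```python
import statistics


def word_count_summary(comments):
    """Summarize words per comment (whitespace-split) for a list of strings."""
    counts = sorted(len(text.split()) for text in comments)
    # "inclusive" interpolates linearly between the observed values.
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "count": len(counts),
        "mean": statistics.mean(counts),
        "std": statistics.stdev(counts),  # sample standard deviation
        "min": counts[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": counts[-1],
    }


# Made-up comments with 1 to 5 words each.
sample_comments = ["a", "a b", "a b c", "a b c d", "a b c d e"]
summary = word_count_summary(sample_comments)
for field, value in summary.items():
    print(field, value)
```

In both groups above the mean is well above the median, so a few very long comments (up to 9489 words) dominate the mean; the quartiles are the more robust comparison.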

bvasiles commented 4 years ago

Am I reading this correctly? 111 comments labeled toxic, 3420 labeled non-toxic?

sophieball commented 4 years ago

Yes.. @bvasiles
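
With counts this skewed, it may help to keep the majority-class baseline in mind when reading any accuracy figure. A quick sketch using only the two counts reported above:

```python
toxic_count = 111        # count from toxic["words"].describe()
non_toxic_count = 3420   # count from non_toxic["words"].describe()

total = toxic_count + non_toxic_count
toxic_share = toxic_count / total
# Accuracy of a classifier that always predicts "non-toxic".
majority_baseline_accuracy = non_toxic_count / total

print(f"toxic share: {toxic_share:.3f}")
print(f"always-non-toxic accuracy: {majority_baseline_accuracy:.3f}")
```

Roughly 3% of the comments are toxic, so a model that never predicts toxic already scores about 97% accuracy on the full data set.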

sophieball commented 4 years ago

I tried a random forest classifier with max_depth = 2. The outcome is toxic, and the inputs are the significant politeness features from the logistic regression:

              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98       679
         1.0       1.00      0.04      0.07        28

    accuracy                           0.96       707
   macro avg       0.98      0.52      0.52       707
weighted avg       0.96      0.96      0.94       707
bvasiles commented 4 years ago

That seems higher than before?

sophieball commented 4 years ago

Recall for class 1 is low.. but.. yeah.. better
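
To make the low recall concrete, the arithmetic below rebuilds the report from a confusion matrix. The four cell counts are an inference from the rounded numbers in the report (support 28 and 679, precision 1.00, recall 0.04), not output from the actual run.

```python
# One confusion matrix consistent with the rounded report above
# (inferred from the printed numbers, not taken from the real run).
tp, fn = 1, 27    # toxic comments: 28 in the test split
fp, tn = 0, 679   # non-toxic comments: 679 in the test split

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

n = tp + tn + fp + fn
accuracy = (tp + tn) / n
majority_baseline = (tn + fp) / n  # always predict non-toxic

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.4f} majority baseline={majority_baseline:.4f}")
```

Under this reading the model flags a single comment, correctly, and misses the other 27 toxic ones, so its accuracy is only one comment better than always predicting non-toxic.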

sophieball commented 4 years ago

@CaptainEmerson Can you run the R code I pushed in PR #40? It creates politeness features from the data and runs a logistic regression. The model summary and R2 score will be saved in bazel-bin/main/feed_data.runfiles/__main__/politeness_logi.out.

CaptainEmerson commented 4 years ago
(myproject) emersonm@emersonm:~/toxicity-detector$ bazel build //main:politeness_logi 
INFO: Analyzed target //main:politeness_logi (79 packages loaded, 21217 targets configured).
INFO: Found 1 target...
ERROR: /usr/local/google/home/emersonm/.cache/bazel/_bazel_emersonm/944306df6cbfee92be3237efb3eb9146/external/R_Rcpp/BUILD.bazel:10:6: Building R package Rcpp failed (Exit 1) build.sh failed: error executing command external/com_grail_rules_r/R/scripts/build.sh

Use --sandbox_debug to see verbose messages from the sandbox
* installing *source* package ‘Rcpp’ ...
** libs
g++  -I/usr/lib/r-google/include -DNDEBUG -I../inst/include/  -I/usr/local/include -Wno-builtin-macro-redefined -D__DATE__="redacted" -D__TIMESTAMP__="redacted" -D__TIME__="redacted" -fdebug-prefix-map="/usr/local/google/home/emersonm/.cache/bazel/_bazel_emersonm/944306df6cbfee92be3237efb3eb9146/sandbox/linux-sandbox/36/execroot/__main__/="   -fpic  -g -O2  -c Date.cpp -o Date.o
g++  -I/usr/lib/r-google/include -DNDEBUG -I../inst/include/  -I/usr/local/include -Wno-builtin-macro-redefined -D__DATE__="redacted" -D__TIMESTAMP__="redacted" -D__TIME__="redacted" -fdebug-prefix-map="/usr/local/google/home/emersonm/.cache/bazel/_bazel_emersonm/944306df6cbfee92be3237efb3eb9146/sandbox/linux-sandbox/36/execroot/__main__/="   -fpic  -g -O2  -c Module.cpp -o Module.o
g++  -I/usr/lib/r-google/include -DNDEBUG -I../inst/include/  -I/usr/local/include -Wno-builtin-macro-redefined -D__DATE__="redacted" -D__TIMESTAMP__="redacted" -D__TIME__="redacted" -fdebug-prefix-map="/usr/local/google/home/emersonm/.cache/bazel/_bazel_emersonm/944306df6cbfee92be3237efb3eb9146/sandbox/linux-sandbox/36/execroot/__main__/="   -fpic  -g -O2  -c Rcpp_init.cpp -o Rcpp_init.o
g++  -I/usr/lib/r-google/include -DNDEBUG -I../inst/include/  -I/usr/local/include -Wno-builtin-macro-redefined -D__DATE__="redacted" -D__TIMESTAMP__="redacted" -D__TIME__="redacted" -fdebug-prefix-map="/usr/local/google/home/emersonm/.cache/bazel/_bazel_emersonm/944306df6cbfee92be3237efb3eb9146/sandbox/linux-sandbox/36/execroot/__main__/="   -fpic  -g -O2  -c api.cpp -o api.o
g++  -I/usr/lib/r-google/include -DNDEBUG -I../inst/include/  -I/usr/local/include -Wno-builtin-macro-redefined -D__DATE__="redacted" -D__TIMESTAMP__="redacted" -D__TIME__="redacted" -fdebug-prefix-map="/usr/local/google/home/emersonm/.cache/bazel/_bazel_emersonm/944306df6cbfee92be3237efb3eb9146/sandbox/linux-sandbox/36/execroot/__main__/="   -fpic  -g -O2  -c attributes.cpp -o attributes.o
g++  -I/usr/lib/r-google/include -DNDEBUG -I../inst/include/  -I/usr/local/include -Wno-builtin-macro-redefined -D__DATE__="redacted" -D__TIMESTAMP__="redacted" -D__TIME__="redacted" -fdebug-prefix-map="/usr/local/google/home/emersonm/.cache/bazel/_bazel_emersonm/944306df6cbfee92be3237efb3eb9146/sandbox/linux-sandbox/36/execroot/__main__/="   -fpic  -g -O2  -c barrier.cpp -o barrier.o
g++ -shared -L/usr/lib/r-google/lib -L/usr/local/lib -o Rcpp.so Date.o Module.o Rcpp_init.o api.o attributes.o barrier.o -L/usr/lib/r-google/lib -lR
installing to /tmp/bazel/R/lib_Rcpp/Rcpp/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
  converting help for package ‘Rcpp’
    finding HTML links ... done
    CppClass-class                          html  
    CppConstructor-class                    html  
    CppField-class                          html  
    CppFunction-class                       html  
    CppObject-class                         html  
    CppOverloadedMethods-class              html  
    DollarNames-methods                     html  
    Module-class                            html  
    Module                                  html  
    Rcpp-deprecated                         html  
    Rcpp-internal                           html  
    Rcpp-package                            html  
    Rcpp.package.skeleton                   html  
    Rcpp.plugin.maker                       html  
    RcppLdFlags                             html  
    RcppUnitTests                           html  
    compileAttributes                       html  
    compilerCheck                           html  
    cppFunction                             html  
    demangle                                html  
    dependsAttribute                        html  
    evalCpp                                 html  
    exportAttribute                         html  
    exposeClass                             html  
    formals                                 html  
    interfacesAttribute                     html  
    loadModule                              html  
    loadRcppModules-deprecated              html  
    pluginsAttribute                        html  
    populate                                html  
    registerPlugin                          html  
    setRcppClass                            html  
    sourceCpp                               html  
** building package indices
** installing vignettes
** testing if installed package can be loaded
Error: package or namespace load failed for ‘Rcpp’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/tmp/bazel/R/lib_Rcpp/Rcpp/libs/Rcpp.so':
  /usr/lib/r-google/bin/exec/../../crosstool_lib/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /tmp/bazel/R/lib_Rcpp/Rcpp/libs/Rcpp.so)
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/tmp/bazel/R/lib_Rcpp/Rcpp’
Target //main:politeness_logi failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 53.826s, Critical Path: 49.62s
INFO: 19 processes: 19 linux-sandbox.
FAILED: Build did NOT complete successfully
(myproject) emersonm@emersonm:~/toxicity-detector$ 
sophieball commented 4 years ago

@CaptainEmerson This seems to be the same problem: http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2016-March/009148.html

This might be a fix: https://stackoverflow.com/questions/16605623/where-can-i-get-a-copy-of-the-file-libstdc-so-6-0-15

CaptainEmerson commented 4 years ago

The SO fix didn't work. I realized I use a Google-specific version of R that loads packages in a non-standard way, so that's probably not playing nicely with whatever bazel's trying to install.

The solution that seems to work is just to call the R file directly:

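library(readr)  # provides format_csv()

# Call the script directly, passing the data frame as CSV on stdin.
system2("Rscript",
        "~/toxicity-detector/main/politeness_logi.R",
        input = format_csv(df))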

(But now I get the missing numpy dep.)

In any case, if I'm going to do that and travis isn't going to build the R target, then it feels like we might as well not use bazel at all for R. We can discuss tomorrow.
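
If the direct call ends up being made from the Python side instead, a rough equivalent might look like the sketch below. It assumes the R script reads CSV from stdin, as the `input =` argument above implies; `rows_to_csv`, `run_r_script` and the `runner` parameter are hypothetical names introduced here, not part of the repository.

```python
import csv
import io
import os
import subprocess


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def run_r_script(script_path, rows, fieldnames, runner=subprocess.run):
    """Run `Rscript script_path`, feeding the rows as CSV on stdin.

    `runner` is injectable so the call can be exercised without R installed.
    """
    return runner(
        # No shell is involved, so "~" has to be expanded explicitly.
        ["Rscript", os.path.expanduser(script_path)],
        input=rows_to_csv(rows, fieldnames),
        text=True,
        capture_output=True,
        check=False,
    )


# Example (requires Rscript on PATH, so it is left commented out):
# result = run_r_script("~/toxicity-detector/main/politeness_logi.R",
#                       [{"Please": 1, "label": 0}], ["Please", "label"])
# print(result.returncode, result.stderr)
```

Capturing stderr this way would also surface R errors like the ones pasted later in this thread instead of losing them in the build output.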

CaptainEmerson commented 4 years ago

Two things:

Error in family$linkfun(mustart) : 
  Argument mu must be a nonempty numeric vector
Calls: glm -> eval -> eval -> glm.fit -> <Anonymous> -> .Call
In addition: Warning message:
In Ops.factor(Please, 0) : ‘>’ not meaningful for factors
Execution halted
sophieball commented 4 years ago

@CaptainEmerson Can you run the R code in PR #48? The output of politeness word counts from python will be in politeness_features.csv and the logistic regression result will be in politeness_logi.out

CaptainEmerson commented 4 years ago
Error in 1 + length : non-numeric argument to binary operator
Calls: glm ... eval -> <Anonymous> -> model.frame.default -> eval -> eval
Execution halted
sophieball commented 4 years ago

@CaptainEmerson

Error in 1 + length : non-numeric argument to binary operator
Calls: glm ... eval -> <Anonymous> -> model.frame.default -> eval -> eval
Execution halted

In politeness_logi.out, at the beginning of the file, ignoring the Min.... Max. row, there's a table of variable names and their datatypes. What's the datatype of length?

CaptainEmerson commented 4 years ago

politeness_logi.out is empty. :/

CaptainEmerson commented 4 years ago

I thought that perhaps it was the need to add a sink() at the bottom of that R file, but adding one didn't seem to help.

sophieball commented 4 years ago

What's the path to politeness_logi.out? In my case, Bazel puts it in bazel-bin/main/feed_data.runfiles/__main__/politeness_logi.out. bazel-bin/main/feed_data.runfiles/__main__/ is where the feed_data Bazel binary is located.... Hmm. I should just print the location of politeness_logi.out in the program output..
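
One way to do that, sketched under the assumption that the file is written relative to the process's working directory (`describe_output` is a hypothetical helper):

```python
import os


def describe_output(filename="politeness_logi.out"):
    """Return a message with the absolute path the output file resolves to."""
    return f"Logistic regression summary written to {os.path.abspath(filename)}"


print(describe_output())
```

Because a Bazel binary runs inside its runfiles directory, printing the absolute path avoids guessing which politeness_logi.out is the fresh one.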

CaptainEmerson commented 4 years ago

Aha! Great guess -- I was looking at the wrong one. Here's the data:

               X_id              Please        Please_start            HASHEDGE 
          "integer"           "integer"           "integer"           "integer" 
     Indirect_.btw.              Hedges          Factuality           Deference 
          "integer"           "integer"           "integer"           "integer" 
          Gratitude         Apologizing     X1st_person_pl.         X1st_person 
          "integer"           "integer"           "integer"           "integer" 
  X1st_person_start         X2nd_person   X2nd_person_start Indirect_.greeting. 
          "integer"           "integer"           "integer"           "integer" 
    Direct_question        Direct_start         HASPOSITIVE         HASNEGATIVE 
          "integer"           "integer"           "integer"           "integer" 
        SUBJUNCTIVE          INDICATIVE               label 
          "integer"           "integer"           "numeric" 
sophieball commented 4 years ago

Oh.. there's no length.. To follow the practice of using more fine-grained issues, I'll close this one and open a new one for this problem only.
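
A missing column would also explain the error message: when a formula names `length` and the data has no such column, R falls back to the base function `length`, and `1 + length` then fails with "non-numeric argument to binary operator". A small pre-flight check on the CSV header could catch this before R is invoked. The sketch below is an illustration only; `missing_columns` is a hypothetical helper, and the sample header is abbreviated from the listing above.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# Header abbreviated from the variable listing above; `length` is absent.
sample = io.StringIO("X_id,Please,Please_start,HASHEDGE,label\n1,0,0,2,0\n")
missing = missing_columns(sample, ["Please", "length", "label"])
print(missing)
```

Failing early with the list of missing names is easier to act on than the error raised from inside glm.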

sophieball commented 4 years ago

Closing this one. Moving to the new issue #50

CaptainEmerson commented 4 years ago

https://docs.google.com/document/d/1kyd7OmjW368BT4gr6P_MsQZsRP9bGoxW6QG2MgccO6s/edit

sophieball commented 4 years ago

I'm closing this one. The result looks promising. Looking forward to hyperparam tuning results.