Closed sophieball closed 4 years ago
Comparing these two histograms, I would argue again that if we want to remove stop words at all, these words shouldn't be the ones removed.
Do you have the distribution of words per comment across both toxic and non-toxic?
Do you mean the number of words, or the number of these words?
Num words
@bvasiles
toxic["words"].describe()
count     111.000000
mean       97.864865
std       194.339229
min         1.000000
25%        23.500000
50%        49.000000
75%        95.000000
max      1649.000000
non-toxic
non_toxic["words"].describe()
count    3420.000000
mean       67.002339
std       251.652176
min         1.000000
25%        16.000000
50%        31.000000
75%        66.250000
max      9489.000000
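For reference, the stats above can be reproduced with a pandas sketch like this (the column names `text` and `label` are assumptions for illustration, not from the repo):

```python
import pandas as pd

# Toy data standing in for the real comment corpus.
df = pd.DataFrame({
    "text": ["you are terrible", "thanks for the patch", "lgtm"],
    "label": [1, 0, 0],  # 1 = toxic, 0 = non-toxic
})

# Words per comment: split on whitespace and count tokens.
df["words"] = df["text"].str.split().str.len()

toxic = df[df["label"] == 1]
non_toxic = df[df["label"] == 0]
print(toxic["words"].describe())
print(non_toxic["words"].describe())
```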
Am I reading this correctly? 111 comments labeled toxic, 3420 labeled non-toxic?
Yes.. @bvasiles
I tried a random forest classifier with max_depth = 2; the outcome is toxic, and the inputs are the significant politeness features from the logistic regression:
precision recall f1-score support
0.0 0.96 1.00 0.98 679
1.0 1.00 0.04 0.07 28
accuracy 0.96 707
macro avg 0.98 0.52 0.52 707
weighted avg 0.96 0.96 0.94 707
That seems higher than before?
Class-1 recall is low.. but.. yeah, better.
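The report's numbers follow directly from the confusion counts. Assuming (inferred from the report, not from the actual run) 1 true positive, 27 false negatives, and 0 false positives for the toxic class, with all 679 non-toxic comments classified correctly:

```python
# Reconstruct the classification-report figures from assumed confusion
# counts for the toxic (class 1) side. These counts are inferred to match
# the report, not taken from the actual run.
tp, fn, fp, tn = 1, 27, 0, 679

precision_1 = tp / (tp + fp)                                   # 1.00
recall_1 = tp / (tp + fn)                                      # 1/28 ~ 0.04
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # 2/29 ~ 0.07
accuracy = (tp + tn) / (tp + fn + fp + tn)                     # 680/707 ~ 0.96

print(round(precision_1, 2), round(recall_1, 2),
      round(f1_1, 2), round(accuracy, 2))  # 1.0 0.04 0.07 0.96
```

So the classifier predicts "toxic" almost never: precision is perfect only because it made a single positive prediction, and it missed 27 of the 28 toxic comments.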
@CaptainEmerson Can you run the R code I pushed in PR40? It will create politeness features from the data and run a logistic regression. The model summary and R2 score will be saved in bazel-bin/main/feed_data.runfiles/__main__/politeness_logi.out.
(myproject) emersonm@emersonm:~/toxicity-detector$ bazel build //main:politeness_logi
INFO: Analyzed target //main:politeness_logi (79 packages loaded, 21217 targets configured).
INFO: Found 1 target...
ERROR: /usr/local/google/home/emersonm/.cache/bazel/_bazel_emersonm/944306df6cbfee92be3237efb3eb9146/external/R_Rcpp/BUILD.bazel:10:6: Building R package Rcpp failed (Exit 1) build.sh failed: error executing command external/com_grail_rules_r/R/scripts/build.sh
Use --sandbox_debug to see verbose messages from the sandbox
* installing *source* package ‘Rcpp’ ...
** libs
g++ -I/usr/lib/r-google/include -DNDEBUG -I../inst/include/ -I/usr/local/include -Wno-builtin-macro-redefined -D__DATE__="redacted" -D__TIMESTAMP__="redacted" -D__TIME__="redacted" -fdebug-prefix-map="/usr/local/google/home/emersonm/.cache/bazel/_bazel_emersonm/944306df6cbfee92be3237efb3eb9146/sandbox/linux-sandbox/36/execroot/__main__/=" -fpic -g -O2 -c Date.cpp -o Date.o
[identical g++ invocations for Module.cpp, Rcpp_init.cpp, api.cpp, attributes.cpp, and barrier.cpp elided]
g++ -shared -L/usr/lib/r-google/lib -L/usr/local/lib -o Rcpp.so Date.o Module.o Rcpp_init.o api.o attributes.o barrier.o -L/usr/lib/r-google/lib -lR
installing to /tmp/bazel/R/lib_Rcpp/Rcpp/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
converting help for package ‘Rcpp’
finding HTML links ... done
[help index entries from CppClass-class through sourceCpp elided]
** building package indices
** installing vignettes
** testing if installed package can be loaded
Error: package or namespace load failed for ‘Rcpp’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/tmp/bazel/R/lib_Rcpp/Rcpp/libs/Rcpp.so':
/usr/lib/r-google/bin/exec/../../crosstool_lib/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /tmp/bazel/R/lib_Rcpp/Rcpp/libs/Rcpp.so)
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/tmp/bazel/R/lib_Rcpp/Rcpp’
Target //main:politeness_logi failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 53.826s, Critical Path: 49.62s
INFO: 19 processes: 19 linux-sandbox.
FAILED: Build did NOT complete successfully
(myproject) emersonm@emersonm:~/toxicity-detector$
@CaptainEmerson This seems to be the same problem: http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2016-March/009148.html This might be a fix: https://stackoverflow.com/questions/16605623/where-can-i-get-a-copy-of-the-file-libstdc-so-6-0-15
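One way to confirm the mismatch is to list which GLIBCXX symbol versions the libstdc++ that R actually loads provides (path taken verbatim from the error above; adjust for your setup):

```shell
# List the GLIBCXX symbol versions exported by the libstdc++ R loads.
# If GLIBCXX_3.4.26 is absent, Rcpp.so was compiled against a newer
# toolchain than the one R's bundled runtime ships with.
strings /usr/lib/r-google/bin/exec/../../crosstool_lib/libstdc++.so.6 \
  | grep GLIBCXX | sort -u
```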
The SO fix didn't work. I realized I use a Google-specific version of R that loads packages in a non-standard way, so that's probably not playing nicely with whatever bazel's trying to install.
The solution that seems to work is just to call the R file directly:
system2("Rscript",
        "~/toxicity-detector/main/politeness_logi.R",
        input = format_csv(df))  # format_csv() comes from readr
(But now I get the missing numpy dep.)
In any case, if I'm going to do that and travis isn't going to build the R target, then it feels like we might as well not use bazel at all for R. We can discuss tomorrow.
Two things:
Error in family$linkfun(mustart) :
Argument mu must be a nonempty numeric vector
Calls: glm -> eval -> eval -> glm.fit -> <Anonymous> -> .Call
In addition: Warning message:
In Ops.factor(Please, 0) : ‘>’ not meaningful for factors
Execution halted
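The `Ops.factor` warning means the politeness columns were read in as factors (categorical strings), so `Please > 0` is meaningless to R. The pandas analogue of the failure and the fix, as a sketch:

```python
import pandas as pd

# Counts read in as strings break numeric comparison, just like
# comparing an R factor with `>`. Coercing to numeric first mirrors
# R's as.numeric(as.character(x)) idiom.
df = pd.DataFrame({"Please": ["0", "2", "1"]})
df["Please"] = pd.to_numeric(df["Please"], errors="coerce")
print((df["Please"] > 0).tolist())  # [False, True, True]
```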
@CaptainEmerson Can you run the R code in PR #48? The output of politeness word counts from Python will be in politeness_features.csv and the logistic regression result will be in politeness_logi.out.
Error in 1 + length : non-numeric argument to binary operator
Calls: glm ... eval -> <Anonymous> -> model.frame.default -> eval -> eval
Execution halted
@CaptainEmerson
In politeness_logi.out, at the beginning of the file, ignoring the Min. ... Max. row, there's a table of variable names and their datatypes. What's the datatype of length?
politeness_logi.out is empty. :/
I thought that perhaps it was the need to add a sink() at the bottom of that R file, but adding one didn't seem to help.
What's the path to politeness_logi.out? In my case, Bazel puts it in bazel-bin/main/feed_data.runfiles/__main__/politeness_logi.out. bazel-bin/main/feed_data.runfiles/__main__/ is where the feed_data bazel binary is located... Hmm. I should just print politeness_logi.out in the program output..
Aha! Great guess -- I was looking at the wrong one. Here's the data:
X_id Please Please_start HASHEDGE
"integer" "integer" "integer" "integer"
Indirect_.btw. Hedges Factuality Deference
"integer" "integer" "integer" "integer"
Gratitude Apologizing X1st_person_pl. X1st_person
"integer" "integer" "integer" "integer"
X1st_person_start X2nd_person X2nd_person_start Indirect_.greeting.
"integer" "integer" "integer" "integer"
Direct_question Direct_start HASPOSITIVE HASNEGATIVE
"integer" "integer" "integer" "integer"
SUBJUNCTIVE INDICATIVE label
"integer" "integer" "numeric"
Oh.. there's no length.. To follow the practice of using more fine-grained issues, I'll close this one and open a new one for this problem only.
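A quick pre-flight check like the following (a hypothetical Python sketch; the column list is illustrative) would have caught the missing `length` column before glm died with "non-numeric argument to binary operator":

```python
import pandas as pd

def check_features(df, required):
    """Report columns that are missing or non-numeric before model fitting."""
    problems = []
    for col in required:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"non-numeric column: {col}")
    return problems

# Toy frame mirroring the datatype table above: no "length" column.
df = pd.DataFrame({"Please": [0, 1], "label": [0.0, 1.0]})
print(check_features(df, ["Please", "length", "label"]))
# ['missing column: length']
```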
Closing this one. Moving to the new issue #50
I'm closing this one. The result looks promising. Looking forward to hyperparam tuning results.
The politeness analysis provided by convokit breaks the text down into 21 different politeness strategies. In the histograms below, the x-axis shows the different politeness strategies (HASNEGATIVE may be the impoliteness strategy) and the y-axis shows, on average, how many words labeled with that strategy are found in each comment.
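The aggregation behind these histograms can be sketched as follows, assuming a DataFrame with one row per comment and one column per strategy's word count (the column names here are examples, not the full 21):

```python
import pandas as pd

# Each row is a comment; each column is the number of words in that
# comment matching a politeness strategy.
df = pd.DataFrame({
    "HASNEGATIVE": [2, 0, 1],
    "Gratitude":   [0, 1, 0],
})

# Average strategy count per comment: the y-axis of the histograms.
means = df.mean()
print(means.to_dict())
# means.plot.bar() would then draw the per-strategy chart
```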
Here's the histogram of politeness strategies among toxic comments:
Here's the histogram of politeness strategies among non-toxic comments:
I've put the words that are marked for each strategy in this doc.