sophieball / toxicity-detector

MIT License

Balance Total Comment Length Across Pushback vs. Non-Pushback for Fighting Words Analysis #65

Closed CaptainEmerson closed 4 years ago

CaptainEmerson commented 4 years ago

Per discussion with Carolyn and Sophie, differential n-gram frequency may simply reflect the fact that there's more text in high-pushback reviews. Here, I'll attempt to correct for that by using balanced samples.

I suppose I'll use the MatchIt package.

One threat is that the matched samples will be too small.
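To make the length concern concrete, here is a toy calculation in base R. All numbers are invented for illustration and do not come from the project's data. It shows only that a phrase used at the same per-word rate in both groups still produces very different raw counts when one group simply has more text.

```r
# Invented totals: the pushback group has three times as many words.
pushback_tokens     <- 30000
non_pushback_tokens <- 10000

# Invented counts for one phrase, chosen so the per-word rate is the same
# (0.002 occurrences per word) in both groups.
pushback_count     <- 60
non_pushback_count <- 20

# Raw counts differ by the same factor as the amount of text ...
raw_ratio <- pushback_count / non_pushback_count

# ... while the per-word rates do not differ at all.
rate_ratio <- (pushback_count / pushback_tokens) /
  (non_pushback_count / non_pushback_tokens)

print(c(raw_ratio = raw_ratio, rate_ratio = rate_ratio))
```

Balancing total comment length across the two groups is one way to take this factor out of the comparison.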

CaptainEmerson commented 4 years ago

Here's a sketch of the code I wrote:

library(MatchIt)
library(readr)  # provides format_csv(), used below

# Count the number of words per comment
df$comment_length <- lengths(strsplit(df$text, "\\W+"))

# Pair every labeled CL with an unlabeled CL with exactly the same number of
# words. This is too conservative (e.g., a CL with 50 words won't be
# matched with a CL with 51 words), but I couldn't figure out how to relax
# the matching while maintaining similar mean comment_lengths.
model <- matchit(label ~ comment_length,
                 method = 'nearest',
                 exact = c('comment_length'),
                 data = df)

# Retain only matched samples
df_matched <- df[model$weights > 0, ]

system2("bazel-bin/src/convo_word_freq_diff", input = format_csv(df_matched))
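One possible way to relax the exact-length requirement, sketched here in base R rather than with MatchIt, is to coarsen comment_length into bins and keep equally many labeled and unlabeled rows within each bin. This is not what was run for this issue. The helper name match_within_bins, the bin width of 5 words, and the toy data frame are all assumptions made for illustration; only the label and comment_length column names come from the code above.

```r
# Hypothetical helper: within each length bin, keep equally many rows from
# each label group, downsampling the larger group at random.
match_within_bins <- function(df, bin_width = 5) {
  df$length_bin <- df$comment_length %/% bin_width
  groups <- split(df, df$length_bin)
  matched <- lapply(groups, function(g) {
    labeled   <- g[g$label == 1, , drop = FALSE]
    unlabeled <- g[g$label == 0, , drop = FALSE]
    n <- min(nrow(labeled), nrow(unlabeled))
    if (n == 0) return(NULL)  # no counterpart in this bin, so drop it
    rbind(labeled[sample(nrow(labeled), n), , drop = FALSE],
          unlabeled[sample(nrow(unlabeled), n), , drop = FALSE])
  })
  matched <- Filter(Negate(is.null), matched)
  do.call(rbind, unname(matched))
}

# Toy data (invented): 50 labeled rows with somewhat longer comments and
# 500 unlabeled rows with shorter ones.
set.seed(1)
df <- data.frame(
  label = c(rep(1, 50), rep(0, 500)),
  comment_length = c(rpois(50, 40), rpois(500, 30))
)

df_matched <- match_within_bins(df, bin_width = 5)

# How many rows survive matching, and how close are the group means?
print(table(df_matched$label))
print(tapply(df_matched$comment_length, df_matched$label, mean))
```

Because every retained row has a counterpart in the same bin, the two group means cannot differ by more than the bin width. Wider bins keep more rows at the cost of looser balance, which speaks to the worry that matched samples end up too small.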

I put the output in our shared folder today ("8/11/2020 politeness/fighting"), along with a shared "unmatched" folder that holds the n-gram analysis run without matching (that is, what we were doing before).

Some observations:

sophieball commented 4 years ago

Some other problems I've noticed:

sophieball commented 4 years ago

@CaptainEmerson I put the 3-column layout of the n-gram results in the Aug-11/unmatched folder. There are more non-pushback than pushback n-grams in the top 20.

CaptainEmerson commented 4 years ago

Looking good! Ones like "todo for" (presumably, "...next CL"), "feel free", and "it makes more sense to" are pretty cool. I see there's still some boilerplate for me to remove, like "by the java compiler see caveats" -- I don't recognize it, so it looks automated.

CaptainEmerson commented 4 years ago

Decided not to implement this; changed metrics instead.