Closed CaptainEmerson closed 4 years ago
Here's a sketch of the code I wrote:
library(MatchIt)
# Count the number of words per comment
df$comment_length <- lengths(strsplit(df$text, "\\W+"))
# Pair every labeled CL with an unlabeled CL with exactly the same number of
# words. This is too conservative (e.g., a CL with 50 words won't be
# matched with a CL with 51 words), but I couldn't figure out how to relax
# the matching while maintaining similar mean comment_lengths.
model <- matchit(label ~ comment_length,
                 method = 'nearest',
                 exact = c('comment_length'),
                 data = df)
# Retain only matched samples
df_matched <- df[model$weights > 0, ]
library(readr)  # format_csv() comes from readr
system2("bazel-bin/src/convo_word_freq_diff", input = format_csv(df_matched))
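In case it helps, one way to relax the exact match might be MatchIt's caliper argument. This is a sketch I haven't run, and it assumes MatchIt >= 4.0, where `caliper` accepts a named vector of covariates and `std.caliper = FALSE` reads it in raw units rather than standard deviations:

```r
library(MatchIt)

# Sketch: instead of requiring identical word counts, allow pairs whose
# comment_length differs by at most 2 words. The +/- 2 window is an
# arbitrary choice here, not something I've validated.
model_relaxed <- matchit(label ~ comment_length,
                         method      = 'nearest',
                         distance    = 'glm',
                         caliper     = c(comment_length = 2),
                         std.caliper = FALSE,
                         data        = df)

# Verify that mean comment_length stays balanced after matching
summary(model_relaxed)
```

Wider calipers retain more samples at the cost of looser balance, so it directly trades off against the sample-size concern.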
I put the output in our shared folder today ("8/11/2020 politeness/fighting"), along with an "unmatched" folder that contains the n-gram analysis without matching (i.e., what we were doing before).
Some observations:
Some other problems I've noticed:
fighting_words_freq.csv

In fighting_words_freq.csv, there aren't many n-grams with abs(z-score) >= 1.96 (the 5% significance level).

@CaptainEmerson In the Aug-11/unmatched folder, I put the 3-column layout of the n-gram results. We have more non-pushback than pushback in the top 20.
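For concreteness, the significance filter is just a threshold on |z|. A base-R sketch with toy data; the column names here are made up, not necessarily those in fighting_words_freq.csv:

```r
# Toy stand-in for fighting_words_freq.csv; the ngram/zscore column
# names are assumptions, not the real schema.
freq <- data.frame(ngram  = c("feel free", "todo for", "lgtm"),
                   zscore = c(2.40, -2.10, 0.30))

# Two-sided 5% threshold: keep rows with |z| >= 1.96
significant <- freq[abs(freq$zscore) >= 1.96, ]
nrow(significant)  # -> 2 of the 3 toy n-grams clear the bar
```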
Looking good! Ones like "todo for" (presumably "...next CL"), "feel free", and "it makes more sense to" are pretty cool. I see there's still some boilerplate for me to remove, like "by the java compiler see caveats" -- I don't recognize it, so it looks automated.
Decided not to implement this; changed metrics instead.
Per discussion with Carolyn and Sophie, differential n-gram frequency may simply reflect the fact that there's more text in high-pushback reviews. Here, I'll attempt to correct for this by using balanced samples.
I suppose I'll use the MatchIt package.
One threat is that the matched samples will be too small.
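If it comes to that, the matched sample size is easy to check directly from the matchit fit (a sketch, assuming the `model` object from the code in this thread):

```r
# How many comments survive matching?
sum(model$weights > 0)

# MatchIt's summary() also reports matched vs. unmatched counts per group,
# so we can see how much of each class the exact match throws away.
summary(model)$nn
```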