Closed CaptainEmerson closed 4 years ago
Here's a sketch of the code I wrote:
library(MatchIt)
# Count the number of words per comment
df$comment_length <- lengths(strsplit(df$text, "\\W+"))
# Pair every labeled CL with an unlabeled CL with exactly the same number of
# words. This is too conservative (e.g., a CL with 50 words won't be
# matched with a CL with 51 words), but I couldn't figure out how to relax
# the matching while maintaining similar mean comment_lengths.
model <- matchit(label ~ comment_length,
                 method = 'nearest',
                 exact = c('comment_length'),
                 data = df)
# Retain only matched samples
df_matched <- df[model$weights > 0, ]
library(readr)  # format_csv() comes from readr
system2("bazel-bin/src/convo_word_freq_diff", input = format_csv(df_matched))
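In case it helps, one way to relax the exact match might be MatchIt's caliper argument. This is a sketch I haven't run, and it assumes MatchIt >= 4.0, where `caliper` accepts a named vector of covariates and `std.caliper = FALSE` reads it in raw units rather than standard deviations:

```r
library(MatchIt)

# Sketch: instead of requiring identical word counts, allow pairs whose
# comment_length differs by at most 2 words. The +/- 2 window is an
# arbitrary choice here, not something I've validated.
model_relaxed <- matchit(label ~ comment_length,
                         method      = 'nearest',
                         distance    = 'glm',
                         caliper     = c(comment_length = 2),
                         std.caliper = FALSE,
                         data        = df)

# Verify that mean comment_length stays balanced after matching
summary(model_relaxed)
```

Wider calipers retain more samples at the cost of looser balance, so it directly trades off against the sample-size concern.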
I put the output in our shared folder today ("8/11/2020 politeness/fighting"), along with an "unmatched" folder that contains the n-gram analysis without matching (i.e., what we were doing before).
Some observations:
Some other problems I've noticed:
fighting_words_freq.csv

In fighting_words_freq.csv, there aren't many n-grams with abs(z-score) >= 1.96 (the 5% significance level).

@CaptainEmerson In the Aug-11/unmatched folder, I put the 3-column layout of the n-gram results. We have more non-pushback than pushback in the top 20.
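For concreteness, the significance filter is just a threshold on |z|. A base-R sketch with toy data; the column names here are made up, not necessarily those in fighting_words_freq.csv:

```r
# Toy stand-in for fighting_words_freq.csv; the ngram/zscore column
# names are assumptions, not the real schema.
freq <- data.frame(ngram  = c("feel free", "todo for", "lgtm"),
                   zscore = c(2.40, -2.10, 0.30))

# Two-sided 5% threshold: keep rows with |z| >= 1.96
significant <- freq[abs(freq$zscore) >= 1.96, ]
nrow(significant)  # -> 2 of the 3 toy n-grams clear the bar
```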
Looking good! Ones like "todo for" (presumably "...next CL"), "feel free", and "it makes more sense to" are pretty cool. I see there's still some boilerplate for me to remove, like "by the java compiler see caveats" -- I don't recognize it, so it looks automated.
Decided not to implement this; changed metrics instead.
Per discussion with Carolyn and Sophie, differential n-gram frequency may simply reflect the fact that there's more text in high-pushback reviews. Here, I'll attempt to correct for this by using balanced samples.
I suppose I'll use the MatchIt package.
One threat is that the matched samples will be too small.
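If it comes to that, the matched sample size is easy to check directly from the matchit fit (a sketch, assuming the `model` object from the code in this thread):

```r
# How many comments survive matching?
sum(model$weights > 0)

# MatchIt's summary() also reports matched vs. unmatched counts per group,
# so we can see how much of each class the exact match throws away.
summary(model)$nn
```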