prodriguezsosa / conText

An R package for estimating and doing statistical inference on context-specific word embeddings.

Interaction effects in embedding regression? #18

Open ldshuttleworth opened 1 year ago

ldshuttleworth commented 1 year ago

Hi, this is not exactly an issue, so I hope it is okay to post this here. I was wondering whether it is possible to include interaction effects in models using embedding regression. If we have the below example from your quick start guide using party and gender as covariates - is it possible to include interaction effects between gender and party in the model? Thanks in advance!

Warm regards, Luke

# two factor covariates
set.seed(2021L)
model1 <- conText(formula = immigration ~ party + gender,
                  data = toks_nostop_feats,
                  pre_trained = cr_glove_subset,
                  transform = TRUE, transform_matrix = cr_transform,
                  bootstrap = TRUE, num_bootstraps = 100,
                  permute = TRUE, num_permutations = 100,
                  window = 6, case_insensitive = TRUE,
                  verbose = FALSE)

# D-dimensional beta coefficients
# the intercept in this case is the ALC embedding for female Democrats
# beta coefficients can be combined to get each group's ALC embedding
DF_wv <- model1['(Intercept)',] # (D)emocrat - (F)emale 
DM_wv <- model1['(Intercept)',] + model1['gender_M',] # (D)emocrat - (M)ale 
RF_wv <- model1['(Intercept)',] + model1['party_R',]  # (R)epublican - (F)emale 
RM_wv <- model1['(Intercept)',] + model1['party_R',] + model1['gender_M',] # (R)epublican - (M)ale

ArthurSpirling commented 1 year ago

something like:


# tokenize corpus
toks <- tokens(cr_sample_corpus)

# make gender numeric
levels(docvars(toks)$gender) <- c(0,1)
docvars(toks)$gender2 <- as.numeric(as.character(docvars(toks)$gender))

# make party numeric
levels(docvars(toks)$party) <- c(0,1)
docvars(toks)$party2 <- as.numeric(as.character(docvars(toks)$party))

# create simple interaction term
interaction_term <- docvars(toks)$party2*docvars(toks)$gender2
docvars(toks)$party_gender_interaction <- interaction_term

# refit model
model1 <- conText(formula = immigration ~ party + gender + party_gender_interaction,
                  data = toks,
                  pre_trained = cr_glove_subset,
                  transform = TRUE, transform_matrix = cr_transform,
                  bootstrap = TRUE, num_bootstraps = 100,
                  permute = TRUE, num_permutations = 100,
                  window = 6, case_insensitive = TRUE,
                  verbose = FALSE)

? Sorry if I misunderstood.

prodriguezsosa commented 1 year ago

That's right. @ldshuttleworth, it's on our to-do list to implement a similar notation for interactions as in lm. For now you'll have to manually add the interaction to your docvars (if working with a corpus or tokens object) or your dataframe -- as suggested by @ArthurSpirling.
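To make the "manual" route concrete, here is a minimal base-R sketch of what the `lm`-style `party * gender` notation expands to, and why multiplying the two 0/1 indicators gives the same column. The `covs` data frame is a hypothetical stand-in for your docvars, not package data, and nothing here calls conText itself.

```r
# Hypothetical toy covariates standing in for docvars(toks)
covs <- data.frame(
  party  = factor(c("D", "D", "R", "R", "R", "D")),
  gender = factor(c("F", "M", "F", "M", "M", "F"))
)

# What the lm-style `party * gender` notation expands to
# under R's default treatment contrasts
X <- model.matrix(~ party * gender, data = covs)
colnames(X)

# The manual construction: product of the two 0/1 indicators
manual_interaction <- as.numeric(covs$party == "R") * as.numeric(covs$gender == "M")

# The two columns agree row by row
interaction_matches <- all(X[, "partyR:genderM"] == manual_interaction)
interaction_matches
```

So the manually built product column carries the same information as the interaction column that `lm` would create, assuming the reference levels are D and F.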

ldshuttleworth commented 1 year ago

Thanks for the speedy and, as usual, helpful reply! After playing around with this in my own analyses, I have one further question. Should party_gender_interaction be a numeric variable, or a character variable like gender and party in this case? Thanks in advance!

ArthurSpirling commented 1 year ago

@ldshuttleworth -- possibly misunderstanding: do you mean should it be a factor? That's what gender and party are. Anyway, consider running the following code (it will take a little while, because we are doing a large number of simulations):

library(quanteda)
library(conText)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# make gender numeric
levels(docvars(toks)$gender) <- c(0,1)
docvars(toks)$gender2 <- as.numeric(as.character(docvars(toks)$gender))

# make party numeric
levels(docvars(toks)$party) <- c(0,1)
docvars(toks)$party2 <- as.numeric(as.character(docvars(toks)$party))

# create simple interaction term
interaction_term <- docvars(toks)$party2*docvars(toks)$gender2
docvars(toks)$party_gender_interaction <- interaction_term

set.seed(2)

# refit model
model1 <- conText(formula = immigration ~ party + gender + party_gender_interaction,
                  data = toks,
                  pre_trained = cr_glove_subset,
                  transform = TRUE, transform_matrix = cr_transform,
                  bootstrap = TRUE, num_bootstraps = 1000,
                  permute = TRUE, num_permutations = 1000,
                  window = 6, case_insensitive = TRUE,
                  verbose = FALSE)

model2 <- conText(formula = immigration ~ party2 + gender2 + party_gender_interaction,
                  data = toks,
                  pre_trained = cr_glove_subset,
                  transform = TRUE, transform_matrix = cr_transform,
                  bootstrap = TRUE, num_bootstraps = 1000,
                  permute = TRUE, num_permutations = 1000,
                  window = 6, case_insensitive = TRUE,
                  verbose = FALSE)

now, if you ask

 str(docvars(toks)$gender)

it should tell you it is a factor. Whereas,

str(docvars(toks)$gender2)

is numeric

But for a two-level factor (which is what party and gender are) it shouldn't matter whether you tell R to handle it as numeric or not. Similarly, the interaction should be fine as numeric.

So: if you look at the results you get for model1 and model2, they should be essentially identical, up to simulation accuracy. Is that not the case for you?
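The factor-versus-numeric point can be checked outside the package with ordinary least squares. The sketch below is an analogy under stated assumptions, not conText code: `Y` is a simulated 3-dimensional outcome standing in for the embeddings, and `g_factor` / `g_numeric` are hypothetical names for the two codings of the same two-level covariate.

```r
set.seed(1)
n <- 200

# The same two-level covariate, coded as a factor and as 0/1 numeric
g_factor  <- factor(sample(c("F", "M"), n, replace = TRUE))
g_numeric <- as.numeric(g_factor == "M")

# Simulated 3-dimensional outcome (stand-in for embeddings)
Y <- matrix(rnorm(n * 3), ncol = 3) + g_numeric %o% c(0.5, -0.2, 0.1)

# Multivariate OLS with each coding
fit_factor  <- lm(Y ~ g_factor)
fit_numeric <- lm(Y ~ g_numeric)

# Largest absolute difference between the two coefficient matrices
max_diff <- max(abs(unname(coef(fit_factor)) - unname(coef(fit_numeric))))
max_diff
```

Here the two fits differ only by floating-point error. In conText the bootstrap and permutation steps add simulation noise on top, which is why the comparison above is "up to simulation accuracy".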

paride92 commented 3 months ago

Sorry for joining the conversation.

I was wondering, and maybe this is something obvious: once we run the model with an interaction as in the example above, how should we combine the coefficients for the various combinations of the covariates and later obtain nearest neighbours?

DF_wv <- model1['(Intercept)',] # (D)emocrat - (F)emale - Interaction 0
DM_wv <- model1['(Intercept)',] + model1['gender_M',] # (D)emocrat - (M)ale - Interaction 0
RF_wv <- model1['(Intercept)',] + model1['party_R',] # (R)epublican - (F)emale - Interaction 0
RM_wv <- model1['(Intercept)',] + model1['party_R',] + model1['gender_M',] # (R)epublican - (M)ale - Interaction 0

We will basically also have these possible combinations:

DFI_wv <- model1['(Intercept)',] + model1['party_gender_interaction',] # (D)emocrat - (F)emale - Interaction 1

DMI_wv <- model1['(Intercept)',] + model1['gender_M',] + model1['party_gender_interaction',] # (D)emocrat - (M)ale - Interaction 1

RFI_wv <- model1['(Intercept)',] + model1['party_R',] + model1['party_gender_interaction',] # (R)epublican - (F)emale - Interaction 1

RMI_wv <- model1['(Intercept)',] + model1['party_R',] + model1['gender_M',] + model1['party_gender_interaction',] # (R)epublican - (M)ale - Interaction 1

While I can understand how to interpret the coefficient of the interaction, I am confused about what the new combinations of the covariates tell me when running nns.

Thank you for your help and the amazing package

ArthurSpirling commented 3 months ago

Thanks @paride92, but I think this is a bit more of a theory/statistical question than one about the package per se. Specifically, I don't think we did much work on what the interaction would mean for NNs (as distinct from having an interaction in a regression).

In the abstract, my response is basically that NNs are with respect to the way you as a researcher define groups for which you have obtained (average) ALC embeddings. So, whatever groups you think the interactions define, that is the basis on which you have NNs.
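As a rough illustration of "which groups the interaction defines", here is a base-R sketch using plain multivariate OLS on simulated data. It assumes the interaction is built as the product of two 0/1 dummies, as in the code earlier in this thread; `Y` is a made-up 3-dimensional outcome standing in for embeddings, and none of this is conText output. Under that coding the product equals 1 only for Republican males, so only four covariate combinations correspond to observed groups, and the interaction coefficient enters the Republican-male sum alone.

```r
set.seed(1)
n <- 400
party2  <- rbinom(n, 1, 0.5)   # 0 = Democrat, 1 = Republican
gender2 <- rbinom(n, 1, 0.5)   # 0 = female,   1 = male
party_gender_interaction <- party2 * gender2

# The product is 1 only when both dummies are 1 (Republican males)
only_rm <- all(party_gender_interaction == as.numeric(party2 == 1 & gender2 == 1))

# Simulated 3-dimensional outcome (stand-in for embeddings)
Y <- matrix(rnorm(n * 3), ncol = 3) +
  party2 %o% c(1, 0, 0) +
  gender2 %o% c(0, 1, 0) +
  party_gender_interaction %o% c(0, 0, 1)

# Coefficient matrix: one row per term, one column per dimension
B <- coef(lm(Y ~ party2 + gender2 + party_gender_interaction))

# The four groups that actually occur in the data
DF_wv <- B["(Intercept)", ]
DM_wv <- B["(Intercept)", ] + B["gender2", ]
RF_wv <- B["(Intercept)", ] + B["party2", ]
RM_wv <- B["(Intercept)", ] + B["party2", ] + B["gender2", ] +
  B["party_gender_interaction", ]

# In this saturated model each sum reproduces the group's mean outcome
RM_mean <- colMeans(Y[party2 == 1 & gender2 == 1, , drop = FALSE])
rm_diff <- max(abs(unname(RM_wv) - RM_mean))
rm_diff
```

On this reading, a sum such as intercept plus interaction alone (a "Democrat female with interaction 1") does not describe any set of documents, so its nearest neighbours would have no group behind them. Whether that carries over exactly to the package's estimates is an assumption worth checking against your own fitted model.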

Sorry to not be more helpful-- @prodriguezsosa and @bstewart may have deeper thoughts!