keyonvafa / tbip

Text-Based Ideal Points
MIT License

Visualization code, and getting topics at varying ideal point values #6

Closed: Pranav-Goel closed this issue 3 years ago

Pranav-Goel commented 3 years ago

Hi Keyon,

Thanks for releasing well-documented code along with the tutorial and the interactive visualization at http://keyonvafa.com/text-based-ideal-points/. Would it be possible for you to share the code you used to generate the interactive visualization? I would like to generate the sliders for the results of the model on my data. If it is not possible to share the code itself, I'd appreciate any pointers to what you used to generate the visualization.

I also wanted to confirm how we can get the topic-word distributions or topics at varying ideal point values. Specifically, looking at this code snippet:

neutral_mean = objective_topic_loc + objective_topic_scale ** 2 / 2
positive_mean = (objective_topic_loc + 
                 ideological_topic_loc + 
                 (objective_topic_scale ** 2 + 
                  ideological_topic_scale ** 2) / 2)
negative_mean = (objective_topic_loc - 
                 ideological_topic_loc +
                 (objective_topic_scale ** 2 + 
                  ideological_topic_scale ** 2) / 2)

Is it accurate to say I can get the topic at ideal point value 0.5 and -0.5 as:

positive_mean_half = (objective_topic_loc + 
                   (0.5*ideological_topic_loc) + 
                   (objective_topic_scale ** 2 + 
                    (0.5*ideological_topic_scale) ** 2) / 2)
negative_mean_half = (objective_topic_loc - 
                   (0.5*ideological_topic_loc) +
                   (objective_topic_scale ** 2 + 
                    (0.5*ideological_topic_scale) ** 2) / 2)

? Or is that the wrong way to go about getting the means at varying ideal point values? Thank you!

keyonvafa commented 3 years ago

Thanks for the note!

I used plotly to make the interactive visualizations. Here's a link to the code: https://gist.github.com/keyonvafa/44fe71723fcbe3a0deba2d8c2de6eaf5

Your code is indeed correct. To see why, note that the ideological topic intensity for topic k and word v at ideal point x is given by E[beta_kv * exp(x * eta_kv)], where the expectation is with respect to the variational distribution. Dropping the indices, we have beta ~ Lognormal(mu_b, sigma_b^2) and eta ~ Normal(mu_e, sigma_e^2).

Since we assume the variational families are independent, we have

E[beta * exp(x * eta)] = E[beta] E[exp(x * eta)].

Since beta is lognormally distributed, E[beta] is the mean of a lognormal distribution:

E[beta] = exp(mu_b + sigma_b^2 / 2).

For the second term, we have that x * eta ~ Normal(x * mu_e, x^2 * sigma_e^2), using the scaling properties of Gaussians. Thus, exp(x * eta) ~ Lognormal(x * mu_e, x^2 * sigma_e^2), so

E[exp(x * eta)] = exp(x * mu_e + x^2 * sigma_e^2 / 2).

Putting this all together,

E[beta * exp(x * eta)] = exp(mu_b + x * mu_e + (sigma_b^2 + x^2 * sigma_e^2) / 2).
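As a sanity check on the derivation, here is a minimal Monte Carlo sketch. The parameter values are made up for illustration (they are not from a fitted model), and the variable names are just placeholders for the symbols above. It samples beta and eta independently and compares the empirical mean of beta * exp(x * eta) to the closed form:

```python
import numpy as np

# Illustrative variational parameters (assumed values, not from a fitted model).
mu_b, sigma_b = 0.3, 0.4    # beta ~ Lognormal(mu_b, sigma_b^2)
mu_e, sigma_e = -0.2, 0.5   # eta  ~ Normal(mu_e, sigma_e^2)
x = 0.5                     # ideal point

rng = np.random.default_rng(0)
n = 1_000_000

# Independent draws, mirroring the independence of the variational families.
beta = rng.lognormal(mean=mu_b, sigma=sigma_b, size=n)
eta = rng.normal(loc=mu_e, scale=sigma_e, size=n)

# Empirical estimate of E[beta * exp(x * eta)].
monte_carlo = np.mean(beta * np.exp(x * eta))

# Closed form from the derivation.
closed_form = np.exp(mu_b + x * mu_e + (sigma_b**2 + x**2 * sigma_e**2) / 2)

print(monte_carlo, closed_form)
```

With this many samples the two numbers should agree to within a fraction of a percent.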

Line 15 of the gist I linked has a function that returns the ideological topics at a given ideal point:

def get_ideological_topics(objective_topic_loc, 
                           objective_topic_scale,
                           ideological_topic_loc, 
                           ideological_topic_scale,
                           ideal_point):
    ideological_topic_mean = np.exp(objective_topic_loc +
                              ideal_point * ideological_topic_loc +
                              (objective_topic_scale ** 2 + 
                               ideal_point ** 2 + 
                               ideological_topic_scale ** 2) / 2)
    return ideological_topic_mean

A small detail: if you only care about the ordering of the words in a topic, you can drop the exponential (as I did in the code you pasted), since exp is monotonic and so the log means rank the words identically. For plotting, however, I chose to keep the exponential since I'm visually comparing the actual topic intensities.
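To illustrate the ordering point, here is a small sketch using random stand-in values for the log topic means (the 3 x 8 shape and the name log_mean are assumptions for the example). The top words come out the same whether or not the exponential is applied:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log topic means: 3 topics, 8 vocabulary words.
log_mean = rng.normal(size=(3, 8))

# Indices of the 5 highest-scoring words per topic, on the log scale...
top_from_log = np.argsort(-log_mean, axis=1)[:, :5]

# ...and on the intensity scale after exponentiating.
top_from_exp = np.argsort(-np.exp(log_mean), axis=1)[:, :5]

print(np.array_equal(top_from_log, top_from_exp))
```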

I hope this is helpful! Let me know if you have any more questions.

Pranav-Goel commented 3 years ago

Thanks for the prompt reply, the code, and the wonderful explanation! The explanation makes sense and the math checks out, so the function at line 15 of the gist should give the ideological topics at a given ideal point. It also makes sense that the exponent can be removed if we only care about the ordering of the words in a topic.

However, there seems to be a discrepancy with the analysis code (if I am understanding correctly). Specifically, my code and the code I pasted from https://github.com/keyonvafa/tbip/blob/master/analysis/analysis_utils.py#L214 seem to differ from the gist as follows. If I substitute ideal_point = 0.0 into np.exp(objective_topic_loc + ideal_point * ideological_topic_loc + (objective_topic_scale ** 2 + ideal_point ** 2 + ideological_topic_scale ** 2) / 2), then we get:

np.exp(objective_topic_loc + (objective_topic_scale ** 2 + ideological_topic_scale ** 2) / 2)

however, as per https://github.com/keyonvafa/tbip/blob/master/analysis/analysis_utils.py#L214 and assuming neutral topic does mean ideal_point = 0.0, the code says:

neutral_mean = objective_topic_loc + objective_topic_scale ** 2 / 2

Am I missing something here? Similarly, it would seem positive_mean and negative_mean should have (objective_topic_scale ** 2 + 1 + ideological_topic_scale ** 2) / 2, although that addition of 1 will not impact the ordering of the words in a topic. For the neutral_mean, was the (ideological_topic_scale ** 2) term intentionally removed because it would also not impact the ordering?

This is only a minor clarification. With your code and your derivation, I am clear on how to get topics at varying ideal points, and once again, thank you for providing the visualization code, I really appreciate it!

keyonvafa commented 3 years ago

Ah thanks for catching this. The analysis code (https://github.com/keyonvafa/tbip/blob/master/analysis/analysis_utils.py#L214) is actually correct but the code in the gist to make the interactive figures was wrong -- it adds the squared ideal point to the scale rather than multiplying it. The math earlier shows that the expected ideological topic is

exp(mu_b + x * mu_e + (sigma_b^2 + x^2 * sigma_e^2) / 2),

but the code I sent was

np.exp(objective_topic_loc +
       ideal_point * ideological_topic_loc +
       (objective_topic_scale ** 2 + 
       ideal_point ** 2 + 
       ideological_topic_scale ** 2) / 2).

It should actually be

np.exp(objective_topic_loc +
       ideal_point * ideological_topic_loc + 
       (objective_topic_scale ** 2 + 
       ideal_point ** 2 * 
       ideological_topic_scale ** 2) / 2).

So now passing an ideal point of 0 results in the following log expectation:

objective_topic_loc + (objective_topic_scale ** 2 / 2),

matching the expression for the neutral topic mean. I went ahead and updated the interactive figures and the gist I sent. Fortunately the learned ideological topic scales in practice don't vary as much as the ideological topic locations, so the results are very similar.
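For anyone who wants to check this, here is a small sketch of the corrected function together with a consistency check against the neutral, positive, and negative means from the analysis code. The input arrays are random stand-ins (their shapes and values are assumptions, not fitted TBIP parameters):

```python
import numpy as np


def get_ideological_topics(objective_topic_loc,
                           objective_topic_scale,
                           ideological_topic_loc,
                           ideological_topic_scale,
                           ideal_point):
    """Expected topic intensity at a given ideal point (corrected form)."""
    # The squared ideal point multiplies the ideological variance.
    return np.exp(objective_topic_loc +
                  ideal_point * ideological_topic_loc +
                  (objective_topic_scale ** 2 +
                   ideal_point ** 2 * ideological_topic_scale ** 2) / 2)


# Random stand-in parameters: 2 topics, 5 words (assumed shapes).
rng = np.random.default_rng(0)
shape = (2, 5)
obj_loc = rng.normal(size=shape)
obj_scale = rng.uniform(0.1, 1.0, size=shape)
ideo_loc = rng.normal(size=shape)
ideo_scale = rng.uniform(0.1, 1.0, size=shape)

# Log means as written in the analysis code.
neutral_mean = obj_loc + obj_scale ** 2 / 2
positive_mean = obj_loc + ideo_loc + (obj_scale ** 2 + ideo_scale ** 2) / 2
negative_mean = obj_loc - ideo_loc + (obj_scale ** 2 + ideo_scale ** 2) / 2

# Log of the corrected function at ideal points -1, 0, and 1.
log_at = {
    x: np.log(get_ideological_topics(obj_loc, obj_scale,
                                     ideo_loc, ideo_scale, x))
    for x in (-1.0, 0.0, 1.0)
}

print(np.allclose(log_at[0.0], neutral_mean),
      np.allclose(log_at[1.0], positive_mean),
      np.allclose(log_at[-1.0], negative_mean))
```

Passing 0.5 or -0.5 as the ideal point gives the same values as the positive_mean_half and negative_mean_half expressions from the original question, after taking the log.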

Pranav-Goel commented 3 years ago

Ah, yes, I mistook a multiplication sign for addition...thanks for clarifying this in full. Glad that everything is figured out now and the code corrected where it needed to be. Thanks for providing the visualization code again!