uchicago-computation-workshop / Winter2020

Repository for the Winter 2020 Computational Social Science Workshop

02/27: Ferrari #6

Open jmausolf opened 4 years ago

jmausolf commented 4 years ago

Comment below with questions or thoughts about the reading for this week's workshop.

Please make your comments by Wednesday 11:59 PM, and upvote at least five of your peers' comments on Thursday prior to the workshop. You need to use 'thumbs-up' for your reactions to count towards 'top comments,' but you can use other emojis on top of the thumbs up.

liu431 commented 4 years ago

Thank you for the talk. You mentioned that the hdpGLM model is important in political analyses. However, a political analyst or research assistant probably won't have time and technical background to fully understand the paper and code the model. So what's your idea for making the hdpGLM model available to other social science researchers? Do you think it could be an extension in the glm package one day?

nwrim commented 4 years ago

Welcome (back) to our workshop! I am looking forward to seeing you talk about a topic in the realm of Social Science, outside CAPP 30121-2. (But I note that I really like CAPP 30121-2!)

I think that accounting for latent heterogeneity in the data is really crucial, especially within social science (as you mentioned in the introduction). Although it does share the limitations of GLM, I think it offers a way to find and address heterogeneous effects across subpopulations.

One question I have is about the interpretability of clusters from the hdpGLM. As far as I understand (and correct me if I am wrong), hdpGLM can detect latent heterogeneity and form clusters based on it, but it does not really explain what makes the clusters different. For example, in the toy example from the Monte Carlo simulation, $X_1$ is income (where $\beta_1$ differs across clusters) and the latent variable is how the individual obtained the economic reward. But in the real world, we would know that the parameters of a variable (income), and therefore the estimated effect of that variable, differ across clusters, but we would not know the latent variable that caused the difference (how individuals got the income). Are there any ways around this? I think this is a common problem for almost all clustering algorithms, but I am curious about your perspective as the proposer of the model.

Another thing I am curious about, quite unrelated to the topic of the article itself, is how you selected the articles used in section 6 (Empirical Application). Did you choose studies where you thought the method would make a difference? Or studies that you considered very general in the field? I might start a project proposing a new method soon, so I would like to hear your insight as a researcher!

I apologize in advance if my lack of knowledge in mathematics, statistics, and/or modeling caused me to ask a nonsensical question. Thank you very much for presenting again!

ziwnchen commented 4 years ago

This paper proposed an interesting model that extends the ability to detect latent heterogeneity of covariates in GLM models. My question relates to the application field. Since hdpGLM provides more information on covariates, is it true that we might want to use hdpGLM instead of GLM? How could we explain the value distribution of the covariates if they show certain patterns (e.g., bimodal pattern)? Is there any practical difference (e.g., computation, data size) in terms of GLM and hdpGLM?

bjcliang-uchi commented 4 years ago

Thank you for the presentation! I am wondering how to interpret the clusters generated by the MCMC sampling method. How often can we tell the empirical meaning of the clusters we obtain from the Xi's, so that different clusters represent different groups observed in reality (say, countries, identities, etc.)? Or is it more like an unsupervised clustering method, but based on latent effects rather than the direct features of the Xi's? If the latter, how can we make the clusters, and the existence of Simpson's paradox, convincing and helpful for further research, especially when the number of clusters itself can also be arbitrary?

hesongrun commented 4 years ago

Thanks a lot for your presentation. It is great to harness the power of Bayesian statistics to understand heterogeneous latent effects. However, is such complexity necessary? What is your motivation for using such a complex model instead of simply introducing some sort of country-level fixed effect into the analysis? Thanks!

wanitchayap commented 4 years ago

I am SO EXCITED to see you as our speaker!!! Really looking forward to this week's workshop :)

From table 3, it seems that as the true k increases, the accuracy of estimated k decreases and the maximum estimated k is increasingly over-estimated. Are you expecting the same trend to continue for true k > 10? Would you recommend this model to be used in fields like biology or genetics where true k could be very high (and where over-estimated k could be harmful)?

tonofshell commented 4 years ago

To add to what @liu431 was asking, I've noticed that you've already started development on an R package to use hdpGLM (https://github.com/DiogoFerrari/hdpGLM). What were some of your most significant challenges with implementing the math behind hdpGLM as an R package? What advice would you give those of us looking to turn our own big ideas into software packages?

zeyuxu1997 commented 4 years ago

Thanks for presenting. Computational feasibility is a significant problem when doing empirical research in the era of big data. In your paper, you have shown us the advantages of hdpGLM. However, could you please introduce what "price" we should pay to get such advantages? What's the impact of the new model on the computational difficulty? When should we choose the second best dpGLM or traditional GLM instead of the more accurate hdpGLM?

linghui-wu commented 4 years ago

Great to see you outside of the CAPP class, Dr. Ferrari.

My question follows that of @hesongrun. In section 3, the paper discusses the relationship between hdpGLM and classical models in measuring heterogeneity and concludes that “the hdpGLM can be viewed as a generalization of the other models.” Specifically, it mentions that when $Z_{i}$, the design variable indicating the group of i, cannot be observed, the hdpGLM gains an advantage over classical mixed models for group heterogeneity. On this point, I wonder whether there is a commonly used criterion to determine whether this variable can be observed, so that we can choose the appropriate model for further analysis?

romanticmonkey commented 4 years ago

Thanks for presenting this week, professor! Accounting for another layer of social heterogeneity is always a challenge in predictive models in the social sciences. But I am interested in an even more abstract level of homogeneity -- cultural homogeneity. Do you have any experience in dealing with cultural differences in, perhaps, a voting behavior model? Would you mind sharing how you have quantified cultural homogeneity into actionable parameters?

adarshmathew commented 4 years ago

Thank you Prof. Ferrari for your presentation. This method is a fascinating blend of Bayesian tools with the more classical (and frequentist) GLM literature.

I have two questions, and forgive me if they're trivial or lack rigour:

  1. I'm curious about how your definition of contexts includes time and the challenges that brings. The latent context associated with time period T1 might have a significant causal bearing on that of T2 -- attitudes towards welfare could be negative in 2010 because of a disastrous implementation in 1990. How does your method account for relationships between latent context variables? This could also extend to the dependencies between different geographic entities too (eg: similar attitudes on the role of the State in different South Asian countries owing to a shared colonial history).

  2. In trying to deal with latent heterogeneity, a crude and possibly unsound workaround in my head was to identify clusters a priori using a finite mixture model (like GMMs) and then apply GLMs to each individual cluster. What substantive improvements does your method bring compared to this crude approach?
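
A minimal sketch of the two-stage workaround described in point 2, not of hdpGLM itself. It assumes a hypothetical observed feature `w` that happens to separate the groups, a hand-rolled two-component 1-D Gaussian mixture fit by EM (no numerical hardening), and an identity-link Gaussian GLM, which reduces to ordinary least squares, within each cluster. All names and numbers are made up for illustration.

```python
import math


def fit_gmm_1d(values, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM; return hard labels (0 or 1)."""
    means = [min(values), max(values)]  # start the two means at the extremes
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1
        resp = []
        for v in values:
            dens = [
                weights[k]
                * math.exp(-((v - means[k]) ** 2) / (2 * variances[k]))
                / math.sqrt(2 * math.pi * variances[k])
                for k in range(2)
            ]
            resp.append(dens[1] / (dens[0] + dens[1]))
        # M-step: update weights, means, and variances from the responsibilities
        for k in range(2):
            r = [p if k == 1 else 1.0 - p for p in resp]
            nk = sum(r)
            means[k] = sum(ri * v for ri, v in zip(r, values)) / nk
            var = sum(ri * (v - means[k]) ** 2 for ri, v in zip(r, values)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
            weights[k] = nk / len(values)
    return [1 if p > 0.5 else 0 for p in resp]


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: two groups with opposite effects of x on y.
w = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]  # feature used for clustering
x = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
y = [2, 3, 4, 5, 6, 2, 1, 0, -1, -2]  # y = 2 + x in one group, y = 2 - x in the other

# Stage 1: cluster a priori on w. Stage 2: fit a separate regression per cluster.
labels = fit_gmm_1d(w)
slopes = {}
for k in (0, 1):
    xs = [xi for xi, lab in zip(x, labels) if lab == k]
    ys = [yi for yi, lab in zip(y, labels) if lab == k]
    slopes[k] = ols_slope(xs, ys)

pooled_slope = ols_slope(x, y)
print("pooled slope:", pooled_slope)
print("per-cluster slopes:", slopes)
```

In this toy data the pooled regression finds no effect while the per-cluster fits find opposite effects. The two-stage version only works here because `w` is observed and separates the groups; if the groups had the same covariate distribution and differed only in their coefficients, clustering on covariates alone could not recover them.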

nswxin commented 4 years ago

Looking forward to learning more about your model tomorrow! To be honest, I have difficulty understanding some abstract terms and concepts, for example, 'individual-level covariates' and 'Simpson's paradox'. I hope you can use some examples tomorrow to explain these terms.
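
As a small illustration of the two terms, here is a toy sketch with made-up numbers, not data from the paper. The hypothetical `income` values are an individual-level covariate (one value per respondent), and the two groups stand in for contexts. Within each group the slope of `support` on `income` is positive, yet the pooled slope is negative, which is the pattern usually called Simpson's paradox.

```python
def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical respondents: (income, support) pairs in two contexts.
income_a, support_a = [1, 2, 3, 4], [11, 12, 13, 14]  # low income, high support
income_b, support_b = [6, 7, 8, 9], [6, 7, 8, 9]      # high income, low support

slope_a = ols_slope(income_a, support_a)  # effect of income within group A
slope_b = ols_slope(income_b, support_b)  # effect of income within group B
slope_pooled = ols_slope(income_a + income_b, support_a + support_b)

print("within A:", slope_a)
print("within B:", slope_b)
print("pooled:  ", slope_pooled)
```

The sign flips because the groups differ in both their average income and their average support, so ignoring group membership mixes the between-group difference into the estimated effect.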

lulululugagaga commented 4 years ago

I'm excited about this workshop :) This is a really interesting methods paper that gives details of each model. My question is therefore on the application side. Do you think there are restrictions on the data (scale, type, range, etc.) when we follow your conclusions about which models to use?

KenChenCompEcon commented 4 years ago

Thanks in advance for presenting such intriguing material! I am very excited to see how country-level characteristics can form latent heterogeneity and help to explain polarization in political ideology. But I am also wondering about possible confounding effects of the characteristic variables chosen, and about key factors that may be missing due to the data collection process.

ShuyanHuang commented 4 years ago

Thank you for presenting! Have you studied the identification of your model? It seems non-trivial to separate heterogeneous marginal effects from pure population noise, especially when the dimension of the latent states is high.

vinsonyz commented 4 years ago

Thank you for your presentation! It's really amazing to have the hdpGLM model to estimate heterogeneous effects. I wonder how this model could be applied to economics research.

SoyBison commented 4 years ago

Thank you for your talk Diogo!

I wanted to ask if you think that using a statistical learning model like yours can be supplemented by making different assumptions about the population distribution. For example if you add in the assumption that some feature is n-modal, can the model leverage this for better performance on the other features? To further extend this, if we have reason to believe theoretically that there is some true k, can this model be used to verify or falsify that hypothesis?

Leahjl commented 4 years ago

Thank you in advance for the presentation! It is exciting to have you talk about latent heterogeneity in political research. I'm curious about the hdpGLM and GLM models you mentioned in the paper: when should we choose one over the other?

timqzhang commented 4 years ago

Thank you for your presentation. This is a really technical issue in fields like empirical research. I wonder if there is a cost-benefit analysis for this new hdpGLM method, i.e., when implementing this method, how much work must researchers put in, and how does that compare with the benefit to their research relative to the original GLM? I understand that it would be a good methodological improvement over GLM, but it will take time for frontline researchers to test whether it is suitable for wide adoption.

di-Tong commented 4 years ago

Thank you for sharing this piece! Could you give us more examples on what kind of new research questions can be answered in different disciplines with the help of this method?

WMhYang commented 4 years ago

Thank you in advance for your presentation. My first question is based on my intuition (which may not make any sense). Given that hdpGLM yields richer results than GLM, does it mean that we need more data to make the estimation possible? My second question is about the empirical application part. Like @nwrim, I am also curious about how you chose the two papers. Are there any signs in these papers or the corresponding data indicating potential latent heterogeneity? Could we use some statistical measure, instead of referring to theories (which may not give us a consensus), to test whether there is potential latent heterogeneity?

PAHADRIANUS commented 4 years ago

Thank you for sharing this phenomenal work on modeling. It seems to me that the paper has sufficiently demonstrated that when the ordinary GLM applies, the new hdpGLM is a strictly better alternative: it accommodates all the information that a GLM captures and compensates for the GLM's inability to detect latent sub-population heteroskedasticities. I am really interested in how the marvelous new model is implemented computationally. This paper does provide a few examples of the hdpGLM in practice, though primarily replications of previous projects for the purpose of comparison. I wonder if there is any case of brand-new research conducted with the model that we can use as a reference. Do you think the new model has the potential to overwhelmingly replace the GLM as the standard approach?

ydeng117 commented 4 years ago

Thank you for presenting such an impressive model for finding latent heterogeneity. I wonder how precisely this model can detect the heterogeneity. In your example, the model is used for testing the context differences among countries. Nonetheless, can the model also be applied to individuals' characteristics such as socioeconomic status, education level, and ideology?

dongchengecon commented 4 years ago

Thank you so much for the presentation! You have mentioned in the paper a lot of advantages of the hierarchical Dirichlet process of GLM under specific cases. Could you share with us some scenarios in which the hdpGLM method might not be recommended for estimation?

YuxinNg commented 4 years ago

Thank you for the paper! Looking forward to the presentation tomorrow. It is mentioned that the model you are going to introduce is important in political analysis. But political analysis is a huge field, which includes topics like voting, international relations, security, etc. I am curious whether this model has specific constraints, that is, whether there is any area of political research where it does not work well, and whether it can be applied to subjects other than politics. Thanks!

policyglot commented 4 years ago

Yay! The first presentation by someone from the new MACSS faculty for 2019-20! Welcome again.

Quick questions: 1) In the Germany example in the empirical section, you found that 'for most of the states there is no latent heterogeneity.' Under what circumstances is this likely to occur/ in which countries have you observed this trend? 2) In the context of the United States, could such homogeneity be in some ways 'engineered' through the practice of gerrymandering that Gary King discussed last quarter in his workshop? (the link no longer works and thus harpoons how cool this callback could have sounded) 3) Will you be teaching this content next quarter in MACS 30122/1 Political Behavior and Computational Social Science? The course looks quite exciting! (shameless plug)

sunying2018 commented 4 years ago

Thanks for your presentation! This paper applies the hdpGLM model at the country level to capture country-level characteristics. I am wondering whether this method can be applied at a more micro level, such as the neighborhood or community, for other specific research questions, and whether there are additional factors that need to be taken into consideration.

ShanglunLi commented 4 years ago

Thank you for providing such an interesting paper to read! My question is what are the assumptions on the data so that we can use your conclusion on some particular models? Thanks!

MegicLF commented 4 years ago

Thank you for the presentation! I wonder under which cases hdpGLM model may have disadvantages compared to other models such as classical GLMs, GLMM, FMMs, and dpGLM. Do you have any concerns about using these models in empirical research, and what should we be aware of when using them?

YanjieZhou commented 4 years ago

Thanks for your presentation! I am wondering how this unsupervised clustering method is applied in practice. Considering that each cluster is defined according to the latent heterogeneity, will the clusters from this unsupervised method still be interpretable, for example as representing different countries?

ChivLiu commented 4 years ago

Thank you for the presentation! I wonder which specific topics could this model be applied to analyze and whether the topics need to have significant effects internationally. The major ideologies in different countries could be either homogeneous or heterogeneous. Therefore, is this a good topic for this model?

yongfeilu commented 4 years ago

Thank you for your presentation!! The unsupervised clustering method is really impressive and the application of latent heterogeneity is amazing! My question is under which circumstances we can use this method and to which degree this methodology can work? Thanks!!

RuoyunTan commented 4 years ago

Thank you for sharing your paper with us. I really like how you carefully compare the different models. Could you further explain the tradeoffs that we face when selecting from these models?

Yilun0221 commented 4 years ago

Thank you for the presentation! I am really interested in the topics on using Monte Carlo in analyzing public political polarization. I have a question. I think in the process of applying this model to different countries, we should consider the differences among these countries which may reflect the modeling result. However, with the number of countries increasing, the differences may be more and more complicated. What should we do to solve this problem? Thanks!!!

hanjiaxu commented 4 years ago

Thank you very much for presenting! Could you please give us an example of how to apply this model to a specific dataset? In addition, could you please explain how to select those models based on the characteristics of the dataset?

goldengua commented 4 years ago

Thanks for your interesting work. I am concerned with the interpretability of GLMs. I was wondering to what extent we can trust that the clusters represent meaningful factors, and how we can probe the characteristics of the clusters. If we applied dimensionality reduction methods such as PCA, could we still obtain meaningful clusters?

JuneZzj commented 4 years ago

Thank you for presenting. It is interesting to see how latent heterogeneity affects context-dependent models. After comparing GLM, finite mixture models, and hdpGLM, we can observe that the last one performs better in detecting latent effects. I am wondering whether there are specific research questions that can be better answered by a simple GLM or finite mixture models, or what the advantages of those models are compared with hdpGLM. Thank you.

SiyuanPengMike commented 4 years ago

Thanks a lot for this interesting and inspiring paper. This is the first time I have seen an application of the hdpGLM model. Could you please give some concrete examples of the advantages of using this model? What is the biggest improvement of hdpGLM over the simple GLM model? Thanks again for your paper, and I am looking forward to tomorrow's presentation!

Anqi-Zhou commented 4 years ago

Thanks for your great paper and presentation! As many of us have asked already, when should we choose to apply hdpGLM instead of the GLM model? What advantages can this model bring us? I look forward to an inspiring explanation.

HaowenShang commented 4 years ago

Thanks for your presentation! In the paper you introduced a complex hdpGLM and illustrated its relationship with the classical GLMs, GLMM, FMMs, and dpGLM in terms of the structure of the average parameters of the outcome variable. Could you please give some examples of when we should use hdpGLM and when we should use the other models?

yutianlai commented 4 years ago

Thanks for your work and I'm looking forward to your presentation! I'm wondering why hdpGLM is important in the field of political science and what makes it special.

sanittawan commented 4 years ago

Thank you very much in advance for sharing your research with us. I have a very small question, but I think it is going to help me understand the model better. In your model specification on page 5, p() is a distribution in the exponential family. Is there a specific reason that it has to be from an exponential family?

At the end of the paper, you alluded to the selection of covariates into the model which can cause problems with the estimation. Can you please give an example to illustrate this point further?

jsgenan commented 4 years ago

Thanks for sharing this! Although this is not my familiar topic, I'm looking forward to hearing your insights. My question is for the MCMC simulation: besides the results reported in Table 2, how would you summarize the difference between hdpGLM and GLM in other bulks of simulations?

chiayunc commented 4 years ago

Thank you, Professor! It is such a privilege to hear about your work. Since latent heterogeneity is very prevalent and needs to be addressed in many other scientific disciplines, do you anticipate any pitfalls with the new method you proposed? What are its advantages and disadvantages? Thank you!

jtschoi commented 4 years ago

Thank you very much for your presentation in advance! Is this method compatible with multiple hypotheses testing (and related techniques)?

bazirou commented 4 years ago

Thanks for your work and I'm looking forward to your presentation!

mingtao-gao commented 4 years ago

Thank you in advance for the presentation! It is very interesting in the paper how you have applied the hierarchical Dirichlet process of GLM for specific cases. Can you share some insights on some other cases that such a process can be applied successfully and what might be the cases that hdpGLM may not be well-applied for estimation?

hihowme commented 4 years ago

Thanks a lot in advance for your presentation! I am wondering whether you had any criteria for choosing the two papers. Also, how can we test for latent heterogeneity under your model? Thanks a lot.

fulinguo commented 4 years ago

Thanks very much for your presentation. I am curious about how this paper is related to statistical decision theories. Could you elaborate more about how you consider uncertainties of statistical models and what criteria you used to estimate the efficiency of the general methods in the paper? Thanks!

caibengbu commented 4 years ago

Thanks for your work and I'm looking forward to your presentation!