uchicago-computation-workshop / Fall2020

Repository for the Fall 2020 Computational Social Science Workshop

11/12: Brooke Luetgert #8

Open ehuppert opened 3 years ago

ehuppert commented 3 years ago

Comment below with questions or thoughts about the reading for this week's workshop.

Please make your comments by Wednesday 11:59 PM, and upvote at least five of your peers' comments on Thursday prior to the workshop. You need to use 'thumbs-up' for your reactions to count towards 'top comments,' but you can use other emojis on top of the thumbs up.

rkcatipon commented 3 years ago

Dr. Luetgert, thank you for sharing your paper! Many of us have had the chance to play with the Gapminder dataset in our Perspectives course, so it's great to see an application of the dataset. My question is about your paper's approach to missingness. What motivated 50% as the threshold for missingness? Was this based on previous iterations of clustering with PCA and K-means, or was this choice guided more by intuition? When I read your paper, I wondered whether missingness in some variables is linearly correlated with other variables, and whether that might be useful knowledge.
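One way to make the threshold question concrete is the sketch below. On made-up toy data, it drops indicators whose share of missing values exceeds a cutoff, then correlates a missingness flag with an observed variable. The column names, the data, and the reading of the 50% rule as "drop an indicator when more than half of its values are missing" are all assumptions for illustration, not the paper's actual procedure.

```python
import math

# Hypothetical toy data: each row is a country-year, None marks a missing value.
rows = [
    {"gdp": 1.0, "literacy": 0.9, "hdi": None},
    {"gdp": 2.0, "literacy": None, "hdi": None},
    {"gdp": 3.0, "literacy": 0.7, "hdi": None},
    {"gdp": 4.0, "literacy": 0.6, "hdi": 0.8},
]

def missing_share(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def keep_columns(rows, threshold=0.5):
    """Keep columns whose missing share does not exceed `threshold` (assumed rule)."""
    return [c for c in rows[0] if missing_share(rows, c) <= threshold]

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

kept = keep_columns(rows)

# Missingness flag for one indicator, correlated with an observed variable.
hdi_missing = [1.0 if r["hdi"] is None else 0.0 for r in rows]
gdp = [r["gdp"] for r in rows]
r_missing_gdp = pearson(hdi_missing, gdp)

print(kept, round(r_missing_gdp, 3))
```

In this toy data the flag is negatively correlated with `gdp`, so lower-income rows are the ones missing `hdi`. A pattern like that in real data would suggest the missingness is itself informative.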

sabinahartnett commented 3 years ago

Thank you for sharing your work with us Dr. Luetgert!

My question is about opportunities that this big data set could provide to compensate for countries with missing data. For the countries with "particularly weak" data (that is, where there is not sufficient data, whether geographically or chronologically), do you think there is an opportunity, based on your findings, to 'match' these countries with others in a given time and place and use that as a projection of potential developments and threats to the country?

For example, suppose country A in 2018 did not meet the 50% data threshold, but the data that was there matched patterns in country B (or a group of countries) in 2012. Do you think it would then be valuable to track the progression of patterns (say, democratic stability) in country B over the following 5 years as a potential prediction for country A?

skanthan95 commented 3 years ago

Thank you for presenting at our workshop! I was struck by how accessible your paper was in its methodological descriptions (e.g., explaining the implications of the bend in Figure 1 and the pros and cons of using K-Means clustering), and hope that this level of accessibility will become more common in the computational social sciences.

Do you think that this lack of accessibility is an issue in the field as a whole, or has it already been rectified?

Some additional questions about your project:

bakerwho commented 3 years ago

Thanks for your presentation!

From my understanding, you are using PCA to identify some type of 'weighting' on the set of global development indicators that explains the most variance in those scores. This allows us to attach some type of real-world meaning to principal components. For example, if the largest weights for Component #1 are on economic and healthcare indicators, we can infer that this component captures economic and health conditions. This opens avenues for many interesting analyses. For example, to analyze temporal changes, you could trace the trajectory of a single country over time.
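The loading-based reading described above can be sketched on toy data. This is a minimal illustration and not the paper's pipeline: the three indicator names and all values are hypothetical, and the first principal component is found by power iteration on the correlation matrix so the example needs only the standard library.

```python
import math

# Hypothetical indicators per country: GDP per capita, life expectancy, rainfall.
names = ["gdp_per_capita", "life_expectancy", "rainfall"]
data = [
    [1.0, 50.0, 3.0],
    [2.0, 55.0, 1.0],
    [3.0, 61.0, 4.0],
    [4.0, 64.0, 1.5],
    [5.0, 70.0, 3.5],
    [6.0, 76.0, 2.0],
]

def standardize(data):
    """Center each column and scale it to unit sample variance."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(row[j] - means[j]) / sds[j] for j in range(len(row))]
            for row in data]

def covariance(z):
    """Sample covariance of already-centered data (a correlation matrix here)."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(cov, iterations=500):
    """Power iteration: leading eigenvector (the loadings) and its eigenvalue."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v[i] * cv[i] for i in range(p))
    return v, eigenvalue

corr = covariance(standardize(data))
loadings, eigenvalue = first_component(corr)
share = eigenvalue / len(names)  # trace of a correlation matrix equals p

for name, weight in zip(names, loadings):
    print(f"{name}: {weight:.2f}")
print(f"share of variance explained: {share:.2f}")
```

In this toy data the first component loads almost equally on the two correlated indicators and barely on the third, so one would read it as an "economic and health conditions" axis. Projecting a country's standardized rows for successive years onto `loadings` would give the single-country trajectory mentioned above.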

In your section titled 'Defining the Problem', you discuss a number of concepts like populism, 'fractionalization', 'cleavages', and diversity. This was a very interesting section, but it left me confused. It would seem that different countries might have very different experiences of, and responses to, the social phenomena that these concepts capture. My question is this - how do you operationalize the complexity of socio-economic, political, and ethical realities in terms of these global development indicators? For example, based on some of your results - what does it mean to belong to the n-th cluster in the PCA space?

Apologies if my question is too metaphysical!

SoyBison commented 3 years ago

Thanks for coming to the workshop! I'd like to ask about the correlates with the PCA components that you found. In the more successful decompositions, I'd be interested in seeing which human-readable features are the most explanatory in accommodating other human-readable features.

Thanks!

a-bosko commented 3 years ago

Thank you for sharing your paper with us, Dr. Luetgert! It was very interesting to read about variations in social and economic divisions across countries. Global development is an interesting and important topic, and I am glad to see computational methods being applied to this field of study.

In the Going Forward section of the paper, it is discussed how PCA and K-means are powerful, but they may lack a substantive interpretation. In your opinion, what is the best way to accomplish this, especially when looking at dynamic structures? Also, what are the best methods to analyze and find trends in the data?

Thank you!

MengChenC commented 3 years ago

Thank you for this amazing paper. It provides a new approach for researchers working on analyzing and simplifying data with a great many indicators.

Since K-means is restricted by preset and linear cluster boundaries, I am curious about your view on harnessing other clustering algorithms along with PCA, for example multiple logistic regression or QDA. Would they be better candidates when reducing dimensions, and what are the potential risks? Thank you.

hesongrun commented 3 years ago

Thanks for the wonderful presentation. The unsupervised learning approach is truly promising. I am wondering if there is some intuitive interpretation of the PCs you extract. How can we make better sense of the clusters that K-means produces for different countries? Thanks!

nwrim commented 3 years ago

Thanks for coming to the workshop! Similar to the question MengChenC asked, I was curious about why you decided to use PCA for the dimensionality reduction and k-means for the clustering algorithm. Why not other dimensionality reduction techniques like t-SNE, or clustering algorithms like GMM? I am also curious whether we can use hierarchical structures to cluster countries. Would using clustering algorithms with hierarchy make sense for the data?

william-wei-zhu commented 3 years ago

Thank you for your paper. From a developmental state perspective, political systems are tools for a country to maximize its economic potential. A democratic political system may be compatible with certain types of economies, but incompatible with others. One important variable that may be missing from your analysis is the dominant type of industry that drives a country's economy. For example, country A and country B may have the same GDP and other quantitative development indicators, but if country A's economy mainly relies on raw resource extraction (e.g. oil extraction, agriculture), while country B's economy relies on post-industrial sectors (e.g. tech innovation, financial services), then country B is significantly more likely to be a democratic state than country A. How does your research project characterize different types of economies for countries?

MkramerPsych commented 3 years ago

Thank you for sharing your research with us! In line with my colleagues, I am curious about a potential hierarchical approach. Your conclusion makes it clear that individual differences in informational quality across countries could potentially bias any result based on this analysis. I am specifically curious if you could use a hierarchical approach to segment these countries into different groups based on some metric of informational quality and then perform dimensionality reduction on each group and compare results across those groups.

Dxu1 commented 3 years ago

Thank you for an exciting paper! I have two questions: 1) This question is similar to the question William Zhu asked. Currently, the k-means algorithm requires the number of clusters as a pre-set input. I am curious how this algorithm could be expanded (or a new algorithm implemented) to let the data itself reveal country-sectoral (or other dimensional) heterogeneity without setting the groups beforehand. 2) Do you see machine-learning-based measurement replacing existing measurements in the future? You have mentioned some important limitations of using clustered data as measurement, including but not limited to difficulty in replication (results could vary by cluster group) and sensitivity to newly added data (an updated entry or a new time series). Furthermore, compared to traditional methods, it is more complicated to interpret, and the interpretations seem to have less clarity. How would you compare traditional vs machine-learning-based macro-indicators?

NaiyuJ commented 3 years ago

Thanks for sharing! I think this is a fantastic political methodology paper, but I don't see much about comparative politics here. Democratic backsliding is a really good topic, and I think big data can help a lot with it.

One thing I found very interesting is that, in contrast to other politics papers, there are no dependent variables in this study. But I feel that you're trying to argue that this method can show us how changes in social cleavages contribute to democratic backsliding. The explanation is not very clear to me. The performance of this method is attractive, but how can it be better applied to comparative studies when we want to measure or sense the backsliding?

P.S.: I recently read a paper, "Democracy in America? Partisanship, Polarization, and the Robustness of Support for Democracy in the United States" by Matthew Graham and Milan Svolik, in which the authors use an experiment to show the tendency toward democratic backsliding. To my mind, these two studies use two different methods to study similar issues. That's interesting!

afchao commented 3 years ago

Thank you for sharing your draft! My question isn't very technical - I'm curious about the line in your conclusion about substantive interpretation: are results available about the issues which tend to be represented in the leading principal components?

egemenpamukcu commented 3 years ago

Thank you for presenting your interesting work!

My question is about the potential negative real-world socio-economic consequences of such computationally enhanced, high-dimensional predictive research. I understand you to argue that clustering countries into groups and imputing certain missing variables can partly help deal with incomplete data, in the same way that it can help predict future macro-level indicators for those countries. So my question is, do you think this can potentially lead to self-fulfilling prophecies about underdeveloped countries and their socio-economic trajectories?

As you said, most of the missing data, especially the ones indicating subjective measures like the HDI and Social Group Equality, are from certain countries (I am assuming most of those are the Bottom Billion countries), so would grouping a country with its apparent 'counterparts' (and merging/imputing data) hinder that country's development and reform efforts? Perhaps it would lead foreign investors and aid agencies (which those countries heavily rely on) to not invest in those countries because high-dimensional predictive models show no prospect of growth, leading to a vicious cycle of underdevelopment that low-income countries that actually undertook important reforms would have difficulty breaking out of. I am not very familiar with the technical details of any of the methods you leveraged in the paper, but this seems to me a predicament worth looking into.

k-partha commented 3 years ago

Thanks for presenting! Where in the causal chain do you see these PCA components belonging? Do you think we can overlay institutional systems (referencing Acemoglu and Robinson) over these components to potentially uncover economic building blocks?

hhx2207061197 commented 3 years ago

Thanks for sharing. My question is: for those countries with "particularly weak" data, do you think there is an opportunity to "match" these countries with other countries at a certain time and place based on your findings, and use this as a prediction of the country's potential development and threats?

adarshmathew commented 3 years ago

Thank you for presenting your paper at our workshop!

The data source you created combining the various sources sounds impressive (I didn't know about the CREG data), and I'd love to play around with it, if you choose to publish it. And your literature review pointed me to papers on fractionalization and conflict that I didn't know I needed until I read your paper.

Thank you once again! I hope your work interrogating these indices leads to a better framework that is able to synthesize them, or even leads to a rethink of how we measure them.

Yutong0828 commented 3 years ago

Thanks for presenting your work! It is very interesting. I have two questions for you.

  1. You mentioned that you could achieve more feasible indicators after removing some poorly recorded countries like those in the Middle East and Africa. I was wondering whether such elimination will lead to bias in the model. How will you decide which countries to include and which to exclude from the analysis? I often feel that in such a process it is hard to strike a balance between high data quality and more comprehensive information.
  2. What do you think these two PCAs and four clusters may suggest? Could you please explain more about the implications of these factors/categories? Thanks!

Anqi-Zhou commented 3 years ago

Thanks for sharing! My question is which human-readable features should be the most explanatory in accommodating other human-readable features? Why is that?

LFShan commented 3 years ago

Thank you for the presentation. I totally agree that democratic backsliding is happening around the world. I think it is really innovative to use a lot of indices and observe their general trends. KNN was used to mitigate the many missing values in the data. I would like to know whether, in addition to using KNN, we can use the fact that there are missing values to learn more about the country (political instability, an inefficient statistics bureau, etc.).

JadeBenson commented 3 years ago

Thank you! I find it a very interesting problem how we represent patterns over space and time when we have constantly changing measures. My question is - is there any way to capture how these measures are changing as a way to represent social changes? My idea would be that the addition of different racial categories to a census is in itself an interesting indicator of how a society is shifting over time.

Bin-ary-Li commented 3 years ago

Thank you, Dr. Luetgert. This is very interesting work. Love to see new engineering methods being employed by social scientists. To my understanding, PCA and K-means both require knowing in advance the number of clusters to partition the data into. I wonder if you have tried any non-parametric methods that don't require that (e.g. Dirichlet Process).

Yilun0221 commented 3 years ago

Thank you for the presentation, Dr. Luetgert! I am very interested in the dimensionality in your work. My question is about dimensionality and PCA. People are trying to represent different perspectives of human life with data, as you have mentioned in the paper. However, PCA is used to pick out the more influential features. I am somewhat confused about how these two principles work together or are balanced in your research.

hihowme commented 3 years ago

Thanks for your presentation! I am wondering whether you have any ideas about how we should apply those methods in economics research. Thanks a lot!

Lynx-jr commented 3 years ago

Thanks for sharing! Although I am a huge fan of K-means and PCA, I had basically the same question as @nwrim while reading, that is -- why not other dimensionality reduction techniques or clustering algorithms? But I might be able to answer this question myself... the author only wants to see how K-means and PCA might fail and explore their applications, so it's rather about "data fits the method"...

jsoll1 commented 3 years ago

Hi, thanks for sharing your awesome paper! In the first-year MACSS classes, we've been covering ethics lately, so my question is based on that. One of the important things to do in studies is apparently to try to prioritize Justice, which in a lot of cases is interpreted as equity of opportunity to benefit from the results of the research by being included in the study. From that framework, how do you balance the suggestion to move these kinds of rich studies away from datasets that include more unstable, poorly recorded countries with the idea that results from these studies could be useful for them?

Qiuyu-Li commented 3 years ago

Thank you for coming to our workshop. My question is: based on my understanding, the K-means clustering rests on the correlation between observations or variables. Do you think that your predicted result might just be a vector of existing data, instead of increased variation?

bazirou commented 3 years ago

Thanks for the wonderful presentation. I'm wondering how you select the initial points of K-means since the selection of initial points has a huge influence on the result.
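The sensitivity to starting points can be seen on a tiny made-up example. The sketch below is a plain one-dimensional Lloyd's algorithm with hypothetical points and starting centers, not the paper's setup. It runs K-means from two different initializations and compares the resulting within-cluster sum of squares (inertia). A common remedy, assumed here for illustration, is to run several initializations and keep the one with the lowest inertia.

```python
def kmeans_1d(points, centers, iterations=100):
    """Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)),
                          key=lambda i: (x - centers[i]) ** 2)
            clusters[nearest].append(x)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min((x - c) ** 2 for c in centers) for x in points)
    return centers, inertia

# Hypothetical data with three obvious groups.
points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

# Two hypothetical initializations: one spread out, one bunched at the left.
spread_init = [0.0, 10.0, 20.0]
bunched_init = [0.0, 1.0, 10.0]

good_centers, good_inertia = kmeans_1d(points, spread_init)
bad_centers, bad_inertia = kmeans_1d(points, bunched_init)

# Keep the run with the lowest inertia across all tried initializations.
runs = [kmeans_1d(points, init) for init in (spread_init, bunched_init)]
best_centers, best_inertia = min(runs, key=lambda run: run[1])

print(good_centers, good_inertia)
print(bad_centers, bad_inertia)
```

The bunched start converges to a solution that splits one true group in two and merges the other two, and it never recovers because each step only moves centers locally. Seeding schemes such as k-means++ address the same problem by spreading the initial centers out.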

bjcliang-uchi commented 3 years ago

  1. Why are variables like GDP and total population considered independent in the data when they are known to be highly correlated?
  2. What are outlier countries in the k-means cluster? Given that GDP and population are included, I suppose that countries like Russia, the U.S., China, would be outliers, then what can the cluster on the rest of the countries tell us?
  3. Is there any substantive meaning of the selection of k? If not, I suppose that there are many ways to categorize countries by their similarities and differences, and how can this clustering method improve our understanding of global development?

tianyueniu commented 3 years ago

Thank you for your presentation! I look forward to learning more about how you interpret the clustering results.

yutianlai commented 3 years ago

Thanks for sharing! I'm wondering why you found unsupervised machine learning specifically useful.

xxicheng commented 3 years ago

Thanks for sharing. I have a similar question to @Qiuyu-Li's. What do you think of the ethics questions in this research? What specific research design did you use to solve those problems?

NikkiTing commented 3 years ago

Thank you for sharing your work! One of the things you've mentioned in going forward is that to have a substantive interpretation there is a need "to look closer into the issue areas that are captured by the first and second features." Would you have any hypothesis on which specific issue areas might be significant to look into for future research?

YanjieZhou commented 3 years ago

Thanks very much for your presentation! I am wondering whether unsupervised machine learning can really help compensate for missing data in those countries, considering that cultural differences often play an important role in social science research.

goldengua commented 3 years ago

I was wondering why you chose PCA for dimensionality reduction and K-means for clustering. How much variance could be explained after you performed PCA, and do these features have a reasonable explanation?

FrederickZhengHe commented 3 years ago

Thanks for sharing this marvelous paper! I have had no specific question so far, but look forward to attending your presentation tomorrow!

Leahjl commented 3 years ago

Thank you for the presentation. I wonder how you deal with the missing data for certain countries.

romanticmonkey commented 3 years ago

Thank you for your presentation. I'm not an expert in politics, but I was wondering: would data on culture and religion also play a part in your line of study?

JuneZzj commented 3 years ago

Thank you for presenting. The difference between applying clustering and PCA in your analysis is quite detailed. I am wondering if this distinction between the two methods also seems to be apparent in other social science circumstances.

vinsonyz commented 3 years ago

Thanks for your presentation! How would you interpret figure 6: K-Means estimate of country clusters in 3D?

cytwill commented 3 years ago

Thanks for this interesting presentation. This is a good example of computational data science. I think PCA is quite a commonly used method in this field. As you mentioned in the summary, I am wondering if there would be some new indicators that can be defined through your PCA results, i.e., how should we interpret the principal components from your results? Similarly, you mentioned that "our theoretical work is more advanced than our indicators", but PCA results are actually generated from these old indicators, so do you think they can indeed bring us new information?

alevi98 commented 3 years ago

Thank you for the presentation! I found the section discussing cleavages and the backsliding of democracy fascinating and highly relevant! I'm wondering (like many others) what approaches you are taking to solve the issue of some countries lacking sufficient data. Are there ways to infer particular fields based on measures in comparable countries (geographically similar, geopolitically similar, etc.)? Another thing I was wondering: have you used this computational model on other units of division beyond the nation-state? I'd imagine it's harder to find datasets that are as robust for regions or provinces, but how would this change the heterogeneity and cleavages? Has someone run this (or a similar) model within the U.S. on state or even county-level data?

luxin-tian commented 3 years ago

Thanks for sharing. I wonder how k could be interpreted and whether it reveals any insights into global development.

mingtao-gao commented 3 years ago

Thank you in advance for your presentation! This paper provides a good big data approach to assessing global development using PCA and k-means. Because I'm not from a political science background, my question is related to the indicators used in the approach. How are democracy (and social group equality, religious freedom, etc.) measured in the dataset?

bowen-w-zheng commented 3 years ago

Thanks for sharing your work! My question is in some ways similar to previous questions about different model specifications. ML methods have all kinds of parameters to set. There seems to be no foundational guideline on how to systematically explore these parameters. How does this affect the research conclusion, especially in social science where robust inference is preferable?

Raychanan commented 3 years ago

Why do you use PCA and K-means methods? Are these two methods too simple? Are there any particular advantages of using these two methods in this issue compared with other methods? Thanks!

Yaweili19 commented 3 years ago

Thank you in advance for your presentation. I think your selection of research topics is very interesting. As we continue to collect and expand the scale of various human development indicators, we really need to pay more attention to how to better use them to improve our empirical research. I read the reference materials in a hurry, and I did not understand how you established the links between the indicators at different levels. Hope to find the answer from your lecture.

jinfei1125 commented 3 years ago

Thanks for your paper! It's very interesting! In the data section, you said you collected data on 30 indicators from World Bank/IMF, Freedom House, WHO, UN, ILO, IDEA, Polity IV, Gapminder and CREG. These variables are diverse, covering climate change, civil rights, crime rates and so on. But I am curious why you chose these variables and data sources. I know there are thousands of variables available from the World Bank, so why did you choose these 30 variables instead of others? Thank you! P.S. Page 8 is blank. Is there supposed to be something there?

yongfeilu commented 3 years ago

Thanks for your presentation! Could you please explain a bit more about why you chose the PCA and K-means methods? How do these methods help you tackle your research questions effectively?