
Extension: Austin C. Kozlowski, Matt Taddy, James A. Evans (2018) #20


bhargavvader commented 5 years ago

Kozlowski et al. (2018) take a very interesting approach to understanding culture through text. Word embeddings were developed as a tool to better represent words in a high-dimensional space and to support vector operations on the projected words. While traditional work with word embeddings used them for tasks such as finding the closest related entity or learning phrases, the nature of the space itself was not examined in as much detail.

They propose a new computational approach built on these word embedding models. If we had to state the research question the article addresses, it would be: "Can one analyse cultural semantics, cultural change, and cultural differences using word embedding models?", or "How would one analyse cultural semantics, cultural change, and cultural differences using word embedding models?"

Their idea is quite ingenious in the way it builds on differences in these high-dimensional vector spaces; in particular, certain dimensions encapsulate certain social characteristics. The gender dimension, for example, is represented by vector operations such as "man - woman" or "he - she", which capture how gender is encoded in words in text. By averaging five or six such pairs that the authors believe best represent the dimension, they can project words onto this dimension and see where along it each word lies: a positive cosine similarity places the word on the masculine side of the dimension, a negative one on the feminine side. The authors also note that these signs depend on whether "man - woman" or "woman - man" was used to create the dimension.
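A minimal sketch of this dimension-building and projection step, assuming a hypothetical dictionary `vecs` that maps words to embedding vectors (the pair list here roughly follows the paper's gender example, but the exact pairs are my assumption):

```python
import numpy as np

def build_dimension(pairs, vecs):
    """Average the difference vectors of antonym pairs (e.g. ("man", "woman"))
    to form a single cultural dimension, then normalize it."""
    diffs = [vecs[a] - vecs[b] for a, b in pairs]
    dim = np.mean(diffs, axis=0)
    return dim / np.linalg.norm(dim)

def project(word, dim, vecs):
    """Cosine similarity between a word vector and the (unit-norm) dimension:
    positive -> closer to the first pole, negative -> closer to the second."""
    v = vecs[word]
    return float(np.dot(v, dim) / np.linalg.norm(v))

gender_pairs = [("man", "woman"), ("he", "she"), ("him", "her"),
                ("his", "hers"), ("boy", "girl"), ("male", "female")]
# gender_dim = build_dimension(gender_pairs, vecs)
# project("engineer", gender_dim, vecs)  # sign depends on the pair ordering
```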

From the above explanation it is quite clear what kinds of extensions this method allows. We can use these qualitative dimensions to analyse where words lie along them; the authors mention a few examples themselves, such as how the same word falls in different places on dimensions trained on different corpora (e.g., the word "worker" on the class dimension in the USA vs. England). The authors used only English-language texts in their analysis, but the method can easily be applied across multiple languages if multilingual researchers are working on the dataset (a sketch follows below). This opens up an entire realm of socio-lingual analysis: understanding how different languages represent words with similar semantics but possibly very different contexts, and potentially providing insights into how society influences language and vice versa.
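As a hypothetical illustration of the cross-lingual extension, the same helpers from the previous snippet work on any language's embeddings, given a dictionary `vecs_fr` of, say, French vectors; the anchor pairs below are illustrative assumptions, not pairs from the paper:

```python
# Translated antonym pairs for a French "class" dimension (assumed pairs).
class_pairs_fr = [("riche", "pauvre"), ("richesse", "pauvreté"),
                  ("opulent", "misérable")]

# Reusing build_dimension / project from the snippet above:
# class_dim_fr = build_dimension(class_pairs_fr, vecs_fr)
# project("ouvrier", class_dim_fr, vecs_fr)  # "worker" on the French class dimension
```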

Another extension might be to move words along these dimensions and change their original location in the space. For example, can we project the word "programmer" onto the gender dimension and move it towards the feminine end of the spectrum, so as to remove the bias in the word? This could be a way to fix biased word embedding models and vectors.
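A rough sketch of this "moving a word along a dimension" idea, reusing `gender_dim` from the first snippet; this is not the authors' method, just one way to shift a vector's component along the dimension (similar in spirit to hard-debiasing approaches):

```python
import numpy as np

def shift_along_dimension(vec, dim, alpha=1.0):
    """Subtract a fraction `alpha` of vec's component along the unit vector dim.
    alpha=1.0 removes the gendered component entirely (neutralizes the word);
    alpha>1.0 pushes it past neutral towards the opposite pole."""
    dim = dim / np.linalg.norm(dim)
    component = np.dot(vec, dim) * dim
    return vec - alpha * component

# programmer_shifted = shift_along_dimension(vecs["programmer"], gender_dim, alpha=1.0)
```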

Kozlowski, Austin C., Matt Taddy, and James A. Evans. "The Geometry of Culture: Analyzing Meaning through Word Embeddings." arXiv preprint arXiv:1803.09288 (2018).