socius-org / sentibank

Encyclopedic Hub for Sentiment Dictionaries
https://doc.socius.org/sentibank/about

Sentiment Dictionary for Organisational Culture & Environment #1

Closed nick-sh-oh closed 3 months ago

nick-sh-oh commented 11 months ago

While collecting and processing, we realised that most existing sentiment dictionaries are applicable either to the general domain (e.g. VADER) or to the financial domain (e.g. MASTER). Other than the domain of political science (e.g. the Manifesto Corpus), there are no existing sentiment dictionaries for other social domains. But as Loughran and McDonald (2011) commented, 'words have many meanings, and a word categorisation scheme derived for one discipline might not translate effectively into (other) discipline'.

We propose building a sentiment dictionary that measures sentiment in textual data relevant to organisational culture and environment. Here is a brief research design sketch:

1. Data Collection:

2. Filtering:

3. Expanding Verb-forms: Suppose we filtered “promote personal growth” from the previous step. We consider variations of “promote” and expand n-grams by adding “encourage personal growth”, “advance personal growth”, “assist personal growth”, “aid personal growth”, and so on.

4. Labelling:
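Step 3 above could be sketched as follows. The synonym map is a hand-written placeholder; an actual implementation might draw verb variants from a thesaurus or an embedding model instead:

```python
# Sketch of step 3: expanding n-grams by substituting verb variants.
# VERB_SYNONYMS is an illustrative, hand-written placeholder, not a real resource.
VERB_SYNONYMS = {
    "promote": ["encourage", "advance", "assist", "aid"],
}

def expand_verb_forms(ngram):
    """Return the original n-gram plus variants with the leading verb swapped."""
    head, _, rest = ngram.partition(" ")
    variants = [ngram]
    for synonym in VERB_SYNONYMS.get(head, []):
        variants.append(f"{synonym} {rest}" if rest else synonym)
    return variants

expand_verb_forms("promote personal growth")
# ['promote personal growth', 'encourage personal growth',
#  'advance personal growth', 'assist personal growth', 'aid personal growth']
```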

ychoi08 commented 11 months ago

More on labelling:

nick-sh-oh commented 10 months ago

Glassdoor

Analysed 5,259,571 Glassdoor reviews from 540 organisations (2,629,790 positive and 2,629,781 negative). Extracted n-grams and kept those that occurred at least 1,000 times, using Python:

import re
from collections import Counter

def generate_ngrams(sentences, n):
    """Return all word-level n-grams (as space-joined strings) across sentences."""
    ngram_tuples = []
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            ngram_tuples.append(tuple(words[i:i + n]))

    # Join each n-gram tuple back into a space-separated string
    return [' '.join(tokens) for tokens in ngram_tuples]

# Matches punctuation, symbols and digits; n-grams containing any are discarded
special_char_pattern = re.compile(r"[!@#$%^&*()_+{}\[\]:;<>,.?~\\|'`0-9]")

def process(ngrams):
    # Count the occurrences of n-grams
    cnt = Counter(ngrams)

    # Create a new list to store processed n-grams
    processed_ngrams = []

    # Iterate over the n-grams
    for ngram, count in cnt.items():
        if count >= 1000:  # Keep n-grams that occur at least 1,000 times
            if not special_char_pattern.search(ngram):  # Skip n-grams with special characters or digits
                processed_ngrams.append(ngram)  # Add to the processed list

    return processed_ngrams

As a result, there were 4,645 n-grams extracted from the positive reviews (3,146 bigrams, 1,187 trigrams and 312 four-grams), and 6,957 n-grams extracted from the negative reviews (5,075 bigrams, 1,631 trigrams and 251 four-grams). This provided a starting point of frequently mentioned n-grams across a large sample of employee reviews. From this filtered list, @ychoi08 is manually inspecting and categorising n-grams that provide meaningful sentiment across the reviews.

nick-sh-oh commented 10 months ago

A few thoughts on Measuring Corporate Culture Using Machine Learning

The authors constructed a 'relatively exhaustive culture dictionary' of words and phrases that appear in close association with the five values most often mentioned by S&P 500 firms on their corporate websites (Li et al., 2021): Innovation, Integrity, Quality, Respect, and Teamwork.

As @ychoi08 pointed out, we have more to add. While Li et al. developed an innovative culture dictionary based on corporate websites and earnings calls, we aim to go significantly wider and deeper in capturing organisational culture. First, rather than focusing on just the five most common values mentioned on websites, we will incorporate a broader, more comprehensive set of cultural values for greater coverage, starting from the Big 9 Cultural Values identified by Culture 500.

Second, we will create rich sub-themes under each value, fleshing out their meaning at a more granular level. For example, "innovation" can be expanded into concepts like creativity, vision, adaptation, and so on. This nested hierarchy will provide precision and nuance beyond surface-level values.
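The nested hierarchy could be represented as a simple mapping from value to sub-themes. The entries below are illustrative only, not the finalised taxonomy:

```python
# Illustrative sketch of the value -> sub-theme hierarchy described above.
# Sub-themes beyond the "innovation" example are placeholders.
CULTURE_HIERARCHY = {
    "innovation": ["creativity", "vision", "adaptation"],
    "teamwork": ["collaboration", "communication"],  # placeholder sub-themes
}

def sub_themes(value):
    """Return the sub-themes nested under a top-level cultural value."""
    return CULTURE_HIERARCHY.get(value, [])
```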

Finally, our lexicon will draw from Glassdoor reviews, going straight to the source: employees themselves. This shift in vantage point from corporate messaging to individual experiences should surface more organic and unfiltered language describing real organisational culture. The words and phrases that emerge directly from employees are likely to be more meaningful, honest, and indicative of true workplace culture.

A quick comparison of the construction methods

Data:

- Li et al. (2021): 209,480 earnings calls from Thomson Reuters' StreetEvents database over the period 2001–2018, from 7,501 unique firms
- sentibank: 5,259,571 employee reviews from Glassdoor (2,629,790 positive and 2,629,781 negative) over the period 2018–2022, from 540 unique firms

Extraction method, Li et al. (2021):

1. After lemmatising the earnings calls, used the dependency parser from the Stanford CoreNLP library to identify multiword expressions and compound words. Then used the phraser of the gensim library to find statistically significant bi- and tri-grams (e.g. "forward-looking statement" and "beat (a) dead horse").

2. Trained a Word2Vec model using the gensim library, converting each of the 764,276 words/phrases in the corpus into a 300-dimensional vector that represents the meaning of that word or phrase.

3. From the trained Word2Vec model, retrieved the word vectors for the five most commonly mentioned corporate values on S&P 500 websites - innovation, integrity, quality, respect, and teamwork - and their associated seed words from Guiso, Sapienza and Zingales (2015).

4. Manually inspected the seed words, originally compiled from corporate websites (Guiso, Sapienza and Zingales, 2015), to ensure they transfer appropriately to the different genre (earnings calls) while remaining within the domain of corporate culture (see Li et al., 2021, p. 3275 on how the authors excluded and added words).

5. For each of the five corporate values, computed the average of the seed word vectors, then calculated the cosine similarity between these averaged vectors and every unique word in the earnings calls. The top 500 most similar words per value were selected and further manually screened to exclude non-fitting terms.

Extraction method, sentibank: TBD
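The final step of Li et al.'s pipeline (averaging seed vectors, then ranking the vocabulary by cosine similarity) can be sketched with plain NumPy. The 3-dimensional vectors and two-word vocabulary below are toy stand-ins for the real 300-dimensional Word2Vec embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_value(seed_vectors, vocab_vectors, top_k=500):
    """Average the seed vectors for one value, then rank every vocabulary
    word by cosine similarity to that averaged vector."""
    value_vector = np.mean(seed_vectors, axis=0)
    scored = [(word, cosine_similarity(value_vector, vec))
              for word, vec in vocab_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy stand-ins for trained embeddings (illustrative values only).
seeds = [np.array([1.0, 0.0, 0.0]), np.array([0.8, 0.2, 0.0])]
vocab = {
    "creativity": np.array([0.9, 0.1, 0.0]),
    "overhead":   np.array([0.0, 0.0, 1.0]),
}
ranked = rank_by_value(seeds, vocab, top_k=2)
# "creativity" ranks first: it aligns with the averaged seed vector
```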

Embedding Models?

Li et al. (2021) commented that the Word2Vec model provides an 'effective way to quantify the semantics, rather than merely the syntactic, at the expression level'. But a fair criticism is that Word2Vec has limitations in capturing true semantic meaning. Li et al. (2021) defined "neighbour words" based on a narrow context window of five words (p. 3274). The five-word context window and frequency threshold illustrate how Word2Vec captures statistical co-occurrence patterns within a small context rather than deeper natural language understanding, which limits it as a full semantic model. Further, Word2Vec cannot disambiguate homonyms or polysemous words with multiple context-dependent meanings, since it assigns a single vector to each word regardless of context.
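The polysemy point can be made concrete with a toy comparison. A static lookup table (the Word2Vec setting) returns the same vector for "bank" in both contexts, whereas a context-sensitive representation, here crudely approximated by averaging in the context words' vectors (real contextual models do far more), distinguishes the two uses. All vectors are illustrative placeholders:

```python
import numpy as np

# Toy static embeddings (illustrative placeholders, not trained vectors).
STATIC = {
    "bank":    np.array([0.5, 0.5]),
    "river":   np.array([1.0, 0.0]),
    "savings": np.array([0.0, 1.0]),
}

def static_vector(word, context):
    """Word2Vec-style lookup: the context is ignored entirely."""
    return STATIC[word]

def contextual_vector(word, context):
    """Crude context-sensitive stand-in: average the word's vector with
    its context words' vectors."""
    vectors = [STATIC[word]] + [STATIC[w] for w in context if w in STATIC]
    return np.mean(vectors, axis=0)

same_static = np.array_equal(static_vector("bank", ["river"]),
                             static_vector("bank", ["savings"]))
same_contextual = np.array_equal(contextual_vector("bank", ["river"]),
                                 contextual_vector("bank", ["savings"]))
# same_static is True; same_contextual is False
```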

We can perhaps use contextual embedding models, such as BERT, to identify words associated with certain values (and sub-values).