socius-org / sentibank

Encyclopedic Hub for Sentiment Dictionaries
https://doc.socius.org/sentibank/about

Sentiment Dictionary for Organisational Culture & Environment #1

Closed nick-sh-oh closed 3 months ago

nick-sh-oh commented 11 months ago

While collecting and processing, we realised that most existing sentiment dictionaries are applicable either to the general domain (e.g. VADER) or to the financial domain (e.g. MASTER). Other than the domain of political science (e.g. the Manifesto Corpus), there are no existing sentiment dictionaries for other social domains. But as Loughran and McDonald (2011) commented, 'words have many meanings, and a word categorisation scheme derived for one discipline might not translate effectively into (other) discipline'.

We propose building a sentiment dictionary that measures sentiment in textual data relevant to organisational culture and environment. Here is a brief research design sketch:

1. Data Collection:

2. Filtering:

3. Expanding Verb-forms: Suppose we filtered “promote personal growth” from the previous step. We consider variations of “promote” and expand n-grams by adding “encourage personal growth”, “advance personal growth”, “assist personal growth”, “aid personal growth”, and so on.

4. Labelling:
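Step 3 above could be sketched as follows. The synonym map is a hand-written placeholder; an actual implementation might draw verb variants from a thesaurus or an embedding model instead:

```python
# Sketch of step 3: expanding n-grams by substituting verb variants.
# VERB_SYNONYMS is an illustrative, hand-written placeholder, not a real resource.
VERB_SYNONYMS = {
    "promote": ["encourage", "advance", "assist", "aid"],
}

def expand_verb_forms(ngram):
    """Return the original n-gram plus variants with the leading verb swapped."""
    head, _, rest = ngram.partition(" ")
    variants = [ngram]
    for synonym in VERB_SYNONYMS.get(head, []):
        variants.append(f"{synonym} {rest}" if rest else synonym)
    return variants

expand_verb_forms("promote personal growth")
# ['promote personal growth', 'encourage personal growth',
#  'advance personal growth', 'assist personal growth', 'aid personal growth']
```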

ychoi08 commented 11 months ago

More on labelling:

nick-sh-oh commented 10 months ago

Glassdoor

Analysed 5,259,571 Glassdoor reviews from 540 organisations (2,629,790 positive and 2,629,781 negative). Extracted n-grams and kept those that occurred at least 1,000 times, using Python:

import re
from collections import Counter

def generate_ngrams(sentences, n):
    """Return all word-level n-grams (as space-joined strings) across sentences."""
    ngram_tuples = []
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            ngram_tuples.append(tuple(words[i:i + n]))

    # Join each n-gram tuple back into a space-separated string
    return [' '.join(tokens) for tokens in ngram_tuples]

# Matches punctuation, symbols and digits; n-grams containing any are discarded
special_char_pattern = re.compile(r"[!@#$%^&*()_+{}\[\]:;<>,.?~\\|'`0-9]")

def process(ngrams):
    # Count the occurrences of n-grams
    cnt = Counter(ngrams)

    # Create a new list to store processed n-grams
    processed_ngrams = []

    # Iterate over the n-grams
    for ngram, count in cnt.items():
        if count >= 1000:  # Keep n-grams that occur at least 1,000 times
            if not special_char_pattern.search(ngram):  # Skip n-grams with special characters or digits
                processed_ngrams.append(ngram)  # Add to the processed list

    return processed_ngrams

As a result, there were 4,645 n-grams extracted from the positive reviews (3,146 bigrams, 1,187 trigrams and 312 four-grams), and 6,957 n-grams extracted from the negative reviews (5,075 bigrams, 1,631 trigrams and 251 four-grams). This provided a starting point of frequently mentioned n-grams across a large sample of employee reviews. From this filtered list, @ychoi08 is manually inspecting and categorising n-grams that provide meaningful sentiment across the reviews.

nick-sh-oh commented 10 months ago

A few thoughts on Measuring Corporate Culture Using Machine Learning

The authors constructed a 'relatively exhaustive culture dictionary' of words and phrases that appear in close association with the five values most often mentioned by S&P 500 firms on their corporate websites (Li et al., 2021): Innovation, Integrity, Quality, Respect, and Teamwork.

As @ychoi08 pointed out, we have more to add. While Li et al. developed an innovative culture dictionary based on corporate websites and earnings calls, we aim to go significantly wider and deeper in capturing organisational culture. First, rather than focusing on just the five most common values mentioned on websites, we will incorporate a broader, more comprehensive set of cultural values for greater coverage, starting from the Big 9 Cultural Values identified by Culture 500.

Second, we will create rich sub-themes under each value, fleshing out their meaning at a more granular level. For example, "innovation" can be expanded into concepts like creativity, vision, adaptation, and so on. This nested hierarchy will provide precision and nuance beyond surface-level values.
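The nested hierarchy could be represented as a simple mapping from value to sub-themes. The entries below are illustrative only, not the finalised taxonomy:

```python
# Illustrative sketch of the value -> sub-theme hierarchy described above.
# Sub-themes beyond the "innovation" example are placeholders.
CULTURE_HIERARCHY = {
    "innovation": ["creativity", "vision", "adaptation"],
    "teamwork": ["collaboration", "communication"],  # placeholder sub-themes
}

def sub_themes(value):
    """Return the sub-themes nested under a top-level cultural value."""
    return CULTURE_HIERARCHY.get(value, [])
```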

Finally, our lexicon will draw from Glassdoor reviews, going straight to the source: employees themselves. This shift in vantage point from corporate messaging to individual experiences should surface more organic and unfiltered language describing real organisational culture. The words and phrases that emerge directly from employees are likely to be more meaningful, honest, and indicative of true workplace culture.

A quick comparison of the construction methods

Data:

- Li et al. (2021): 209,480 earnings calls from Thomson Reuters' StreetEvents database over the period 2001–2018, from 7,501 unique firms
- sentibank: 5,259,571 employee reviews from Glassdoor (2,629,790 positive and 2,629,781 negative) over the period 2018–2022, from 540 unique firms

Extraction method, Li et al. (2021):

1. After lemmatising the earnings calls, used the dependency parser from the Stanford CoreNLP library to identify multiword expressions and compound words. Then used the phraser of the gensim library to find statistically significant bi- and tri-grams (e.g. "forward-looking statement" and "beat (a) dead horse").

2. Trained a Word2Vec model using the gensim library, converting each of the 764,276 words/phrases in the corpus into a 300-dimensional vector that represents the meaning of that word or phrase.

3. From the trained Word2Vec model, retrieved the word vectors for the five most commonly mentioned corporate values on S&P 500 websites - innovation, integrity, quality, respect, and teamwork - and their associated seed words from Guiso, Sapienza and Zingales (2015).

4. Manually inspected the seed words, originally compiled from corporate websites (Guiso, Sapienza and Zingales, 2015), to ensure they transfer appropriately to the different genre (earnings calls) while remaining within the domain of corporate culture (see Li et al., 2021, p. 3275 on how the authors excluded and added words).

5. For each of the five corporate values, computed the average of the seed word vectors, then calculated the cosine similarity between these averaged vectors and every unique word in the earnings calls. The top 500 most similar words per value were selected and further manually screened to exclude non-fitting terms.

Extraction method, sentibank: TBD
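The final step of Li et al.'s pipeline (averaging seed vectors, then ranking the vocabulary by cosine similarity) can be sketched with plain NumPy. The 3-dimensional vectors and two-word vocabulary below are toy stand-ins for the real 300-dimensional Word2Vec embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_value(seed_vectors, vocab_vectors, top_k=500):
    """Average the seed vectors for one value, then rank every vocabulary
    word by cosine similarity to that averaged vector."""
    value_vector = np.mean(seed_vectors, axis=0)
    scored = [(word, cosine_similarity(value_vector, vec))
              for word, vec in vocab_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy stand-ins for trained embeddings (illustrative values only).
seeds = [np.array([1.0, 0.0, 0.0]), np.array([0.8, 0.2, 0.0])]
vocab = {
    "creativity": np.array([0.9, 0.1, 0.0]),
    "overhead":   np.array([0.0, 0.0, 1.0]),
}
ranked = rank_by_value(seeds, vocab, top_k=2)
# "creativity" ranks first: it aligns with the averaged seed vector
```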

Embedding Models?

Li et al. (2021) commented that the Word2Vec model provides an 'effective way to quantify the semantics, rather than merely the syntactic, at the expression level'. But a fair criticism is that Word2Vec has limitations in capturing true semantic meaning. Li et al. (2021) defined "neighbour words" based on a narrow context window of five words (p. 3274). The five-word context window and frequency threshold illustrate how Word2Vec captures statistical co-occurrence patterns within a small context rather than deeper natural language understanding, which limits it as a full semantic model. Further, Word2Vec cannot disambiguate homonyms or polysemous words with multiple context-dependent meanings, since it assigns a single vector to each word regardless of context.
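The polysemy point can be made concrete with a toy comparison. A static lookup table (the Word2Vec setting) returns the same vector for "bank" in both contexts, whereas a context-sensitive representation, here crudely approximated by averaging in the context words' vectors (real contextual models do far more), distinguishes the two uses. All vectors are illustrative placeholders:

```python
import numpy as np

# Toy static embeddings (illustrative placeholders, not trained vectors).
STATIC = {
    "bank":    np.array([0.5, 0.5]),
    "river":   np.array([1.0, 0.0]),
    "savings": np.array([0.0, 1.0]),
}

def static_vector(word, context):
    """Word2Vec-style lookup: the context is ignored entirely."""
    return STATIC[word]

def contextual_vector(word, context):
    """Crude context-sensitive stand-in: average the word's vector with
    its context words' vectors."""
    vectors = [STATIC[word]] + [STATIC[w] for w in context if w in STATIC]
    return np.mean(vectors, axis=0)

same_static = np.array_equal(static_vector("bank", ["river"]),
                             static_vector("bank", ["savings"]))
same_contextual = np.array_equal(contextual_vector("bank", ["river"]),
                                 contextual_vector("bank", ["savings"]))
# same_static is True; same_contextual is False
```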

We can perhaps use contextual embedding models, such as BERT, to identify words associated with certain values (and sub-values).