shantanu-555 / The-Office-Script-Analysis

Script Analysis of 'The Office'. Sentiment Analysis, Topic Modelling and Dialogue Generator.

Project update 1 #1

Open elinewesterbeek opened 1 year ago

elinewesterbeek commented 1 year ago

Project update 1

In the last few weeks, we have made quite a lot of progress. Here is what we have done so far for each of the goals we set in project update 0:

Initial analysis of the data and visualizations

We started off by importing the dataset into a pandas DataFrame and dropping the 'scene' column. We then sorted the characters by the number of lines of dialogue spoken and picked the top 20 for our basic analytics. Among those, Michael spoke the most lines (as expected) at 12,145, while Creed spoke 456.
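A minimal sketch of this step (the file name and the 'speaker' column name are assumptions about the dataset schema):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the script dataset and drop the 'scene' column (file/column names are assumed).
df = pd.read_csv("the_office_lines.csv").drop(columns=["scene"])

# Count lines of dialogue per character and keep the 20 most talkative.
top_20 = df["speaker"].value_counts().head(20)
top_20.plot(kind="bar", title="Lines of dialogue per character (top 20)")
plt.tight_layout()
plt.show()
```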

*(figure: dialogues_spoken, lines of dialogue per character)*

Next, we defined a function that takes a character and plots a word cloud for them, built from their most frequent words after filtering out stopwords and common English words. We used the wordcloud library (https://pypi.org/project/wordcloud/) for this.
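A rough sketch of such a function, assuming the df and column names from the sketch above:

```python
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def plot_character_wordcloud(df, character):
    # Join every line spoken by the character ('speaker'/'line_text' are assumed column names).
    text = " ".join(df.loc[df["speaker"] == character, "line_text"].astype(str))
    # STOPWORDS is the wordcloud library's built-in list of common English words.
    wc = WordCloud(stopwords=STOPWORDS, background_color="white",
                   width=800, height=400).generate(text)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Most frequent words spoken by {character}")
    plt.show()

plot_character_wordcloud(df, "Dwight")
```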

Here are the most frequently spoken words by Dwight: *(figure: dwight_wordcloud)*

Fun fact: Dwight might be the only character who says his own name this often.

Annotating dataset by sentiment

To test the accuracy of our sentiment analysis, we each manually annotated 100 lines of dialogue for sentiment (either -1, 0 or 1). For this, we made 3 samples of 100 lines each, and each of us annotated one sample. The code that was used to create the samples is in this notebook. We saved the samples to CSV and then annotated them using a short function that we wrote in this notebook. The annotated data was then saved in 3 different CSV files, which can be found in the annotated_data folder.
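A minimal sketch of how such samples could be drawn (the seed and output file names are illustrative, not the exact code in the notebook):

```python
# Draw 300 distinct lines and split them into three samples of 100, one per annotator.
samples = df.sample(n=300, random_state=42).reset_index(drop=True)
for i in range(3):
    chunk = samples.iloc[i * 100:(i + 1) * 100].copy()
    chunk["sentiment"] = ""  # to be filled in by hand with -1, 0 or 1
    chunk.to_csv(f"annotated_data/sample_{i + 1}.csv", index=False)
```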

Preprocessing/sentiment analysis

To find the optimal preprocessing pipeline and model for sentiment analysis, we tried out different combinations of methods and compared their accuracy on the annotated data. The code for this can be found in this notebook. The models we tested are listed in the results table below.

The different preprocessing pipelines we tried mostly deal with the descriptions in the lines, like [to camera] or [laughing]. From the testing, we found that simply removing them resulted in the best accuracy. The detailed results can be found in the notebook, but they are summarized here:

| Model | Preprocessing | Accuracy | Precision | MSE |
| --- | --- | --- | --- | --- |
| RoBERTa large | Remove descriptions | 0.67 | 0.69 | 1.29 |
| RoBERTa large | Keep descriptions | 0.65 | 0.67 | 1.36 |
| BERT uncased | Remove descriptions | 0.63 | 0.63 | 0.46 |
| BERT uncased | Keep descriptions | 0.62 | 0.63 | 0.47 |
| BERT | Remove descriptions | 0.54 | 0.54 | 0.65 |
| BERT | Keep descriptions | 0.55 | 0.55 | 0.66 |
| DistilBERT | Remove descriptions | 0.41 | 0.27 | 0.89 |
| DistilBERT | Keep descriptions | 0.42 | 0.28 | 0.85 |

We find that RoBERTa large and BERT uncased perform best, with RoBERTa large being the more accurate of the two. However, BERT uncased has a much lower MSE, meaning that its predictions are, on average, closer to the annotated labels. For now we have used BERT uncased for our sentiment analysis, also because it is faster to train and use. Finally, we applied the model to the complete dataset. The full dataset, annotated for sentiment, can be found here.
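As an illustration of this setup, here is a sketch of the "remove descriptions" preprocessing and of applying a Hugging Face sentiment model to every line. The checkpoint shown (a three-class RoBERTa sentiment model) is only a stand-in, since the exact fine-tuned BERT-uncased checkpoint is not named in this update:

```python
import re
from transformers import pipeline

# Strip stage directions such as "[to camera]" or "[laughing]" from a line.
def remove_descriptions(line: str) -> str:
    return re.sub(r"\[[^\]]*\]", "", line).strip()

# Illustrative checkpoint with negative/neutral/positive outputs (LABEL_0/1/2);
# swap in the model actually used for the project.
sentiment = pipeline("sentiment-analysis",
                     model="cardiffnlp/twitter-roberta-base-sentiment")
label_to_score = {"LABEL_0": -1, "LABEL_1": 0, "LABEL_2": 1}

df["clean_line"] = df["line_text"].astype(str).apply(remove_descriptions)
preds = sentiment(df["clean_line"].tolist(), truncation=True)
df["sentiment"] = [label_to_score[p["label"]] for p in preds]
```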

Visualizations

Topic modelling:

link to my work

First, I explored and implemented a variety of preprocessing functions. I created multiple topic models, identified areas of improvement in the preprocessing, and adapted as needed. These preprocessing techniques include lemmatization, removing stopwords and the most common words, removing character names, removing punctuation, etc. I also implemented grouping the text by scene and by episode. Since the average line of dialogue is so short, I have been doing topic modelling by scene, but for the next update it could be interesting to explore grouping by episode as well.
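A sketch of what such a preprocessing and grouping step could look like (using spaCy; the grouping columns and the list of character names are illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
CHARACTER_NAMES = {"michael", "dwight", "jim", "pam", "andy"}  # illustrative subset

def preprocess(text: str) -> list[str]:
    # Lemmatize and drop stopwords, punctuation, numbers and character names.
    doc = nlp(text.lower())
    return [tok.lemma_ for tok in doc
            if tok.is_alpha and not tok.is_stop and tok.lemma_ not in CHARACTER_NAMES]

# Group lines by scene so each "document" is long enough for topic modelling
# (the grouping columns are assumed; grouping by episode works the same way).
scene_docs = (df.groupby(["season", "episode", "scene"])["line_text"]
                .apply(" ".join)
                .map(preprocess))
```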

For the actual topic modelling, I implemented two models: NMF and LDA. For NMF, I created functions which calculate and plot the coherence value and reconstruction error for a range of topic counts, so that I could identify the ideal choice. For LDA, a similar function does the same for coherence, log-likelihood and perplexity.

For NMF: *(figure: coherence and reconstruction error per number of topics)*

For LDA: *(figure: coherence, log-likelihood and perplexity per number of topics)*
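A rough sketch of the NMF part of that selection loop with scikit-learn (coherence would be computed separately, for example with gensim's CoherenceModel):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# scene_docs is the grouped, preprocessed text from the sketch above.
texts = [" ".join(tokens) for tokens in scene_docs]
tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
X = tfidf.fit_transform(texts)

# Fit NMF for a range of topic counts and record the reconstruction error.
errors = {}
for n_topics in range(5, 31, 5):
    nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0)
    nmf.fit(X)
    errors[n_topics] = nmf.reconstruction_err_
print(errors)
```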

So far, NMF seems to produce more coherent topics, and I have started to see some interesting clusters.

Then, I also wanted to try zero-shot topic classification, a technique where you supply your own topics/categories to the model and it classifies the text without any labelled data. I think this could be a really interesting way to evaluate whether the labels we give to the LDA/NMF topics correspond well (i.e. we could check whether the zero-shot label matches the LDA/NMF cluster label). I tried this out with some categories I came up with (not based on LDA/NMF for now) and plotted word clouds for the different topics. For instance, for the topic 'Business and management' this was the word cloud:

*(figure: word cloud for the zero-shot topic 'Business and management')*
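A minimal sketch of the zero-shot setup with the transformers library (the candidate topics below are illustrative, not the final label set):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
candidate_topics = ["business and management", "relationships", "parties and events"]

# Classify a single (made-up) line against the candidate topics.
result = classifier("We need to finalize the quarterly sales report before the meeting.",
                    candidate_labels=candidate_topics)
print(result["labels"][0], result["scores"][0])  # highest-scoring topic and its score
```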

Goals for project update 2:

Questions for Jelke:

bloemj commented 1 year ago

Cool, you have already done a lot!

The accuracy seems a bit low to me, but it could be an effect of the different domain. The best way to find out is to see what scores these models have achieved on other datasets!

I would say MSE is less important than precision/recall (F-score), but it is still useful to have. So it is probably better to go for the model with the best F-score. The higher MSE could also be due to a larger vocabulary for RoBERTa large.