shantanu-555 / The-Office-Script-Analysis

Script Analysis of 'The Office'. Sentiment Analysis, Topic Modelling and Dialogue Generator.

Project update 1 #1

Open elinewesterbeek opened 1 year ago

elinewesterbeek commented 1 year ago

Project update 1

In the last few weeks, we have made quite a lot of progress. Here is what we have done so far for each of the goals we set in project update 0:

Initial analysis of the data and visualizations

We started off by importing the dataset into a pandas DataFrame and dropping the 'scene' column. We then sorted the characters by the number of lines of dialogue spoken and picked the top 20 for our basic analytics. Among those, Michael spoke the most lines (as expected) at 12,145, while Creed spoke 456.
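A minimal sketch of this step (the file name and the 'speaker' column name are assumptions about the dataset schema):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the script dataset and drop the 'scene' column (file/column names are assumed).
df = pd.read_csv("the_office_lines.csv").drop(columns=["scene"])

# Count lines of dialogue per character and keep the 20 most talkative.
top_20 = df["speaker"].value_counts().head(20)
top_20.plot(kind="bar", title="Lines of dialogue per character (top 20)")
plt.tight_layout()
plt.show()
```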

*(figure: dialogues_spoken, lines of dialogue per character)*

Next, we defined a function that takes a character and plots a word cloud for them, built from their most frequent words after filtering out stopwords and common English words. We used the wordcloud library (https://pypi.org/project/wordcloud/) for this.
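A rough sketch of such a function, assuming the df and column names from the sketch above:

```python
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def plot_character_wordcloud(df, character):
    # Join every line spoken by the character ('speaker'/'line_text' are assumed column names).
    text = " ".join(df.loc[df["speaker"] == character, "line_text"].astype(str))
    # STOPWORDS is the wordcloud library's built-in list of common English words.
    wc = WordCloud(stopwords=STOPWORDS, background_color="white",
                   width=800, height=400).generate(text)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Most frequent words spoken by {character}")
    plt.show()

plot_character_wordcloud(df, "Dwight")
```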

Here are the most frequently spoken words by Dwight: *(figure: dwight_wordcloud)*

Fun fact: Dwight might be the only character who says his own name this often.

Annotating dataset by sentiment

To test the accuracy of our sentiment analysis, we each manually annotated 100 lines of dialogue for sentiment (either -1, 0 or 1). For this, we made 3 samples of 100 lines each, and each of us annotated one sample. The code that was used to create the samples is in this notebook. We saved the samples to CSV and then annotated them using a short function that we wrote in this notebook. The annotated data was then saved in 3 different CSV files, which can be found in the annotated_data folder.
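A minimal sketch of how such samples could be drawn (the seed and output file names are illustrative, not the exact code in the notebook):

```python
# Draw 300 distinct lines and split them into three samples of 100, one per annotator.
samples = df.sample(n=300, random_state=42).reset_index(drop=True)
for i in range(3):
    chunk = samples.iloc[i * 100:(i + 1) * 100].copy()
    chunk["sentiment"] = ""  # to be filled in by hand with -1, 0 or 1
    chunk.to_csv(f"annotated_data/sample_{i + 1}.csv", index=False)
```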

Preprocessing/sentiment analysis

To find the optimal preprocessing pipeline and model for sentiment analysis, we tried out different combinations of methods and compared their accuracy on the annotated data. The code for this can be found in this notebook. The models we tested are listed in the results table below.

The different preprocessing pipelines we tried mostly deal with the descriptions in the lines, like [to camera] or [laughing]. From the testing, we found that simply removing them resulted in the best accuracy. The detailed results can be found in the notebook, but they are summarized here:

| Model | Preprocessing | Accuracy | Precision | MSE |
| --- | --- | --- | --- | --- |
| RoBERTa large | Remove descriptions | 0.67 | 0.69 | 1.29 |
| RoBERTa large | Keep descriptions | 0.65 | 0.67 | 1.36 |
| BERT uncased | Remove descriptions | 0.63 | 0.63 | 0.46 |
| BERT uncased | Keep descriptions | 0.62 | 0.63 | 0.47 |
| BERT | Remove descriptions | 0.54 | 0.54 | 0.65 |
| BERT | Keep descriptions | 0.55 | 0.55 | 0.66 |
| DistilBERT | Remove descriptions | 0.41 | 0.27 | 0.89 |
| DistilBERT | Keep descriptions | 0.42 | 0.28 | 0.85 |

We find that RoBERTa large and BERT uncased perform best, with RoBERTa large being the more accurate of the two. However, BERT uncased has a much lower MSE, meaning that its predictions are, on average, closer to the annotated labels. For now we have used BERT uncased for our sentiment analysis, also because it is faster to train and use. Finally, we applied the model to the complete dataset. The full dataset, annotated for sentiment, can be found here.
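As an illustration of this setup, here is a sketch of the "remove descriptions" preprocessing and of applying a Hugging Face sentiment model to every line. The checkpoint shown (a three-class RoBERTa sentiment model) is only a stand-in, since the exact fine-tuned BERT-uncased checkpoint is not named in this update:

```python
import re
from transformers import pipeline

# Strip stage directions such as "[to camera]" or "[laughing]" from a line.
def remove_descriptions(line: str) -> str:
    return re.sub(r"\[[^\]]*\]", "", line).strip()

# Illustrative checkpoint with negative/neutral/positive outputs (LABEL_0/1/2);
# swap in the model actually used for the project.
sentiment = pipeline("sentiment-analysis",
                     model="cardiffnlp/twitter-roberta-base-sentiment")
label_to_score = {"LABEL_0": -1, "LABEL_1": 0, "LABEL_2": 1}

df["clean_line"] = df["line_text"].astype(str).apply(remove_descriptions)
preds = sentiment(df["clean_line"].tolist(), truncation=True)
df["sentiment"] = [label_to_score[p["label"]] for p in preds]
```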

Visualizations

Topic modelling:

link to my work

First, I explored and implemented a variety of preprocessing functions. I created multiple topic models, identified areas of improvement in the preprocessing, and adapted as needed. These preprocessing techniques include lemmatization, removing stopwords and the most common words, removing character names, removing punctuation, etc. I also implemented grouping the text by scene and by episode. Since the average line of dialogue is so short, I have been doing topic modelling by scene, but for the next update it could be interesting to explore grouping by episode as well.
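A sketch of what such a preprocessing and grouping step could look like (using spaCy; the grouping columns and the list of character names are illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
CHARACTER_NAMES = {"michael", "dwight", "jim", "pam", "andy"}  # illustrative subset

def preprocess(text: str) -> list[str]:
    # Lemmatize and drop stopwords, punctuation, numbers and character names.
    doc = nlp(text.lower())
    return [tok.lemma_ for tok in doc
            if tok.is_alpha and not tok.is_stop and tok.lemma_ not in CHARACTER_NAMES]

# Group lines by scene so each "document" is long enough for topic modelling
# (the grouping columns are assumed; grouping by episode works the same way).
scene_docs = (df.groupby(["season", "episode", "scene"])["line_text"]
                .apply(" ".join)
                .map(preprocess))
```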

For the actual topic modelling, I implemented two models: NMF and LDA. For NMF, I created functions which calculate and plot the coherence value and reconstruction error for a range of topic counts, so that I could identify the ideal choice. For LDA, a similar function does the same for coherence, log-likelihood and perplexity.

For NMF: *(figure: coherence and reconstruction error per number of topics)*

For LDA: *(figure: coherence, log-likelihood and perplexity per number of topics)*
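A rough sketch of the NMF part of that selection loop with scikit-learn (coherence would be computed separately, for example with gensim's CoherenceModel):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# scene_docs is the grouped, preprocessed text from the sketch above.
texts = [" ".join(tokens) for tokens in scene_docs]
tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
X = tfidf.fit_transform(texts)

# Fit NMF for a range of topic counts and record the reconstruction error.
errors = {}
for n_topics in range(5, 31, 5):
    nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0)
    nmf.fit(X)
    errors[n_topics] = nmf.reconstruction_err_
print(errors)
```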

So far, NMF seems to produce more coherent topics, and I have started to see some interesting clusters.

Then, I also wanted to try zero-shot topic classification, a technique where you supply your own topics/categories to the model and it classifies the text without any labelled data. I think this could be a really interesting way to evaluate whether the labels we give to the LDA/NMF topics correspond well (i.e. we could check whether the zero-shot label matches the LDA/NMF cluster label). I tried this out with some categories I came up with (not based on LDA/NMF for now) and plotted word clouds for the different topics. For instance, for the topic 'Business and management' this was the word cloud:

*(figure: word cloud for the zero-shot topic 'Business and management')*
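A minimal sketch of the zero-shot setup with the transformers library (the candidate topics below are illustrative, not the final label set):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
candidate_topics = ["business and management", "relationships", "parties and events"]

# Classify a single (made-up) line against the candidate topics.
result = classifier("We need to finalize the quarterly sales report before the meeting.",
                    candidate_labels=candidate_topics)
print(result["labels"][0], result["scores"][0])  # highest-scoring topic and its score
```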

Goals for project update 2:

Questions for Jelke:

bloemj commented 1 year ago

Cool, you have already done a lot!

The accuracy seems a bit low to me, but it could be an effect of the different domain. The best way to find out is to see what scores these models have achieved on other datasets!

I would say MSE is less important than precision/recall (F-score), but it is still useful to have. So it is probably better to go for the model with the best F-score. The higher MSE could also be due to a larger vocabulary for RoBERTa large.