The general purpose of this work is to provide foundational code and workbooks to build machine learning model(s) to predict user stance from a YouTube comment on controversial topics (multiclass text classification: neutral, positive, or negative). While the focus for my work was abortion stance detection, the code provided here can easily be altered to build and train similar models for different controversial topics.
This repo includes scripts that scrape the YouTube API for comment data from a particular YouTube video and gather user subscriber info (if desired and accessible), the datasets used to train and test the abortion stance detection models, and links to the Google Colab notebooks used to build, train, and test the models.
This section briefly describes the purpose of each of the numbered Python scripts.
1-gatherYoutubeData.py
This script scrapes the YouTube API for commentThread data from a particular YouTube videoId. A new Excel spreadsheet is generated with all attributes, including (but not limited to) commentTextDisplay (the actual comment text) and userId. You will need to visit the Google Developer Console and sign up for a developer account (your Gmail account works); this authorizes you to use the YouTube API. Most importantly, you will need a YouTube API key, so follow the instructions on the Google Developer Console site to obtain one. Once authorized, follow the comments in this code to store and access your API key via the .env file (this helps keep the key out of code uploaded to GitHub, etc.). Identifying the videoId: for https://www.youtube.com/watch?v=dQw4w9WgXcQ, the videoId is dQw4w9WgXcQ. Review the comments within the code for the variable that holds this videoId and change it accordingly. Finally, you are ready to run the script.
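The core request boils down to a single commentThreads().list call. Below is a minimal sketch (not the exact script), assuming the google-api-python-client and python-dotenv packages and a .env variable named YOUTUBE_API_KEY; your variable names may differ.

```python
# Minimal sketch of the commentThreads request; key name and videoId are illustrative.
import os

from dotenv import load_dotenv
from googleapiclient.discovery import build

load_dotenv()
api_key = os.getenv("YOUTUBE_API_KEY")  # assumed name of the variable in your .env file

youtube = build("youtube", "v3", developerKey=api_key)

# Request the first page of top-level comments for the example videoId.
response = youtube.commentThreads().list(
    part="snippet",
    videoId="dQw4w9WgXcQ",
    maxResults=100,
    textFormat="plainText",
).execute()

for item in response["items"]:
    snippet = item["snippet"]["topLevelComment"]["snippet"]
    print(snippet["authorChannelId"]["value"], snippet["textDisplay"])
```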
2-findWhoSubscribedTo.py
This script acquires the subscriber info for a particular user via their userId, provided the user is subscribed to any channels and has made that information accessible through the API. The original intent of this code was to gather an additional feature for the SVM model, but after running it, it became clear that subscriber info was not an appropriate feature: many users restrict scraping of this info through the API, so you will be lucky to get it for a third of all users. I would not recommend incorporating subscriber info into your model as an additional feature, but the code is here if you choose to do so.
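The lookup itself is a subscriptions().list call keyed on the user's channelId. A minimal sketch, assuming the same youtube client object as in 1-gatherYoutubeData.py (function and variable names here are illustrative, not the script's):

```python
# Minimal sketch: fetch the channels a user subscribes to, if the info is public.
from googleapiclient.errors import HttpError

def get_subscriptions(youtube, channel_id):
    """Return titles of channels the user subscribes to, or [] if hidden."""
    try:
        response = youtube.subscriptions().list(
            part="snippet",
            channelId=channel_id,
            maxResults=50,
        ).execute()
    except HttpError:
        # Most users keep their subscriptions private, so expect many failures.
        return []
    return [item["snippet"]["title"] for item in response["items"]]
```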
3-NLPAug_Balance_Dataset.py
Note: this script is available as a notebook on Google Colab (see NLPAug_Balance_Dataset.ipynb in this repo). This script uses nlpaug to balance an unbalanced dataset. Unlike other oversampling techniques, nlpaug generates synthetic samples while respecting the contextual placement of words within the sentence, which is extremely important for transformer models. You can run this script on your own machine if you have a compatible GPU; otherwise, run it in the Google Colab notebook under a GPU runtime. Without a GPU it will take 10x (or more) longer.
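The augmentation step amounts to a contextual word-substitution augmenter. A minimal sketch, assuming a CUDA GPU is available; the masked-LM model and number of generated variants are illustrative, not necessarily what the script uses.

```python
# Minimal sketch of contextual augmentation with nlpaug.
import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased",  # any HuggingFace masked-LM works here
    action="substitute",             # replace words with contextually similar ones
    device="cuda",                   # set to "cpu" if no GPU (much slower)
)

comment = "I think this law goes too far and ignores the people it affects."
synthetic = aug.augment(comment, n=3)  # generate three augmented variants
print(synthetic)
```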
4-prepareDatasetForBiLSTMModel.py
This script is intended for use with the BiLSTM model only. The BiLSTM model was developed following the tutorials on the TensorFlow website, which dictate a strict dataset format of directories and .txt files. This script generates the directory structure required by that model.
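For orientation, the layout follows the class-per-folder convention that tf.keras.utils.text_dataset_from_directory expects. A minimal sketch, with the folder and label names shown here being illustrative rather than the exact names the script writes:

```python
# Assumed layout produced by 4-prepareDatasetForBiLSTMModel.py:
#
# dataset/
#   train/
#     negative/0.txt, 1.txt, ...
#     neutral/...
#     positive/...
#   test/
#     negative/..., neutral/..., positive/...
import tensorflow as tf

# The notebook can then load the comments with labels inferred from folder names.
train_ds = tf.keras.utils.text_dataset_from_directory(
    "dataset/train",
    batch_size=32,
    validation_split=0.2,
    subset="training",
    seed=42,
)
```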
5-clearDatasetForDeepLearning.py
This script is for use with 4-prepareDatasetForBiLSTMModel.py. It clears and removes the folder generated by 4-prepareDatasetForBiLSTMModel.py so the dataset can be regenerated quickly after changes (e.g. if you want to change the train_test_split percentage).
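The cleanup essentially boils down to removing the generated tree; a minimal sketch, assuming the illustrative folder name from above:

```python
# Remove the generated dataset tree so it can be rebuilt with new settings.
import shutil

shutil.rmtree("dataset", ignore_errors=True)
```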
Descriptions of use for SVM, BiLSTM, and transformer models.
SVM
This model is straightforward and simply requires a .csv file containing your dataset. Upload the dataset to the Google Colab notebook and execute.
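The core of an SVM text classifier on such a CSV can be sketched as below; the filename and the "comment"/"stance" column names are assumptions, so adjust them to match your dataset (the Colab notebook may differ in details).

```python
# Minimal sketch of a TF-IDF + linear SVM stance classifier.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("abortion_stance.csv")  # hypothetical filename
X_train, X_test, y_train, y_test = train_test_split(
    df["comment"], df["stance"], test_size=0.2, random_state=42
)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```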
BiLSTM
This model follows the TensorFlow/Keras tutorials found at https://www.tensorflow.org/tutorials/keras/text_classification and https://www.tensorflow.org/text/tutorials/text_classification_rnn. Thus, it relies on 4-prepareDatasetForBiLSTMModel.py and 5-clearDatasetForDeepLearning.py to prepare and structure the dataset as outlined in the tutorials. Feel free to alter/add/delete layers in the model to see if it improves. After running 4-prepareDatasetForBiLSTMModel.py, locate the folder containing your structured dataset, zip it, upload the zipped folder to the Google Colab notebook, and execute.
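In the spirit of those tutorials, the model is roughly the following; layer sizes are illustrative, not the exact notebook, and train_ds is assumed to have been built with tf.keras.utils.text_dataset_from_directory as in the sketch under 4-prepareDatasetForBiLSTMModel.py.

```python
# Minimal sketch of a BiLSTM text classifier for three stance classes.
import tensorflow as tf

vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000, output_sequence_length=200)
vectorizer.adapt(train_ds.map(lambda text, label: text))  # learn the vocabulary from the comments

model = tf.keras.Sequential([
    vectorizer,
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3),  # negative, neutral, positive
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="adam",
    metrics=["accuracy"],
)
model.fit(train_ds, epochs=10)
```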
Transformer Models
Before jumping into the Google Colab notebook, it may be helpful to watch https://www.youtube.com/watch?v=7kLi8u2dJz0 for a basic introduction to BERT and/or https://www.youtube.com/watch?v=OyFJWRnt_AY for a more formal discussion of transformer networks and attention layers. This model utilizes the transformers library, which makes it easy to fine-tune pre-trained models. The steps below outline how to find a pre-trained model and use it in your model:
- Visit the HuggingFace hub at https://huggingface.co/ and select the models tab. This interface provides many pre-trained models available for use.
- Select one of these models. Next to the model's name is an icon that copies the name to your clipboard (click it).
- Open the Google Colab notebook and find the line of code: tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base'). This is where you paste the name copied to the clipboard. Note: for my model, leave this as 'microsoft/deberta-base' (I did not include the tokenizer in my model card).
- Find the line of code: model = TFAutoModelForSequenceClassification.from_pretrained("HereBeCode/deberta-fine-tuned-abortion-stance-detect", num_labels=3) and paste the copied model name here as well.
- That's it, you are ready to run the code. The key lines are sketched below.
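For reference, loading the default tokenizer/model named above and classifying a single comment looks roughly like this; the example comment and the label order (negative/neutral/positive) are assumptions, not taken from the notebook.

```python
# Minimal sketch of loading the fine-tuned model and predicting a stance class.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "HereBeCode/deberta-fine-tuned-abortion-stance-detect", num_labels=3
)

inputs = tokenizer("I completely support this decision.", return_tensors="tf")
logits = model(**inputs).logits
pred = int(tf.argmax(logits, axis=-1)[0])
print(pred)  # index of the predicted stance class (label order assumed)
```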