g2des closed 4 years ago
Attempted a linear regression model on the new data set using the following metadata features: follower_count, title_len, content_len, day of week, is weekend, title_polarity, title_subjectivity, content_polarity, content_subjectivity, media outlet.
This yielded the following results:
Mean Absolute Error: 125.01026896269398
Mean Squared Error: 232115.21050582628
Root Mean Squared Error: 481.78336470433084
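The RMSE is nearly four times the MAE, so a small number of very large errors dominate. Linear regression appears to be making conservative estimates, most likely under-predicting the articles with very high retweet counts.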
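As a reference for how the three error figures relate, here is a minimal sketch in plain Python. The retweet counts and predictions are hypothetical and are not taken from the project's data; the function name `regression_errors` is also just illustrative. It shows how a single badly missed high-retweet article can push RMSE far above MAE.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    mse = sum(r * r for r in residuals) / len(residuals)
    return mae, mse, math.sqrt(mse)

# Hypothetical data: four ordinary articles and one viral article that the
# model badly under-predicts.
y_true = [10, 20, 30, 40, 5000]
y_pred = [15, 25, 25, 45, 300]

mae, mse, rmse = regression_errors(y_true, y_pred)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
```

Because the residuals are squared before averaging, the one large miss dominates MSE and RMSE while MAE stays comparatively low. The reported figures are internally consistent in the same way: the square root of 232115.21 is about 481.78.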
TODO:
1.) See what features are actually good predictors of max_retweets (PCA)
2.) Turn problem into multi-class classification problem using ranges of tweets
3.) Remove US News
4.) Eventually incorporate textual features from content and title
Created multi-class classification problem using 4 different class values (based on quantiles).
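A minimal sketch of quantile-based labelling, assuming the target is a list of max_retweets values. The sample values and the helper name `quantile_labels` are hypothetical; the thread does not show how the quantile cut points were actually computed.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

def quantile_labels(values, n_classes=4):
    """Assign each value a class index 0..n_classes-1 using quantile cut points."""
    # quantiles() returns n_classes - 1 cut points (three for quartiles).
    cuts = quantiles(values, n=n_classes, method="inclusive")
    # A value equal to a cut point is placed in the lower class.
    labels = [bisect_left(cuts, v) for v in values]
    return labels, cuts

# Hypothetical max_retweets values.
max_retweets = [0, 3, 12, 40, 95, 300, 1200, 9000]

labels, cuts = quantile_labels(max_retweets)
print("cut points:", cuts)
print("labels:", labels)
print("class sizes:", Counter(labels))
```

Quantile cut points give roughly equal class sizes by construction, but on a heavily skewed target the class boundaries can end up very close together at the low end.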
Trained an LSTM on content+title using word2vec embeddings. Accuracy was only 37%, but RMSE was 0.8, which is not too bad. Trained an RNN on metadata and achieved 45% accuracy using all extracted metadata features; RMSE has not been calculated yet.
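The thread does not say how RMSE was computed for a classification model. One plausible reading, assumed here, is that it is taken over the integer class indices (0 to 3). Under that assumption, the sketch below uses hypothetical labels to show how low accuracy and an RMSE below 1 can coexist: every miss lands in a neighbouring class.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def class_rmse(y_true, y_pred):
    """RMSE over integer class indices, treating the classes as ordered."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical labels: 3 of 8 correct, and every error is off by one class.
y_true = [0, 1, 2, 3, 0, 1, 2, 3]
y_pred = [0, 2, 1, 3, 1, 1, 3, 2]

print(f"accuracy={accuracy(y_true, y_pred):.3f}")
print(f"class RMSE={class_rmse(y_true, y_pred):.3f}")
```

With these toy labels the accuracy is 0.375 and the class RMSE is about 0.79, so an RMSE under 1 mainly says that wrong predictions tend to fall in an adjacent class.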
TODO:
1.) Play around with how classes are defined. Maybe using quantiles is not the right way to go.
2.) Try a regression problem on the LSTM/RNN models and see what the RMSE is.
3.) Try to incorporate both metadata features and textual features somehow.
Played with how class values are decided. Decided to try some arbitrary ranges (0-500, 501-1000, 1001-5000, >5000) and used a weighted random sampler to handle class imbalance. After 150 epochs on content+title with punctuation removed, the LSTM reached 84-85% accuracy and ~0.70 RMSE.
This is promising: because the model uses no metadata features, it can be applied to unpublished articles.
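The fixed ranges and the weighted sampling described above can be sketched as follows. This is an illustration only: the retweet counts are hypothetical, the helper names are made up, and `random.choices` from the standard library stands in for whichever weighted random sampler was actually used (the thread does not name it). Per-sample inverse-frequency weights of this kind are what a sampler such as PyTorch's `WeightedRandomSampler` typically receives.

```python
import random
from bisect import bisect_left
from collections import Counter

# Inclusive upper bounds of the first three ranges (0-500, 501-1000,
# 1001-5000); anything larger falls into the last class (>5000).
UPPER_BOUNDS = [500, 1000, 5000]

def range_label(max_retweets):
    """Map a retweet count to a class index 0..3 using the fixed ranges."""
    return bisect_left(UPPER_BOUNDS, max_retweets)

def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

# Hypothetical, imbalanced retweet counts: most articles are in class 0.
retweets = [12, 40, 300, 450, 90, 7, 800, 650, 2400, 12000]
labels = [range_label(r) for r in retweets]
weights = inverse_frequency_weights(labels)

# Draw with replacement; each class now has the same total weight.
rng = random.Random(0)
resampled = rng.choices(labels, weights=weights, k=4000)

print("original class sizes: ", sorted(Counter(labels).items()))
print("resampled class sizes:", sorted(Counter(resampled).items()))
```

Since every class ends up with the same total weight, each class is drawn about equally often, at the cost of repeating the few samples in the rare classes many times.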
Also tried framing it as a regression problem with the LSTM model, but this only yielded an RMSE of 1.102 (using the class labels defined by the 4 quantiles).
Baseline models are completed. Milestone V will consist of tuning and combining models created in Milestone IV.
Closing this issue for now.
This is for training baselines on the new dataset.