[Viralness] Improving baseline

raaahulss / project_viralnews

MITS Project

MIT License


[Viralness] Improving baseline #76

Closed: g2des closed this issue 4 years ago

samteplov commented 4 years ago

This is for training baselines on the new dataset.

samteplov commented 4 years ago

Attempted a linear regression model on the new dataset using the following metadata features: follower_count, title_len, content_len, day of week, is weekend, title_polarity, title_subjectivity, content_polarity, content_subjectivity, media outlet.

Yielded the following results:
Mean Absolute Error: 125.01026896269398
Mean Squared Error: 232115.21050582628
Root Mean Squared Error: 481.78336470433084

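The RMSE is almost four times the MAE, so a small number of large errors dominate. Linear regression appears to be making conservative estimates, most likely under-predicting the articles with the highest retweet counts.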
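To make the three error metrics above concrete, here is a minimal sketch of how they can be computed with NumPy. The data is made up for illustration and is not from the project's dataset; the helper name `regression_errors` is hypothetical. It shows how a single badly under-predicted viral article pushes RMSE far above MAE.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = float(np.mean(np.abs(residuals)))
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))
    return mae, mse, rmse

# Made-up retweet counts (assumption): most articles are small, one goes viral.
y_true = [10, 20, 30, 40, 5000]
# A "conservative" model that predicts the same modest value everywhere.
y_pred = [100, 100, 100, 100, 100]

mae, mse, rmse = regression_errors(y_true, y_pred)
print(f"MAE={mae:.1f} MSE={mse:.1f} RMSE={rmse:.1f}")
```

Because MSE squares each residual, the one large miss contributes almost all of the squared error, which is the same pattern as an RMSE several times larger than the MAE.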

TODO:
1. See which features are actually good predictors of max_retweets (PCA).
2. Turn the problem into a multi-class classification problem using ranges of retweet counts.
3. Remove US News.
4. Eventually incorporate textual features from content and title.

samteplov commented 4 years ago

Created multi-class classification problem using 4 different class values (based on quantiles).
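As a rough illustration of quantile-based labels, the sketch below splits retweet counts into 4 classes at the 25th, 50th and 75th percentiles. The function name `quantile_classes` and the sample values are assumptions for illustration only; the thread does not say how the project computed its quantiles.

```python
import numpy as np

def quantile_classes(values, n_classes=4):
    """Assign each value a class index 0..n_classes-1 using quantile cut points."""
    values = np.asarray(values, dtype=float)
    # Interior cut points: for 4 classes these are the 25th, 50th and 75th percentiles.
    cuts = np.quantile(values, np.linspace(0, 1, n_classes + 1)[1:-1])
    # side="left": a value exactly equal to a cut point goes to the lower class.
    labels = np.searchsorted(cuts, values, side="left")
    return labels, cuts

# Made-up max_retweets values (assumption), heavily skewed like retweet data tends to be.
max_retweets = np.array([3, 12, 40, 75, 150, 400, 900, 6000])
labels, cuts = quantile_classes(max_retweets)
print("cut points:", cuts)
print("labels:", labels)
```

Quantile bins give roughly balanced classes by construction, but on skewed data the top class can span a very wide range of retweet counts.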

Trained an LSTM on content+title using word2vec. Accuracy achieved was only 37%, but RMSE was 0.8, which is not too bad. Trained an RNN on metadata and was able to achieve 45% accuracy using all extracted metadata features; RMSE has not been calculated yet.
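The thread does not say how RMSE was computed for a classifier. One plausible reading, and it is only an assumption, is that it is measured on the integer class indices (0 to 3), so that predicting a neighbouring class costs less than predicting a distant one. A minimal sketch with made-up labels and a hypothetical helper name:

```python
import numpy as np

def accuracy_and_label_rmse(true_labels, predicted_labels):
    """Return (accuracy, RMSE), where RMSE is taken over integer class indices."""
    t = np.asarray(true_labels, dtype=float)
    p = np.asarray(predicted_labels, dtype=float)
    accuracy = float(np.mean(t == p))
    rmse = float(np.sqrt(np.mean((t - p) ** 2)))
    return accuracy, rmse

# Made-up labels (assumption): every mistake is off by exactly one class.
true_labels = [0, 1, 2, 3, 0, 1, 2, 3]
predicted_labels = [0, 2, 2, 2, 1, 1, 3, 3]

accuracy, rmse = accuracy_and_label_rmse(true_labels, predicted_labels)
print(f"accuracy={accuracy:.2f} label RMSE={rmse:.3f}")
```

Under this reading, low accuracy together with an RMSE below 1 would mean that most wrong predictions land in an adjacent class.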

TODO:
1. Play around with how classes are defined. Maybe using quantiles is not the right way to go.
2. Try a regression problem on the LSTM/RNN models and see what the RMSE is.
3. Try to incorporate both metadata features and textual features somehow.

samteplov commented 4 years ago

Played with how class values are decided. Decided to try some arbitrary ranges (0-500, 501-1000, 1001-5000, >5000). Used a weighted random sampler to take care of class imbalance. After 150 epochs on content+title with punctuation removed, the LSTM was performing with 84-85% accuracy and ~0.70 RMSE.
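The two ideas above, fixed retweet ranges and weighted sampling, can be sketched as follows. This is a NumPy-only illustration under stated assumptions: the comment presumably refers to something like PyTorch's `WeightedRandomSampler`, but the exact setup is not given, so the helper names, the inverse-frequency weighting and the sample data here are all hypothetical.

```python
import numpy as np

# Upper edges of the first three classes; anything above 5000 is class 3.
BIN_EDGES = np.array([500, 1000, 5000])

def fixed_range_classes(max_retweets):
    """Map counts to classes 0-500 -> 0, 501-1000 -> 1, 1001-5000 -> 2, >5000 -> 3."""
    # right=True makes each upper edge inclusive, e.g. 500 falls in class 0.
    return np.digitize(np.asarray(max_retweets), BIN_EDGES, right=True)

def inverse_frequency_weights(labels, n_classes=4):
    """Per-sample weights so that each class present has the same total weight."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    class_weights = np.zeros(n_classes)
    present = counts > 0
    class_weights[present] = 1.0 / counts[present]
    return class_weights[labels]

# Made-up, imbalanced retweet counts (assumption): class 0 dominates.
retweets = np.array([0, 120, 500, 501, 1000, 4200, 5001, 80, 30, 7])
labels = fixed_range_classes(retweets)
weights = inverse_frequency_weights(labels)

# Draw a resampled index set with replacement, proportional to the weights.
rng = np.random.default_rng(0)
sampled = rng.choice(len(labels), size=4000, replace=True, p=weights / weights.sum())
sampled_fractions = np.bincount(labels[sampled], minlength=4) / len(sampled)
print("labels:", labels)
print("class fractions after resampling:", sampled_fractions)
```

With these weights, rare classes are drawn about as often as the dominant one, at the cost of repeating the few examples they contain.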

This is good because this model can be used on unpublished articles since no metadata features were used.

Also tried framing it as a regression problem with the LSTM model, but this only yielded an RMSE of 1.102 (using the classes labeled by 4 quantiles as targets).

samteplov commented 4 years ago

Baseline models are completed. Milestone V will consist of tuning and combining models created in Milestone IV.

Closing this issue for now.