training sample and SVM prediction accuracy for 2021

therealcyberlord / coronavirus_visualization_and_prediction

This repository tracks the spread of the novel coronavirus, also known as SARS-CoV-2. It is a contagious respiratory virus that first emerged in Wuhan in December 2019. On 2/11/2020, the disease was officially named COVID-19 by the World Health Organization.

https://www.kaggle.com/therealcyberlord/coronavirus-covid-19-visualization-prediction

76 stars, 62 forks

training sample and SVM prediction accuracy for 2021 #7

Open ntg24gr opened 3 years ago

ntg24gr commented 3 years ago

Hello, great work! I am trying to learn from your code, and I have a question about your train/test split: why did you hold out only 5% of the data for testing? As far as I know, the usual split is 80:20 for training:testing.

X_train_confirmed, X_test_confirmed, y_train_confirmed, y_test_confirmed = train_test_split(days_since_1_22[50:], world_cases[50:], test_size=0.05, shuffle=False)

In addition, since the beginning of the year the SVM has been predicting poorly, whereas it did very well before. What do you think is the reason? Overfitting? Thank you.
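For readers following along, here is a minimal sketch of what `test_size` does when `shuffle=False`: the test set is simply the most recent slice of the series. The arrays below are synthetic stand-ins for `days_since_1_22` and `world_cases` (not the repository's data), and `chronological_split` is a hypothetical helper name used only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (NOT the repository's data) for days_since_1_22 and world_cases.
days_since_1_22 = np.arange(450).reshape(-1, 1)
world_cases = np.cumsum(np.full(450, 1000.0))


def chronological_split(X, y, test_size):
    """Split without shuffling, so the test set is the most recent slice."""
    return train_test_split(X, y, test_size=test_size, shuffle=False)


# Compare the 5% hold-out from the question with the more common 20% hold-out.
for test_size in (0.05, 0.20):
    X_train, X_test, y_train, y_test = chronological_split(
        days_since_1_22[50:], world_cases[50:], test_size
    )
    print(
        f"test_size={test_size}: "
        f"train days {X_train.min()}-{X_train.max()}, "
        f"test days {X_test.min()}-{X_test.max()}"
    )
```

With a smaller `test_size`, more of the recent data ends up in the training set, at the cost of evaluating on fewer days.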

therealcyberlord commented 3 years ago

The lackluster performance of the SVM model is due to the massive vaccination effort taking place around the world. Because the model was fit mostly to pre-vaccination COVID data, it overestimates the rate of increase. I chose a larger training set so that the models can learn from the effects of the vaccines.

The SVM model performed poorly because its hyperparameters were optimized for pre-vaccination data, while the other models are optimized for the current data.
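The overestimation effect described here can be illustrated with a small sketch. It uses entirely synthetic numbers (not real case counts) and a plain quadratic fit via NumPy rather than the SVM from the notebook; the only point is that a trend fit to the period before a slowdown keeps extrapolating the old growth rate.

```python
import numpy as np

# Synthetic daily new cases (NOT real data): rising until day 200, then falling,
# standing in for a slowdown such as the one attributed to vaccination.
days = np.arange(260)
daily = np.where(days < 200, 100 + 10 * days, 2090 - 20 * (days - 200)).astype(float)
cumulative = np.cumsum(daily)

# Fit a quadratic trend using only the period before the slowdown.
cutoff = 200
coeffs = np.polyfit(days[:cutoff], cumulative[:cutoff], deg=2)
predicted = np.polyval(coeffs, days)

# Compare the extrapolation with the synthetic "actual" values after the cutoff.
overshoot = predicted[cutoff:] - cumulative[cutoff:]
print(f"overestimate on first day after cutoff: {overshoot[0]:.0f} cases")
print(f"overestimate on final day: {overshoot[-1]:.0f} cases")
```

The gap grows with the forecast horizon, which is consistent with a model that looked fine earlier and then drifted once the underlying trend changed.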

I may make changes in the future to address this, maybe even adding more localized predictions.

therealcyberlord commented 3 years ago

As you can see, increasing the testing set to 15% does not fix the problem with the model. I think it has more to do with the hyperparameters of the SVM model and the nature of the data.
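One way to test the hyperparameter explanation would be to re-tune the SVM on data that includes the recent period. The following is a rough sketch under stated assumptions, not the notebook's actual setup: the data is a synthetic curve, and the RBF kernel, the parameter grid, and the use of `TimeSeriesSplit` for validation are all illustrative choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for normalized cumulative cases (NOT real data):
# growth that accelerates and then slows down.
days = np.arange(300, dtype=float).reshape(-1, 1)
cases = 1.0 / (1.0 + np.exp(-(days.ravel() - 150.0) / 40.0))

# Hold out the most recent 15% without shuffling, as in the thread.
n_test = int(round(0.15 * len(days)))
X_train, X_test = days[:-n_test], days[-n_test:]
y_train, y_test = cases[:-n_test], cases[-n_test:]

# Scale the input, then search a small (assumed) grid of SVR hyperparameters.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {
    "svr__C": [0.1, 1.0, 10.0],
    "svr__gamma": [0.1, 1.0],
    "svr__epsilon": [0.001, 0.01],
}

# TimeSeriesSplit validates each candidate on folds that come after its training folds.
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

test_mae = float(np.mean(np.abs(search.predict(X_test) - y_test)))
print("best parameters:", search.best_params_)
print("MAE on held-out recent days:", test_mae)
```

Even with re-tuned hyperparameters, a single-input model may extrapolate poorly, so the held-out error is worth checking rather than assuming the search fixes the problem.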

I think a multi-variable COVID prediction model would give more accurate results than a single-variable one.