re4lvanshsingh / Codeforces_Codechef_Converter

Converts ratings between two popular competitive programming platforms: Codeforces and Codechef

Train various ML Models on the dataset #3

Closed re4lvanshsingh closed 7 months ago

re4lvanshsingh commented 7 months ago

Train various ML models on the divided dataset (70:30 ratio for training and testing), such as Polynomial Regression, Neural Networks, etc.

Report your observations to me in the form of a presentation (pptx) with the test accuracy and F1-score. Make sure to mention the degree of polynomial chosen in models such as polynomial regression.

Before training the ML models, make sure the dataset is big enough (at least 100 data points). If there are not enough data points, use web scraping or manually find the ratings of the same person on both Codeforces and Codechef to increase the number of data points.

More points for better-quality work. Currently posted as a medium issue, but it can easily be made a hard one.
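A minimal sketch of the requested 70:30 split plus a polynomial regression fit (the paired ratings here are synthetic stand-ins for the real dataset; column names and the degree are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical paired ratings: Codechef (input) vs. Codeforces (target).
rng = np.random.default_rng(0)
cc = rng.uniform(1200, 2200, size=150)          # 150 data points (100+ as required)
cf = 0.9 * cc + 100 + rng.normal(0, 50, size=150)

X = cc.reshape(-1, 1)
# 70:30 train/test split, as requested in the issue.
X_train, X_test, y_train, y_test = train_test_split(
    X, cf, train_size=0.7, random_state=42
)

degree = 2  # report the chosen polynomial degree, as requested
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"degree={degree}, test MSE={mse:.1f}")
```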

HavokSahil commented 7 months ago

@re4lvanshsingh I am very interested in working on this. Please assign this issue to me.

re4lvanshsingh commented 7 months ago

@HavokSahil After you correct your pull request, I will assign it.

re4lvanshsingh commented 7 months ago

@HavokSahil Great work on the last issue. You can proceed with this issue now. Assigned it.

Some clarity regarding what I want:

Train the following ML models on the dataset:

- Linear Regression
- Ridge Regression
- Lasso Regression
- Elastic Net Regression
- Decision Trees
- Random Forest
- Gradient Boosting (e.g., XGBoost, LightGBM)
- Support Vector Regression (SVR)
- k-Nearest Neighbors (kNN)
- Gaussian Process Regression
- Neural Networks (Feedforward or Deep Learning)
- Bayesian Regression
- AdaBoost
- Extra Trees
- Bagging Regressor
- Isotonic Regression
- Huber Regressor
- Passive Aggressive Regressor
- Theil-Sen Regressor
- Locally Estimated Scatterplot Smoothing (LOESS)
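Most of these are available in scikit-learn and can be looped over with a shared fit/predict interface. A sketch over a few of them, with synthetic stand-in data (the actual dataset and hyperparameters are up to the assignee):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical ratings dataset standing in for the real one.
rng = np.random.default_rng(1)
X = rng.uniform(1200, 2200, size=(120, 1))
y = 0.9 * X[:, 0] + 100 + rng.normal(0, 40, size=120)

# A uniform interface makes it easy to extend this dict with
# the remaining models from the list above.
models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}
results = {
    name: mean_squared_error(y, m.fit(X, y).predict(X))
    for name, m in models.items()
}
```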

NOTE: Preprocessing steps such as feature scaling, normalization, and feature engineering may also impact model performance.
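One way to keep such preprocessing consistent between training and prediction is a scikit-learn `Pipeline`; a minimal sketch (the tiny data and the `alpha` value are placeholders):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Hypothetical toy data standing in for the ratings dataset.
X = np.array([[1400.0], [1600.0], [1800.0], [2000.0]])
y = np.array([1500.0, 1700.0, 1900.0, 2100.0])

# Scaling inside the pipeline is fitted on the training data only,
# then applied identically at prediction time.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
pred = model.predict([[1700.0]])
```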

For each ML model, report:

1) Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values. Lower values indicate better performance.

2) Root Mean Squared Error (RMSE): The square root of MSE, providing an interpretable metric in the same units as the target variable.

3) Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values.

4) R-squared (R²): Represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). R² ranges from 0 to 1, with 1 indicating a perfect fit.

5) Adjusted R-squared: An adjusted version of R² that penalizes for the number of predictors, providing a more reliable measure for models with multiple features.

6) Coefficient of Determination (COD): Similar to R², indicating the proportion of the variance in the dependent variable explained by the model.

7) Mean Percentage Error (MPE): Measures the percentage difference between predicted and actual values.

8) Mean Absolute Percentage Error (MAPE): Provides the mean percentage difference between predicted and actual values, emphasizing accuracy on a percentage scale.

9) Explained Variance Score: Measures the proportion of variance in the dependent variable explained by the model.

10) F-statistic and p-value: Used in the context of ANOVA to test the overall significance of the regression model.
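Most of these metrics are one-liners in `sklearn.metrics`; adjusted R² can be derived from R². A sketch with placeholder predictions:

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    mean_absolute_percentage_error,
    explained_variance_score,
)

# Hypothetical actual vs. predicted ratings.
y_true = np.array([1500.0, 1700.0, 1900.0, 2100.0])
y_pred = np.array([1520.0, 1680.0, 1950.0, 2080.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                      # same units as the target
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
evs = explained_variance_score(y_true, y_pred)

# Adjusted R² penalizes for the number of predictors p.
n, p = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```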

DONE. Just report the entire set of observations in a Jupyter Notebook/Google Colab notebook.

HavokSahil commented 7 months ago

@re4lvanshsingh I have almost completed the work and the report. I wanted to talk to you about an issue; can you provide your email or Discord handle?

re4lvanshsingh commented 7 months ago

@HavokSahil Sure. Go to the CodePeak Discord and scroll to the project Codeforces and Codechef Converter. There you will find my id: thegamingnut. DM me there or in the channel, whichever you prefer.

HavokSahil commented 7 months ago

@re4lvanshsingh Here is the link to the Jupyter Notebook. Before going through it, please check your Discord DM: https://colab.research.google.com/drive/1u-bkwVM8taIttF9M3X6EECtmWM04wCzX?usp=sharing

re4lvanshsingh commented 7 months ago

@HavokSahil A big screw-up on my part: I wanted you to correlate the ratings between the Codeforces and Codechef platforms using the ML models, not the correlation between contest ranks and ratings.

Fix the above Google Colab by training the models on the new dataset of Codeforces rating vs. Codechef rating.

re4lvanshsingh commented 7 months ago

@HavokSahil I will give you enough points, so don't worry about that part.

HavokSahil commented 7 months ago

@re4lvanshsingh I have resolved that part. I haven't removed the plot of ranks because it is still informative; however, only ratings have been used for training and evaluating the model. (Screenshot from 2023-12-15 17-44-00 attached.)

Here is the updated file: https://colab.research.google.com/drive/1N0q9RB5ZHYYm9gnRVk5XnI19vBvP600H?usp=sharing

re4lvanshsingh commented 7 months ago

@HavokSahil All the ML Models have been trained on their own platform data:

(screenshot attached)

meaning train_CCx and train_CCy are used to train the model. What I wanted was the correlation between the two platforms.

Basically, the data should have been of the form CC_rating on the x-axis and CF_rating on the y-axis. That would give me a Codeforces rating if I input a rating number for Codechef.
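In other words, the requested mapping is a single-input regression from CC_rating to CF_rating; a sketch with hypothetical paired ratings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical paired ratings: CC_rating is the input, CF_rating is the target.
cc_rating = np.array([[1400.0], [1600.0], [1800.0], [2000.0]])
cf_rating = np.array([1350.0, 1550.0, 1750.0, 1950.0])

model = LinearRegression().fit(cc_rating, cf_rating)
# Given a Codechef rating, predict the corresponding Codeforces rating.
predicted_cf = model.predict([[1700.0]])
```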

HavokSahil commented 7 months ago

@re4lvanshsingh That's exactly what I did; it's just the naming convention. When you look at the columns: train_CCx contains the Codechef ratings (last five) and train_CCy contains the Codeforces rating.

train_CFx contains the Codeforces rating (max rating and last five) and train_CFy contains the Codechef rating. (Image IMG_20231216_073429 attached.) You can see it here: 'Current Rating' is the Codeforces feature I used for the CC label, and 'CRating1_y' is the last Codechef rating, which I used for the CF label.

I named it this way to avoid putting both CC and CF, like train_CC_to_CF, in both datasets; that would be confusing and painful to type.

re4lvanshsingh commented 7 months ago

@HavokSahil OK, nice. My bad. Create a PR for this issue by downloading the .ipynb notebook associated with the Google Colab. I need a PR purely for documentation purposes. Also comment on the mpld3 issue to get it assigned to you.