mozilla / bugbug

Platform for Machine Learning projects on Software Engineering
Mozilla Public License 2.0
503 stars 312 forks

Restrict the training set of the RegressionRange model only to regressions #793

Closed marco-c closed 1 year ago

Axhie commented 1 year ago

RegressionModel

A regression model describes the relationship between one or more independent variables and a response (dependent, or target) variable. Predictive modelling techniques such as regression may be used to determine the relationship between a dataset's dependent (target) and independent variables. Regression is widely used when the dependent and independent variables are linked in a linear or non-linear fashion and the target variable takes a range of continuous values. Regression approaches thus help establish causal relationships between variables, model time series, and forecast. Different regression models for the training set:

  1. Linear regression is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable; the variable you use to predict it is called the independent variable.
  2. Logistic regression (also known as the logit model) is a statistical model often used for classification and predictive analytics. It estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables.
  3. Polynomial regression is used to represent a non-linear relationship between dependent and independent variables. It is a variant of the multiple linear regression model, except that the best-fit line is curved rather than straight.
  4. Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering.
  5. Lasso regression is a regularization technique used over regression methods for a more accurate prediction. It uses shrinkage, where data values are shrunk towards a central point such as the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters).
  6. Quantile regression is a type of regression analysis used in statistics and econometrics. Whereas the method of least squares estimates the conditional mean of the response variable across values of the predictor variables, quantile regression estimates the conditional median (or other quantiles) of the response variable.
  7. Bayesian linear regression is a regression analysis technique used in machine learning that applies Bayes' theorem to calculate the regression coefficients' values. Rather than determining the least-squares estimates, this technique determines the features' posterior distribution.
  8. Principal component regression first uses principal component analysis (PCA) to transform the training data, and the resulting transformed samples are then used to train the regression.
  9. Partial least squares regression is a fast and efficient covariance-based regression analysis technique. It is advantageous for regression problems with many independent variables and a high probability of multicollinearity between them. The method reduces the variables to a manageable number of predictors, which are then used in a regression.
  10. Elastic net regression combines the ridge and lasso techniques and is particularly useful when dealing with strongly correlated data. It regularizes regression models by utilizing the penalties associated with both the ridge and lasso methods.
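To make the regularized variants above (ridge, lasso, elastic net) concrete, here is a minimal sketch using scikit-learn on a small synthetic dataset with two nearly collinear features; the data and parameter values are illustrative assumptions, not taken from bugbug:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data: x2 is nearly collinear with x1, the third feature is pure noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2, rng.normal(size=200)])
y = 3 * x1 + rng.normal(scale=0.1, size=200)  # only x1 carries signal

results = {}
for name, model in [
    ('ridge', Ridge(alpha=1.0)),
    ('lasso', Lasso(alpha=0.1)),
    ('elastic net', ElasticNet(alpha=0.1, l1_ratio=0.5)),
]:
    model.fit(X, y)
    results[name] = model.coef_
    print(name, np.round(model.coef_, 2))
```

Ridge tends to spread weight across the two correlated features, while lasso's L1 penalty zeroes out redundant coefficients, producing a sparser model; elastic net sits between the two.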
Axhie commented 1 year ago

Linear regression on a training set

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load the housing dataset
housing = pd.read_csv('housing.csv')
housing.shape

# Hold out 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(
    housing.median_income, housing.median_house_value, test_size=0.2)

# Fit a simple linear regression of house value on median income
regr = LinearRegression()
regr.fit(np.array(x_train).reshape(-1, 1), y_train)

# Predict on the held-out set
preds = regr.predict(np.array(x_test).reshape(-1, 1))
y_test.head()
preds

# Inspect the residuals and report the RMSE
residuals = preds - y_test
plt.hist(residuals)
mean_squared_error(y_test, preds) ** 0.5
```

reference: https://towardsdatascience.com/linear-regression-on-housing-csv-data-kaggle-10b0edc550ed

Axhie commented 1 year ago

Link to my Google Colab notebook using linear regression on the training dataset:

https://colab.research.google.com/drive/1G9Ttnqo5qwKZhN4RY6N5Gvo3-D_9tO9g?usp=sharing