Open s2t2 opened 4 years ago
Use material from this new analytics course: https://github.com/prof-rossetti/data-analytics-in-python
There are some example notebooks here:
https://drive.google.com/drive/folders/1pQVM5bq0ykGuXF_JRfvEDFbdd95kPISv?usp=sharing
and slides here:
https://docs.google.com/presentation/d/13fYiA3E5yADSLlScBuL5lbnWY0hKrkqstrEk1DyFkM4/edit?usp=sharing
Split the data, as necessary:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=99)
feature_cols = ["feature col a", "feature col b", "feature col c"]
target_col = "labels col"
x_train = df_train[feature_cols]
y_train = df_train[target_col]
x_test = df_test[feature_cols]
y_test = df_test[target_col]
Choose a model (and corresponding metrics):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score #, mean_absolute_error, mean_squared_error
model = LinearRegression()
Train the model:
model.fit(x_train, y_train)
Score the model on training data:
y_train_pred = model.predict(x_train)
print("R^2 SCORE (TRAIN):", r2_score(y_train, y_train_pred))
Score the model on test data:
y_test_pred = model.predict(x_test)
print("R^2 SCORE (TEST):", r2_score(y_test, y_test_pred))
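For reference, the steps above can be run end to end with something like the following. This is only a sketch: the DataFrame is synthetic stand-in data (an assumption, not course material), the `report` helper is a hypothetical name, and it also uses the two metrics that are commented out of the import above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption for illustration only):
# the target is a known linear combination of the features plus small noise.
rng = np.random.default_rng(99)
df = pd.DataFrame({
    "feature col a": rng.normal(size=200),
    "feature col b": rng.normal(size=200),
    "feature col c": rng.normal(size=200),
})
df["labels col"] = (
    3.0 * df["feature col a"]
    - 2.0 * df["feature col b"]
    + 0.5 * df["feature col c"]
    + rng.normal(scale=0.1, size=200)
)

feature_cols = ["feature col a", "feature col b", "feature col c"]
target_col = "labels col"

# Split rows first, then select features (x) and target (y) from each part.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=99)
x_train, y_train = df_train[feature_cols], df_train[target_col]
x_test, y_test = df_test[feature_cols], df_test[target_col]

model = LinearRegression()
model.fit(x_train, y_train)


def report(label, y_true, y_pred):
    """Hypothetical helper: print and return a few regression metrics."""
    scores = {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mean_squared_error(y_true, y_pred),
    }
    print(label, {name: round(value, 4) for name, value in scores.items()})
    return scores


train_scores = report("TRAIN:", y_train, model.predict(x_train))
test_scores = report("TEST:", y_test, model.predict(x_test))
```

Comparing the train and test scores side by side is the point of scoring both: a large gap between them would suggest overfitting.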
The monthly sales predictions exercise in unit 5B is a little weak and should be replaced with something else, like the Titanic Kaggle competition, which is a lot more fun and instructive:
https://www.kaggle.com/c/titanic
https://www.kaggle.com/c/titanic/data
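A rough idea of how the same workflow might carry over to a Titanic-style exercise. Since survival is a yes/no label, this sketch swaps in `LogisticRegression` and `accuracy_score` in place of linear regression and R^2. The rows below are synthetic stand-ins, not the real Kaggle data, and the survival rule used to generate them is made up purely so the model has something to learn. The column names (`Pclass`, `Sex`, `Age`, `Fare`, `Survived`) follow the competition's training file, which students would load with `pd.read_csv` after downloading it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in rows (assumption): NOT the real Kaggle data.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.uniform(1, 80, size=n).round(),
    "Fare": rng.uniform(5, 250, size=n).round(2),
})
# Arbitrary made-up rule, only there to give the labels some signal.
logit = (df["Sex"] == "female") * 2.0 - (df["Pclass"] - 2) * 0.8 - 1.0
prob = 1 / (1 + np.exp(-logit))
df["Survived"] = (rng.random(n) < prob).astype(int)

# Minimal preprocessing: encode text as numbers, fill any missing ages.
df["Sex"] = (df["Sex"] == "female").astype(int)
df["Age"] = df["Age"].fillna(df["Age"].median())

feature_cols = ["Pclass", "Sex", "Age", "Fare"]
target_col = "Survived"

df_train, df_test = train_test_split(df, test_size=0.2, random_state=99)
x_train, y_train = df_train[feature_cols], df_train[target_col]
x_test, y_test = df_test[feature_cols], df_test[target_col]

# max_iter is raised because the features are not scaled.
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

train_acc = accuracy_score(y_train, model.predict(x_train))
test_acc = accuracy_score(y_test, model.predict(x_test))
print("ACCURACY (TRAIN):", train_acc)
print("ACCURACY (TEST):", test_acc)
```

The split, fit, and score steps stay the same as in the regression version; only the model, the metric, and the preprocessing change, which is part of what makes it a nice follow-on exercise.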