coderschoolreview commented 5 years ago

You could write it in this style for cleaner, easier-to-read code:

condition_1 = dataframe['colA'] == 'Value 1'
condition_2 = dataframe['colB'] > 100

# number of rows in dataframe that satisfy both conditions
(condition_1 & condition_2).sum()

Overall, most answers are correct; 2 questions are missing:

Hard: What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)
Implement a bar plot for top 5 most popular email providers/hosts
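
One possible way to approach the two missing questions, as a minimal sketch. It assumes the addresses live in a column named `Email` (a hypothetical name), and the sample rows are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; the real notebook would use its own DataFrame.
df = pd.DataFrame({
    "Email": [
        "a@gmail.com", "b@yahoo.com", "c@gmail.com", "d@hotmail.com",
        "e@gmail.com", "f@yahoo.com", "g@outlook.com", "h@aol.com",
        "i@icloud.com", "j@hotmail.com",
    ]
})

# Keep everything after the last "@" as the provider/host.
hosts = df["Email"].str.rsplit("@", n=1).str[-1]

# Count each host and keep the 5 most frequent.
top5 = hosts.value_counts().head(5)
print(top5)

# Bar plot of the result (requires matplotlib):
# top5.plot(kind="bar", title="Top 5 email providers")
```

`value_counts()` already sorts in descending order, so `head(5)` is enough to get the top 5.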

coderschoolreview commented 5 years ago

Missing question 16

For reference: Regression Metric

Answer

from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(y_test, y_predict)
mse = mean_squared_error(y_test, y_predict)
rmse = mse ** 0.5

coderschoolreview commented 5 years ago

Good work, all answers are correct.

It didn't work because the folder "data" didn't exist. You could fix this by creating a folder named data in the same folder as the notebook.

coderschoolreview commented 5 years ago

Assignment 4:

Visualization:

It is not necessary to use lmplot.
You could use subplots for a neater visualization.
Ref: Check out Mai's notebook: https://github.com/janetvn/coderschool_mc_w04_assignment
Subplot ref : https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html

Model:

Should not use Logistic Regression as-is, since this is a multi-class classification problem (more than 2 classes; we have 3 species).
Check out this article for reference on Logistic Regression in this case.
In the future, try fine-tuning the model with GridSearchCV/RandomizedSearchCV.
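
A minimal sketch of tuning with GridSearchCV. Everything here is an assumption made for illustration: it uses scikit-learn's bundled iris data as a stand-in for the assignment's 3-species dataset, and `KNeighborsClassifier` is only an example estimator, not a recommendation from this review:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 3 classes, like the 3 species in the assignment.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Every combination in the grid is scored with 5-fold cross-validation.
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
# After fitting, the search object behaves like the best estimator.
print(search.score(X_test, y_test))
```

RandomizedSearchCV takes a similar set of arguments, but samples a fixed number of combinations (`n_iter`) instead of trying all of them.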

coderschoolreview commented 5 years ago

Thao Ho: Assignment 5

The goal of this assignment was to introduce you to the following concepts in Machine Learning:

Data quality check: missing values, anomalies, ...
Exploratory Data Analysis: class ratio, relationships between features, relationships between features and the target, and data visualization.
Preprocessing data: handling an imbalanced dataset, dealing with categorical variables.
Modeling & evaluation: building and evaluating classifiers, hyperparameter tuning.

Things you did well

Checking data quality: checking for missing values and using missingno to visualize them
EDA: class ratio, finding out that the data is imbalanced, visualizing relationships between features
Preprocessing: undersampling / oversampling, train-test split

Things to work on

Preprocessing categorical features: try pandas.get_dummies
GridSearchCV and RandomizedSearchCV: a common practice is to use RandomizedSearchCV to narrow down the search range, then fine-tune with GridSearchCV; you can read this article
Modeling & evaluation: train and evaluate multiple models by comparing their classification reports against each other
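
A minimal sketch of the `pandas.get_dummies` suggestion. The column names and values are hypothetical, not taken from the assignment data:

```python
import pandas as pd

# Hypothetical data with one numeric and two categorical columns.
df = pd.DataFrame({
    "age": [25, 40, 31],
    "job": ["admin", "technician", "admin"],
    "marital": ["single", "married", "single"],
})

# One-hot encode only the listed categorical columns;
# numeric columns are passed through unchanged.
encoded = pd.get_dummies(df, columns=["job", "marital"])
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, named `<column>_<category>`.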

Minor tips

You almost got it right in Modeling and Evaluation. There is one minor mistake in the definition of the model evaluation function; you should have written it like this:

# Import numpy and the metrics used below
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score

# Utility function: takes a trained model and returns its recall and precision
# on X and y, together with the model name and a description
def evaluate_model(estimator, X, y, description):  # your version was missing the `description` argument
    prediction = estimator.predict(X)
    np.set_printoptions(precision=2)
    model_name = type(estimator).__name__
    return {'name': model_name,
            'recall': recall_score(y, prediction),
            'precision': precision_score(y, prediction),
            'description': description}

Installing packages missingno (or any arbitrary package) on Win10:
- Go to Anaconda Prompt
- Type conda install -c conda-forge missingno
Filtering series:
- Your code: train_copy.isnull()
- You could write it like this to get only the columns that contain null values:
```
ncols = train_copy.isnull().sum()
ncols[ncols != 0]
```
Check whether the whole DataFrame has any null values:

train.isnull().any().any()

For the evaluation function, print the results instead of returning them, so that when you loop through a list of models and evaluate them, the result of each iteration is printed to the output.
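
A minimal sketch of the printing approach described in the last tip. The function name `print_evaluation`, the synthetic data, and the two models are all assumptions for illustration, not the assignment's own code:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def print_evaluation(estimator, X, y, description):
    """Print the confusion matrix and classification report of a fitted model."""
    prediction = estimator.predict(X)
    print(f"{type(estimator).__name__}: {description}")
    print(confusion_matrix(y, prediction))
    print(classification_report(y, prediction))


# Synthetic, imbalanced binary data as a stand-in for the assignment data.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Hypothetical list of (model, description) pairs to compare.
models = [
    (LogisticRegression(max_iter=1000), "baseline linear model"),
    (DecisionTreeClassifier(random_state=0), "single decision tree"),
]

# Fit each model and print its report, one after another.
for model, description in models:
    model.fit(X_train, y_train)
    print_evaluation(model, X_test, y_test, description)
```

Because the function prints rather than returns, each model's report shows up in the notebook output as the loop runs.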

coderschoolreview commented 5 years ago

Assignment 6

Goal of this Assignment

The goal of this assignment was to introduce you to the following concepts:

Unsupervised Learning (KMeans, Hierarchical Clustering)
PCA

You learned how to use PCA for dimensionality reduction, as well as KMeans and Hierarchical Clustering. You also learned to visualize the results of both techniques.

Things you did well:

Utilizing Pandas to read, summarize, and visualize data.
You almost got everything right.

To sum up:

Great work! This shows that you understood the concepts and are able to apply them. Keep it up!

Few minor tips

Could try this shorter version:

total_purchases = data.sum(axis=1)
purchase_percent = data.div(total_purchases, axis=0) * 100
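
To see what the two lines above do, here is a small worked example. The column names and numbers are hypothetical:

```python
import pandas as pd

# Hypothetical purchase amounts: one row per customer, one column per category.
data = pd.DataFrame({
    "product_a": [20, 10],
    "product_b": [30, 10],
    "product_c": [50, 20],
})

# Row-wise totals: one total per customer.
total_purchases = data.sum(axis=1)

# Divide every row by its own total (axis=0 aligns on the row index),
# then scale to percentages.
purchase_percent = data.div(total_purchases, axis=0) * 100
print(purchase_percent)
```

Each row of `purchase_percent` adds up to 100, since every value is that row's share of its own total.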

hongoclanthao / ML-C-CoderSchool

Thanks! #1
