gimseng / 99-ML-Learning-Projects

A list of 99 machine learning projects for anyone interested to learn from coding and building projects
MIT License
576 stars, 174 forks

[EXE] pt1: Simple Decision Tree exercise, pt2: Pipelines #87

Open iakovidva opened 3 years ago

iakovidva commented 3 years ago

Learning Goals

Part 1:

Part 2:

Exercise Statement

Part 1: Train several decision trees to detect breast cancer using the Breast Cancer Wisconsin (Diagnostic) dataset (scikit-learn 7.2.7. Breast cancer wisconsin (diagnostic) dataset). The goal is to predict whether a tumor is malignant or benign.
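A minimal sketch of what Part 1 could look like, assuming scikit-learn is installed. The train/test split, the three tree configurations, and the use of accuracy as the metric are illustrative choices, not part of the exercise statement.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the features and labels (in scikit-learn, 0 = malignant, 1 = benign).
X, y = load_breast_cancer(return_X_y=True)

# Hold out a stratified test set; the 25% size is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A few hypothetical tree configurations to compare.
configs = [
    {"criterion": "gini", "max_depth": None},
    {"criterion": "gini", "max_depth": 3},
    {"criterion": "entropy", "max_depth": 3},
]

results = {}
for params in configs:
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(X_train, y_train)
    # Store the test accuracy keyed by (criterion, max_depth).
    results[(params["criterion"], params["max_depth"])] = clf.score(X_test, y_test)

for key, accuracy in results.items():
    print(key, round(accuracy, 3))
```

Comparing an unrestricted tree with depth-limited ones is one simple way to see how tree complexity affects generalization on this dataset.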

Part 2: Apply various transformers (imputers, encoders, and scalers) using Pipelines with a DecisionTreeClassifier. Use grid search (GridSearchCV) to find the best parameters. The goal is to predict whether income exceeds $50K/yr based on census data.
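The Part 2 workflow might be sketched as follows. Because income.csv is not reproduced here, the example builds a small synthetic DataFrame; the rows, the toy label, the imputation strategies, and the parameter grid are all assumptions made for illustration. Only the column names are borrowed from the census data description.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for income.csv (hypothetical values).
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "hours-per-week": rng.integers(10, 60, n).astype(float),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], n),
    "education": rng.choice(["HS-grad", "Bachelors", "Masters"], n),
})
# Toy binary label, computed before any values are removed.
y = (df["age"] + df["hours-per-week"] > 78).astype(int)

# Inject some missing values so the imputers have work to do.
df.loc[::10, "age"] = np.nan
df.loc[5::10, "workclass"] = np.nan

numeric = ["age", "hours-per-week"]
categorical = ["workclass", "education"]

# Numeric columns: impute the median, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: impute the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric),
    ("cat", categorical_steps, categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Step names are prefixed with "tree__" to reach the classifier's parameters.
param_grid = {
    "tree__max_depth": [2, 4, None],
    "tree__min_samples_leaf": [1, 5],
}
search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
search.fit(df, y)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping the preprocessing inside the Pipeline means the imputers, scaler, and encoder are refit on each cross-validation training fold, so no information from the validation fold leaks into them.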

Prerequisites

DecisionTreeClassifier, Pipeline, SimpleImputer, StandardScaler, OneHotEncoder, ColumnTransformer, GridSearchCV

Data source/summary:

Part 1: 569 instances with 30 numeric attributes. Class distribution: 212 malignant, 357 benign. Follow the link below for the full description of the dataset. https://scikit-learn.org/stable/datasets/#breast-cancer-wisconsin-diagnostic-dataset
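The figures above can be checked directly against the copy of the dataset bundled with scikit-learn. This is a small sketch, assuming scikit-learn and NumPy are installed:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Shape of the feature matrix: (instances, attributes).
n_samples, n_features = data.data.shape

# Count the instances per class, keyed by the class name.
counts = {
    str(name): int(count)
    for name, count in zip(data.target_names, np.bincount(data.target))
}

print(n_samples, n_features, counts)
# → 569 30 {'malignant': 212, 'benign': 357}
```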

Part 2: income.csv is used as the training set. It has 32561 instances with 14 attributes: 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).

income_test.csv is used for testing and for reporting scores. It has 15315 instances with the same 14 attributes: 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).

The goal is to predict whether income exceeds $50K/yr based on census data. Link: http://archive.ics.uci.edu/ml/datasets/Adult

(Optional) Further Links/Credits to Relevant Resources:

This exercise was assigned in the machine learning course at Aristotle University of Thessaloniki, and the solution is my submission for it.

gimseng commented 3 years ago

@iakovidva Great idea! Please feel free to work on it and create a PR when you are done. Do check out the other existing projects and the contributing guidelines to get a sense of the practices and format. Please let us know if you have any questions. Thanks!