Open iakovidva opened 3 years ago
@iakovidva Great idea! Please feel free to work on it and create a PR when you are done. Do check out the other existing projects and the contributing guidelines to get a sense of the practices and format we follow. Please let us know if you have any questions. Thanks!
Learning Goals
Part 1:
Part 2:
Exercise Statement
Part 1: Apply different Decision Trees to train a model for detecting breast cancer using the breast-cancer-wisconsin-diagnostic-dataset (scikit-learn 7.2.7. Breast cancer wisconsin (diagnostic) dataset). The goal is to predict whether a tumor is malignant or benign.
Part 2: Apply various transformations (imputers, encoders, scalers) using Pipelines with DecisionTreeClassifiers. Use grid search (GridSearchCV) to find the best parameters. The goal is to predict whether income exceeds $50K/yr based on census data.
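One possible starting point for Part 1 is sketched below. It is not the submitted solution: the 75/25 split, the random_state values, and the particular criterion/max_depth combinations are assumptions chosen only to illustrate what "different Decision Trees" could mean.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset bundled with scikit-learn (target 0 = malignant, 1 = benign).
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the data for evaluation (split size is an arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative tree variants: two split criteria, with and without a depth limit.
configs = [
    {"criterion": "gini", "max_depth": None},
    {"criterion": "entropy", "max_depth": None},
    {"criterion": "gini", "max_depth": 3},
    {"criterion": "entropy", "max_depth": 3},
]

scores = {}
for params in configs:
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(X_train, y_train)
    key = (params["criterion"], params["max_depth"])
    scores[key] = accuracy_score(y_test, clf.predict(X_test))

# Report test accuracy for each variant.
for key, acc in scores.items():
    print(key, round(acc, 3))
```

Accuracy is used here only for brevity; since the classes are imbalanced (212 vs. 357), precision, recall, or F1 may be more informative for the actual exercise.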
Prerequisites
DecisionTreeClassifier, Pipeline, SimpleImputer, StandardScaler, OneHotEncoder, ColumnTransformer, GridSearchCV
Data source/summary:
Part 1: 569 instances with 30 numeric attributes. Class distribution: 212 Malignant, 357 Benign. Follow the link below for the full description of the dataset. https://scikit-learn.org/stable/datasets/#breast-cancer-wisconsin-diagnostic-dataset
Part 2: income.csv is used as the training set. 32561 instances with 14 attributes, 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).
income_test.csv is used for testing and reporting scores. 15315 instances with 14 attributes, 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).
The goal is to predict whether income exceeds $50K/yr based on census data. Link: http://archive.ics.uci.edu/ml/datasets/Adult
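The Part 2 workflow (imputing, scaling, encoding, then a grid search over a DecisionTreeClassifier) might be wired together roughly as follows. This is a minimal sketch under stated assumptions: the small DataFrame is a synthetic, hypothetical stand-in for income.csv (only four of the columns, with made-up values and a made-up label), and the parameter grid is an arbitrary example rather than the grid used in the original submission.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for income.csv (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "hours-per-week": rng.integers(10, 60, n).astype(float),
    "workclass": rng.choice(["Private", "Self-emp", "Gov"], n),
    "education": rng.choice(["HS-grad", "Bachelors", "Masters"], n),
})
# Synthetic label, invented purely so the example has something to learn.
y = ((df["age"] > 40) & (df["hours-per-week"] > 35)).astype(int)

# Inject a few missing values so the imputers have work to do.
df.loc[::17, "age"] = np.nan
df.loc[::23, "workclass"] = np.nan

numeric_features = ["age", "hours-per-week"]
categorical_features = ["workclass", "education"]

# Numeric columns: impute, then scale.
numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical columns: impute with the most frequent value, then one-hot encode.
categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_features),
    ("cat", categorical_pipe, categorical_features),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Example grid: step names are joined with double underscores.
param_grid = {
    "tree__max_depth": [2, 4, 8],
    "tree__criterion": ["gini", "entropy"],
    "preprocess__num__imputer__strategy": ["mean", "median"],
}
search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
search.fit(df, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the preprocessing inside the Pipeline means the imputers, scaler, and encoder are refit on each cross-validation training fold, so the fitted search can later be applied to income_test.csv without leaking test statistics into training.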
(Optional) Further Links/Credits to Relevant Resources:
This exercise was assigned in the machine learning course at Aristotle University of Thessaloniki, and the solution is my submission for that assignment.