henryliangt / usyd


5318 w5 #7

Open henryliangt opened 1 year ago

henryliangt commented 1 year ago

```shell
conda install -c conda-forge python-graphviz
conda install -c anaconda pip
echo %PATH%
where python
where conda
```

```python
import graphviz
from graphviz import Digraph
```

henryliangt commented 1 year ago

Which attribute to select?

The measure of purity we will use is called information gain. It is based on another measure called entropy, which quantifies homogeneity (purity).
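A minimal sketch of both measures, using a hypothetical 8-example dataset and a made-up binary split (the data and split are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_partitions):
    """Parent entropy minus the size-weighted entropy of the child partitions."""
    n = len(parent_labels)
    weighted = sum(len(p) / n * entropy(p) for p in child_partitions)
    return entropy(parent_labels) - weighted

labels = ['+'] * 5 + ['-'] * 3                          # 5 positive, 3 negative
split = [['+', '+', '+', '-'], ['+', '+', '-', '-']]    # hypothetical binary split
print(round(entropy(labels), 3))                        # 0.954
print(round(information_gain(labels, split), 3))        # 0.049
```

A pure partition has entropy 0; a 50/50 partition has entropy 1 bit, so higher gain means the split produces more homogeneous children.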

henryliangt commented 1 year ago

Different post-pruning methods, e.g.:
• sub-tree replacement
• sub-tree raising
• converting the tree to rules and then pruning them

henryliangt commented 1 year ago

Gain ratio is a modification of information gain that reduces its bias towards highly branching attributes
• It takes the number of branches into account when choosing an attribute and penalizes highly-branching attributes
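A sketch of gain ratio as information gain divided by split info (the entropy of the partition sizes), using the same hypothetical 8-example dataset as above. The worst-case branching attribute, one that splits every example into its own branch, shows the penalty at work:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent, partitions):
    """Information gain divided by split info (intrinsic information of the split)."""
    n = len(parent)
    gain = entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in partitions)
    return gain / split_info

# An attribute that splits 8 examples into 8 singletons has maximal gain
# (0.954, since every leaf is pure) but also maximal split info (log2(8) = 3),
# so its gain ratio is modest.
parent = ['+'] * 5 + ['-'] * 3
singletons = [[c] for c in parent]
print(round(gain_ratio(parent, singletons), 3))  # 0.318
```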

henryliangt commented 1 year ago

Variations: purity can be measured in different ways; e.g. CART uses the Gini index instead of entropy.
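The Gini index is one minus the sum of squared class proportions; like entropy it is 0 for a pure node. A quick check on the same hypothetical 5/8 vs 3/8 class distribution:

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(round(gini(['+'] * 5 + ['-'] * 3), 5))  # 1 - (5/8)^2 - (3/8)^2 = 0.46875
```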

henryliangt commented 1 year ago

Ensemble methods
• Bagging
• Boosting – AdaBoost and Gradient Boosting
• Random Forest

henryliangt commented 1 year ago

Methods for constructing ensembles

Manipulating the training data – creating multiple training sets by resampling the original data according to some sampling distribution and constructing a classifier for each training set (e.g. Bagging and Boosting)

Manipulating the attributes – using a subset of input features (e.g. Random Forest and Random Subspace)

Manipulating the class labels – will not be covered (e.g. error-correcting output coding)

Manipulating the learning algorithm – e.g. building a set of classifiers with different parameters
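The first construction method, resampling the original data into multiple training sets, can be sketched in a few lines (the seed and dataset here are illustrative):

```python
import random

def bootstrap_sets(data, m, seed=0):
    """Create m training sets, each of len(data) examples sampled with replacement."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(m)]

# One classifier would then be trained on each resampled set.
original = list(range(10))
for train_set in bootstrap_sets(original, m=3):
    print(sorted(train_set))  # duplicates appear; some originals are missing
```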

henryliangt commented 1 year ago

Bagging

also called bootstrap aggregation

A bootstrap sample – definition:
• Given: a dataset D with n examples (the original dataset)
• A bootstrap sample D' drawn from D also contains n examples, randomly chosen from D with replacement (i.e. some examples from D will appear more than once in D', and some will not appear at all)

On average, a bootstrap sample contains about 63.2% of the distinct examples in D: each example is included with probability 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632 as n grows.

Applying bagging to regression tasks: the individual predictions are averaged (for classification they are combined by majority vote).
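The 63.2% figure can be checked empirically by counting how many distinct indices survive in a bootstrap sample (a simulation sketch; the trial count and seed are arbitrary):

```python
import random

def unique_fraction(n, trials=200, seed=0):
    """Average fraction of distinct examples appearing in a bootstrap sample of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        total += len(set(sample)) / n
    return total / trials

# Theory: 1 - (1 - 1/n)^n -> 1 - 1/e ≈ 0.632 for large n
print(round(unique_fraction(1000), 3))  # close to 0.632
```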

henryliangt commented 1 year ago

Boosting AdaBoost and Gradient Boosting

Uses a weighted training set: examples misclassified in the current round receive a higher weight in the next round, so later learners focus on the hard cases.
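One AdaBoost round of weight updating can be sketched as follows (the example weights and the weak learner's mistakes are hypothetical):

```python
import math

def adaboost_weight_update(weights, correct, error):
    """One AdaBoost round: up-weight misclassified examples, then renormalize.
    error is the weighted error rate of the current weak learner (0 < error < 0.5)."""
    alpha = 0.5 * math.log((1 - error) / error)  # the weak learner's vote weight
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    z = sum(new)                                 # normalization constant
    return [w / z for w in new], alpha

weights = [0.25] * 4                    # uniform initial weights over 4 examples
correct = [True, True, True, False]     # hypothetical weak learner: one mistake
error = sum(w for w, ok in zip(weights, correct) if not ok)  # 0.25
weights, alpha = adaboost_weight_update(weights, correct, error)
print([round(w, 3) for w in weights])   # [0.167, 0.167, 0.167, 0.5]
```

After the update the misclassified example carries half the total weight, so the next weak learner is pushed to get it right.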

henryliangt commented 1 year ago

Uses decision trees as base learners, typically shallow (weak) trees

How shallow? Often only one or a few levels deep: AdaBoost commonly uses decision stumps (depth-1 trees), while Gradient Boosting typically grows trees a few levels deep.

Use voting (for classification) and averaging (for regression) to combine the outputs of the individual learners.
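The two combination rules are simple enough to state directly (hypothetical base-learner outputs for a single example):

```python
from collections import Counter

def combine_classification(predictions):
    """Majority vote across the base learners' class predictions."""
    return Counter(predictions).most_common(1)[0][0]

def combine_regression(predictions):
    """Average of the base learners' numeric outputs."""
    return sum(predictions) / len(predictions)

print(combine_classification(['+', '-', '+']))  # '+'
print(combine_regression([2.0, 4.0, 6.0]))      # 4.0
```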

henryliangt commented 1 year ago

Random Forest algorithm

n – number of training examples, m – number of all features, k – number of features considered by each ensemble member at each split (k < m), M – number of ensemble members

Model generation: for each of M iterations:

  1. Bagging – generate a bootstrap sample: sample n instances with replacement from the training data
  2. Random feature selection – grow a decision tree without pruning; at each step, select the best feature to split on by considering only k randomly selected features and calculating information gain

Classification: apply the new example to each of the M decision trees, starting from the root, and assign it to the class corresponding to the leaf it reaches. Combine the decisions of the individual trees by majority voting.
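The two-step loop above can be sketched in miniature. This is a deliberately simplified toy, not a faithful implementation: it uses binary features and grows only depth-1 trees (stumps) rather than full unpruned trees, and the dataset is made up.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_stump(data, k, rng):
    """Pick the best of k randomly chosen binary features by information gain."""
    m = len(data[0][0])
    feats = rng.sample(range(m), k)          # consider only k random features
    labels = [y for _, y in data]
    def gain(f):
        parts = [[y for x, y in data if x[f] == v] for v in (0, 1)]
        return entropy(labels) - sum(len(p) / len(data) * entropy(p)
                                     for p in parts if p)
    f = max(feats, key=gain)
    leaf = {v: Counter(y for x, y in data if x[f] == v).most_common(1)[0][0]
            for v in (0, 1) if any(x[f] == v for x, _ in data)}
    majority = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    return f, leaf, majority

def random_forest(data, M=15, k=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(M):
        boot = [rng.choice(data) for _ in data]   # 1. bootstrap sample
        forest.append(best_stump(boot, k, rng))   # 2. random feature selection
    return forest

def predict(forest, x):
    """Combine the trees' decisions by majority voting."""
    votes = [leaf.get(x[f], default) for f, leaf, default in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy data: binary feature tuples; label '+' iff feature 0 is 1
data = [((1, 0, 1), '+'), ((1, 1, 0), '+'), ((1, 1, 1), '+'),
        ((0, 0, 1), '-'), ((0, 1, 0), '-'), ((0, 0, 0), '-')]
forest = random_forest(data)
print(predict(forest, (1, 0, 0)))
```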
henryliangt commented 1 year ago

Performance depends on
• the accuracy of the individual trees (strength of the trees)
• the correlation between the trees

Ideally: individual trees that are accurate but weakly correlated with each other.

Random Forest typically outperforms a single decision tree
• Robust to overfitting

henryliangt commented 1 year ago

Diversity is generated by manipulating the
• training data (Bagging, Boosting)
• attributes (Random Forest = bagging + random selection of attributes)
• learning algorithm

henryliangt commented 1 year ago

Vocabulary:
• deterministic procedure
• homogeneity (purity)
• signal compression
• ensembles / ensemble methods: Bagging; Boosting – AdaBoost and Gradient Boosting; Random Forest
• the base classifiers are identical
• the base classifiers are independent
• bootstrap samples
• substantially
• majority vote


henryliangt commented 1 year ago

shape: circle = 3/8, square = 4/8, triangle = 1/8

class: + = 5/8, − = 3/8
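Plugging these two distributions into the entropy formula gives the node impurities (assuming these are the attribute-value and class frequencies of an 8-example node):

```python
import math

def entropy(probs):
    """Entropy in bits of a probability distribution given as a list of proportions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(entropy([3/8, 4/8, 1/8]), 3))  # shape distribution: 1.406
print(round(entropy([5/8, 3/8]), 3))       # class distribution: 0.954
```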