jeff1evesque / machine-learning

Web-interface + rest API for classification and regression (https://jeff1evesque.github.io/machine-learning.docs)

Investigate 'bootstrap aggregation' implementation #3025

Open jeff1evesque opened 6 years ago

jeff1evesque commented 6 years ago

We need to investigate the requirements to streamline the following bootstrap aggregation methods:

protojas commented 6 years ago
import random
import numpy as np
import sklearn.ensemble

def bagging(D, Y, X, be=None, k=10):
    # D is the dataset of features
    # Y is the classifications of each point in D
    # X is the points to be classified
    # be is the base estimator (which classifier type to ensemble?) Default
    #    is decision trees
    # k is the number of base estimators to use

    features = 1.0
    # features is how many features to use for each estimator. 1.0
    # means use them all.

    samples = 1.0
    # samples is how many samples to draw for bagging purposes (see
    # explanation of bagging)

    # set up the ensemble classifier
    bagc = sklearn.ensemble.BaggingClassifier(base_estimator=be,
                                              n_estimators=k,
                                              max_samples=samples,
                                              max_features=features,
                                              verbose=0)

    # train the classifier
    bagc.fit(D, Y)

    # return an array of classifications
    return bagc.predict(X)

# the dataset
D = [[0, 1],
     [0, 2],
     [3, 4],
     [5, 6],
     [6, 3],
     [4, -2],
     [-1, 4],
     [2, 3],
     [9, -1],
     [-2, -4],
     [-1, -5],
     [10, 0],
     [-10, 0]]

# assign the above points to a random class
Y = [random.choice(['A', 'B']) for _ in D]

# put it into the right format for sklearn
D = np.array(D)
Y = np.array(Y)

# generate a single random point to classify
X = np.random.randint(low=-10, high=10, size=(1, len(D[0])))

# print out the point and its classification
XC = bagging(D, Y, X, None, 20)
print("Test 1:")
for xc in range(len(XC)):
    print(X[xc], XC[xc])

# generate 10 random points to classify
X2 = np.random.randint(low=-10, high=10, size=(10, len(D[0])))

# print out the 10 points and their classifications
XC2 = bagging(D, Y, X2, None, 30)
print "Test 2:"
for xc2 in range(len(XC2)):
    print X2[xc2], XC2[xc2]
protojas commented 6 years ago

What is a decision tree?

A decision tree typically uses several decisions to classify a point.

For example:

if x > 0 and y = Red, class A
else if x = 0 and y = Red, class B
else if y = Blue, class C
else, class D
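
For illustration, the rule chain above could be written as a small Python function (a sketch only, not code from this project):

def classify(x, y):
    # each condition is one "decision" in the tree
    if x > 0 and y == 'Red':
        return 'A'
    elif x == 0 and y == 'Red':
        return 'B'
    elif y == 'Blue':
        return 'C'
    else:
        return 'D'

print(classify(3, 'Red'))   # A
print(classify(0, 'Blue'))  # C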

======================================

What is an ensemble?

An ensemble classifier usually uses multiple decision trees to classify a point, but it can also combine other classifiers.

The ensemble typically uses either a vote or an average to produce a prediction. In a vote, the class that the majority of classifiers predict is chosen. In an average, the probabilities of each class as predicted by each classifier are averaged and the one with the highest probability is chosen.

So if the first three classifiers predict that the point is class A, but only the fourth and fifth classifiers predict that the point is class B, then a voting ensemble will predict class A.
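
For illustration, a minimal sketch of that voting step in plain Python (not the sklearn internals):

from collections import Counter

# predictions from five individual classifiers for one point
votes = ['A', 'A', 'A', 'B', 'B']

# majority vote: the most common prediction wins
print(Counter(votes).most_common(1)[0][0])  # A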

=======================================

What is bagging (or "bootstrap aggregating")?

Bagging is a way of creating the classifiers for the ensemble.

The bagging method randomly chooses M points from a training dataset (with replacement, so duplicates are possible). Then, it fits a classifier based on this data.

It does this process N times, resulting in N classifiers. These classifiers are then used in a typical ensemble classifier with either voting or averages to predict an outcome.

In the code, M is represented by "samples". N is represented by "k".
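
For illustration, a minimal sketch of the resampling step itself, using numpy directly (sklearn does this internally):

import numpy as np

D = np.array([[0, 1], [0, 2], [3, 4], [5, 6], [6, 3]])
M = 4  # "samples": points drawn per bootstrap sample
N = 3  # "k": number of bootstrap samples / base estimators

for i in range(N):
    # draw M row indices with replacement, so duplicates are possible
    idx = np.random.choice(len(D), size=M, replace=True)
    print("sample", i, D[idx].tolist())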

jeff1evesque commented 6 years ago

This is very good. Assume that I don't understand the meaning of the variables for this bagging classifier, and that I only understand basic statistics (dependent vs. independent variables). What does each point in the dataset D represent? For example, the current database is unstructured. However, there will be an assumption that it will be structured. Specifically, every dataset ingested into the application should theoretically be usable for any algorithm, with the exception that its properties will really define how the dataset(s) will be used. Perhaps also describe X and Y more specifically.

Could you also research the implementation for the bagging regressor. When we create the Python wrappers for these functionalities, it should be pretty trivial: just a matter of querying things out of MongoDB and ingesting them into the format you have demonstrated.
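
For illustration, a minimal sketch of that ingestion step, assuming pymongo and a hypothetical collection layout (the connection string, collection, and field names below are placeholders):

import numpy as np
from pymongo import MongoClient

# hypothetical connection and collection; adjust to the application's actual schema
collection = MongoClient('mongodb://localhost:27017')['ml_database']['datasets']

D, Y = [], []
for doc in collection.find({'dataset_id': 'example'}):
    D.append(doc['features'])  # placeholder field: list of feature values
    Y.append(doc['label'])     # placeholder field: classification / score

D = np.array(D)
Y = np.array(Y)
# D and Y are now in the format the bagging(D, Y, X) function above expects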

protojas commented 6 years ago

Each point in the dataset is typically a series of measurements or attributes about the thing being measured.

For example, if we take a group of people and for each person we measure their height in cm and their weight in lbs, one such point (x,y) might look like this:

(500cm, 120lbs)

Attributes can also include set-based (categorical) values such as eye color. Even though they're non-numeric ('Brown', 'Green', 'Blue'), decision trees will easily handle them (where other algorithms will be less forgiving).

So D, in the case of height and weight, could look like this: np.array([[500,120], [400,123], [520, 150]]), and it represents 3 people's data. One person is 500cm and 120lbs, another is 400cm and 123lbs, and the last is 520cm and 150lbs. D is intended to be a training set, which means the classification is known for this set. So we should also include Y, which are the classifications.

For example, if we were attempting to predict the ethnicity of each of these people based on their measurements, Y might look like:

np.array(['American', 'Korean', 'American'])

, which indicates that the person with the measurements 500cm and 120lbs is American, and so on. The number of items in Y should be equivalent to the number of points in D.

The classifier is trained on this data so when you attempt to predict point(s) X: (510cm, 130lbs)

it can give its guess of American or Korean, based on the data it was provided.

Typically, it is desirable to predict more than one point X, so sklearn asks for an array of points to predict.

This is how basically all classifiers in sklearn work. D, Y, and X are common to just about every classifier, with some exceptions.
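
For illustration, a minimal sketch of the D / Y / X pattern with a single decision tree (using the made-up data from the example above):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# training data: [height_cm, weight_lbs] per person
D = np.array([[500, 120], [400, 123], [520, 150]])
# known classification for each training point
Y = np.array(['American', 'Korean', 'American'])
# new, unclassified point(s) to predict
X = np.array([[510, 130]])

clf = DecisionTreeClassifier()
clf.fit(D, Y)          # learn from D and Y
print(clf.predict(X))  # an array of predicted classes, e.g. ['American']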

===============================

Specifically, bagging attempts to improve the stability of certain algorithms by using multiple "different" training datasets, all generated from the same dataset, to train several classifiers, which then vote on the class prediction of new points.

As an example, using the height and weight dataset again.

[500,120] [400,123] [520,150] [450,160] [600,180] [580,220]

Let's assume samples=4 and k=3.

Then the bagging classifier creates 3 decision trees from 3 different training sets. The three training sets, D1, D2, and D3, might look like this:

D1 = [[500,120], [500,120], [450,160], [600,180]]
D2 = [[580,220], [600,180], [400,123], [500,120]]
D3 = [[450,160], [580,220], [500,120], [520,150]]

Each sample dataset is randomly chosen from the initial dataset D, with replacement, so duplicates may be chosen - note that [500,120] is repeated in D1. Also, since the sampling is random, some points from D may never be selected for any of the sample datasets.

D1, D2, and D3 are then used to create three decision trees, T1, T2, and T3.

Here, k=3 is the number of trees, and samples=4 is the number of points in each sample dataset.

These trees are created and stored internally.

Then, when presented with the points in X to predict, the classifier makes a prediction with each T1, T2, and T3.

For example, T1 might predict the sample point describes a Korean, while T2 and T3 might agree it describes an American.

The ensemble recognizes that there are more votes to predict an American, so it returns "American" for this point.

The number of trees/classifiers k should be greater than the total number of classes, and should also be an odd number.

If there are 3 classes and k=3, then each tree could predict a different class and the ensemble wouldn't be able to come up with an answer. Similarly, if k is even, there could be a 2-2 split between the classifiers and the ensemble wouldn't be able to decide.

protojas commented 6 years ago
import random
import numpy as np
import sklearn.ensemble

def baggingR(D, Y, X, be=None, k=10):
    # D is the dataset of features
    # Y is the target value (score) for each point in D
    # X is the points to be predicted
    # be is the base estimator (which regressor type to ensemble?) Default
    #    is decision trees
    # k is the number of base estimators to use

    features = 1.0
    # features is how many features to use for each estimator. 1.0
    # means use them all.

    samples = 1.0
    # samples is how many samples to draw for bagging purposes (see
    # explanation of bagging)

    # set up the ensemble regressor
    bagc = sklearn.ensemble.BaggingRegressor(base_estimator=be,
                                             n_estimators=k,
                                             max_samples=samples,
                                             max_features=features,
                                             verbose=0)

    # train the regressor
    bagc.fit(D, Y)

    # return an array of predictions
    return bagc.predict(X)

# the dataset
D = [[0, 1],
     [0, 2],
     [3, 4],
     [5, 6],
     [6, 3],
     [4, -2],
     [-1, 4],
     [2, 3],
     [9, -1],
     [-2, -4],
     [-1, -5],
     [10, 0],
     [-10, 0]]

# assign each point a random score
Y = [random.uniform(-1.0, 1.0) for _ in D]

# put it into the right format for sklearn
D = np.array(D)
Y = np.array(Y)

# generate a single random point to classify
X = np.random.randint(low=-10, high=10, size=(1, len(D[0])))

# print out the point and its prediction
XC = baggingR(D, Y, X, None, 20)
print("Test 1:")
for xc in range(len(XC)):
    print(X[xc], XC[xc])

# generate 10 random points to classify
X2 = np.random.randint(low=-10, high=10, size=(10, len(D[0])))

# print out the 10 points and their predictions
XC2 = baggingR(D, Y, X2, None, 30)
print("Test 2:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])
protojas commented 6 years ago

The BaggingRegressor has almost the same use as the BaggingClassifier. The only difference I can see in terms of functionality is that the Regressor doesn't predict from a set of classes, it predicts from a range of real numbers. So instead of having Y be a choice between ['A', 'B'] for each point, it's a choice of a real number between [-1.0, 1.0] in this case. The range is arbitrary, but it's essentially a way of assigning a "score" to each point instead of a classification. This might be useful if you were trying to rank people's obesity on a scale of 1-10, rather than trying to guess their ethnicity.

jeff1evesque commented 6 years ago

What are some of the other (supported / unsupported) options, other than the decision tree base_estimator, for either the classifier or the regressor? Can we do random forest, or some other approach? I wonder if, in the future, we can bootstrap a custom estimator (less important right now).

protojas commented 6 years ago

Well, this is actually very close to a random forest already. A random forest works the same way; the only real difference is in how the trees are generated. Bagging is more versatile because it can be applied to any "base estimator", while random forests use decision trees exclusively. In random forests, the decision tree generation is more random.

Bagging uses randomness to generate k trees. All the trees use the same number of features, though. In Random Forests, each tree randomly selects a subset of the features and builds the tree based off that.

It does this primarily to reduce the correlation between trees.

There's a RandomForestClassifier which is separate from the Bagging one.
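
For illustration, a minimal side-by-side sketch of the two on a small made-up dataset:

import numpy as np
import sklearn.ensemble

D = np.array([[0, 1], [3, 4], [6, 3], [-1, 4], [9, -1], [-2, -4]])
Y = np.array(['A', 'B', 'A', 'B', 'A', 'B'])
X = np.array([[2, 2]])

# plain bagging: every tree is trained on all of the features
bag = sklearn.ensemble.BaggingClassifier(n_estimators=30)
print("Bagging:", bag.fit(D, Y).predict(X))

# random forest: trees additionally randomize which features they split on
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=30)
print("Random forest:", rf.fit(D, Y).predict(X))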

The base estimator can basically be any type of classifier. You just give it an object and it'll train the data using the bagging method on that classifier instead of decision trees and apply the vote method. Example:

BaggingClassifier(KNeighborsClassifier())

Not all classifiers work as well as decision trees for bagging, which is why trees are used as the default.

jeff1evesque commented 6 years ago

Can the API readily support the following classifiers and regressors as its base estimator:

How would we be able to pipe the above estimators, using the sample bagging functions you've outlined?

jeff1evesque commented 6 years ago

Ah, I took a little bit more time to think about your statement:

The base estimator can basically be any type of classifier. You just give it an object and it'll train the data using the bagging method on that classifier instead of decision trees and apply the vote method. Example:

BaggingClassifier(KNeighborsClassifier())

So, basically we just supply the estimator as an argument to the bagging classifier or regressor. However, I'm still curious: when you say any type of classifier, does this mean literally any classifier and regressor? I wonder if we can actually perform any of the following non-exhaustive list:

How compatible is the sklearn API between bagging and the above classifiers and regressors?

protojas commented 6 years ago

I'll do a test in a bit and see

protojas commented 6 years ago
import sklearn.svm
import sklearn.neighbors
import sklearn.ensemble

# using svms instead of decision trees
XC2 = bagging(D, Y, X2, sklearn.svm.SVC(), 30)
print("SVM:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# using k nearest neighbors
XC2 = bagging(D, Y, X2, sklearn.neighbors.KNeighborsClassifier(), 30)
print("KNN:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# using random forests instead of decision trees (note this means
# MULTIPLE random forests, not just one forest)
XC2 = bagging(D, Y, X2, sklearn.ensemble.RandomForestClassifier(), 30)
print("Random Forests:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# using multiple bagging classifiers in a bagging classifier
XC2 = bagging(D, Y, X2, sklearn.ensemble.BaggingClassifier(), 30)
print("Bagging:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])
protojas commented 6 years ago

those all work without errors

SVR doesn't fit into Bagging Classifier because it's regression-based not classifier-based. It'd most likely fit into BaggingRegressor though

jeff1evesque commented 6 years ago

That's great! My guess is that all of them would work for the BaggingRegressor, except the SVM? Basically, an inverse of your previous statement.

protojas commented 6 years ago
import sklearn.svm
import sklearn.neighbors
import sklearn.ensemble

# SVM
XC2 = baggingR(D, Y, X2, sklearn.svm.SVR(), 30)
print("SVM:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# KNN
XC2 = baggingR(D, Y, X2, sklearn.neighbors.KNeighborsRegressor(), 30)
print("KNN:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# Random Forests
XC2 = baggingR(D, Y, X2, sklearn.ensemble.RandomForestRegressor(), 30)
print("Random Forests:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# Bagging Regressor
XC2 = baggingR(D, Y, X2, sklearn.ensemble.BaggingRegressor(), 30)
print("Bagging:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])
protojas commented 6 years ago

you'd have to use the regressor for each since the classes are in a different format

The Classifiers are trying to sort it into a finite set of classes (A,B,C) whereas Regressors are trying to sort it into an infinite range of real numbers (anywhere from -10.0 to 10.0)

jeff1evesque commented 6 years ago

I'm a little confused by:

you'd have to use the regressor for each since the classes are in a different format

Makes sense that classifiers are finite by nature, and regressors are contained within the real number set. But even though the real numbers within -10 to 10 are infinite, do you mean the restriction on the regression set is -INF to +INF? That doesn't make sense to me; is it even worth noting?

Also, I like the generalized outline you have above. I'm thinking maybe we can do one, or maybe two, more concrete examples (one of each). Then, perhaps we can either rearrange all current issues, or simply dedicate a [release-1.0] branch and merge everything related to this research into that branch. The latter makes more sense to me. For example, we can start making wrapper functions for each of the above estimators, then retain the object in case we want to supply it to the bootstrap aggregation methods. So, we'll probably need to make a wrapper for the bootstrap aggregation methods, along with one for each of the above estimators.
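
For illustration, a minimal sketch of what such wrappers might look like (the registry, names, and function signatures below are hypothetical, not existing code in this repository):

import sklearn.svm
import sklearn.neighbors
import sklearn.ensemble

# hypothetical registry of supported base estimators
CLASSIFIERS = {
    'svm': sklearn.svm.SVC,
    'knn': sklearn.neighbors.KNeighborsClassifier,
    'random_forest': sklearn.ensemble.RandomForestClassifier,
}

def make_estimator(name, **kwargs):
    # build and retain an estimator object by name
    return CLASSIFIERS[name](**kwargs)

def bagged_predict(name, D, Y, X, k=10, **kwargs):
    # wrap the bagging function outlined above around a named estimator
    be = make_estimator(name, **kwargs)
    bagc = sklearn.ensemble.BaggingClassifier(base_estimator=be, n_estimators=k)
    bagc.fit(D, Y)
    return bagc.predict(X)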

jeff1evesque commented 6 years ago

Though it's not essential that we do the 1-2 examples, it would be nice. If the BaggingClassifier and BaggingRegressor accept the returned object of the corresponding estimators you've outlined, that's great news in itself. If that's the case, it seems we are in a good position to create some wrappers; up to you.

protojas commented 6 years ago

Makes sense that classifiers are finite by nature, and regressors are contained within the real number set. But even though the real numbers within -10 to 10 are infinite, do you mean the restriction on the regression set is -INF to +INF? That doesn't make sense to me; is it even worth noting?

I just mean that Classifiers use a fixed set of classes, using only things provided in Y, whereas Regressors use real numbers, which can potentially include values not explicitly stated in Y. So if Y looks like [3.4, 2.5, 2.2], a Classifier could only predict those 3 numbers, but a Regressor would be able to predict any real number in the range provided.

So if you were to attempt to use a Regressor in a BaggingClassifier, the Regressor could potentially return 2.23 as its "vote", and the Classifier wouldn't understand this because it thinks the only possible votes are "3.4", "2.5", and "2.2".

Similarly, if you attempted to use a Classifier in a Regressor, the Classifier is built to handle distinct classes, including strings, so sklearn prevents you from feeding the output into a Bagging Regressor because it wouldn't make sense to do so.
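
For illustration, a minimal sketch of that contrast (LinearRegression is used for the regressor side just to make the effect easy to see; the data and labels below are made up):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

D = np.array([[1], [2], [3]])

# classifier: Y is a fixed set of labels, so predictions always come from that set
Yc = np.array(['3.4', '2.5', '2.2'])   # treated as opaque class labels
clf = DecisionTreeClassifier().fit(D, Yc)
print(clf.predict([[1.5]]))            # always one of '3.4', '2.5', '2.2'

# regressor: Y is continuous, so predictions can be values never seen in Y
Yr = np.array([3.4, 2.5, 2.2])
reg = LinearRegression().fit(D, Yr)
print(reg.predict([[1.5]]))            # roughly 3.0, a value not present in Yr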

protojas commented 6 years ago

You could theoretically use the Y values [1,2,3] for both classifiers and regressors, but it wouldn't really make sense to feed a Classifier into a Regressor from a math perspective, because you'd get results like 1.5, and it'd be meaningless if all the training data was classified into integers.

jeff1evesque commented 6 years ago

I think it's possible to obtain real-number values outside the range of your (data)set, especially for poorly trained models. In the simplest linear case, you'd hope prediction results to be on that nice line (at least within some reasonable epsilon). But sometimes the value you obtain doesn't quite fit that line, and there is some error. Similarly, on multi-dimensional regression models, you'll have similar potential errors. But you probably meant to say that it could be any real number, within the set or otherwise. That's a nice tie together with Y. It's been some time since I have had the chance to review these concepts.

protojas commented 6 years ago

Ohh I get what you were saying

Yea, I think it's possible to get numbers outside the range; you'd just have to ask it to predict some outrageous point in comparison to your training data.
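
For illustration, a minimal sketch of that, using a linear base estimator so the extrapolation is easy to produce (made-up data):

import numpy as np
import sklearn.ensemble
from sklearn.linear_model import LinearRegression

# training targets all lie between 0 and 10
D = np.array([[i] for i in range(11)])
Y = np.array([float(i) for i in range(11)])

bagr = sklearn.ensemble.BaggingRegressor(base_estimator=LinearRegression(),
                                         n_estimators=10)
bagr.fit(D, Y)

# an "outrageous" point far from the training data
print(bagr.predict([[1000]]))  # roughly 1000, well outside the 0-10 range of Y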