source: devto
devToUrl: "https://dev.to/swyx/supervised-learning-computational-learning-theory-160h"
devToReactions: 15
devToReadingTime: 6
devToPublishedAt: "2019-02-15T07:03:05.570Z"
devToViewsCount: 541
title: "Supervised Learning: Computational Learning Theory"
published: true
category: note
description: What's the big O of machine learning? Let's put some formal theory around HOW we learn!
tags: Machine Learning, Supervised Learning
⚠️ This chapter is a LOT more theoretical than the others, but it sets up the theoretical foundations for machine learning - skip or persevere, your choice, but be warned!
We've learned about various machine learning algorithms - but what do we know about evaluating algorithms and the theory of learning itself? Here, we want to:
- define learning problems
- show whether specific algorithms are effective/ineffective with regard to the problem
- show some problems are fundamentally hard (aka unsolvable)
## The Core Question
What's the Big O of machine learning?
We know the Big O of regular algorithms from complexity theory: it estimates how resource consumption (e.g. space and time) scales with problem size. With ML, those constraints still apply, but there is a new limited resource: data.
## Resource usage in ML
Much like regular algorithms, ML algorithms have Big O costs for time and space, but they also have one for data consumption.
When we evaluate algorithms, we might want to know things like:
- probability of successful training (1 - delta)
- number of examples to train on (m or n)
- complexity of the hypothesis class (complexity of H)
- accuracy with which we approximate the target concept (epsilon)
- how training examples are presented (big batch, or incremental/iterative)
- how training examples are selected (explained below)
## Training Examples aren't always random
Having control over the questions being asked (and therefore the examples to train on) can vary data consumption wildly. For example:
- if the person who knows the answer controls the questions (and therefore the examples), they can ask very specific questions to get you to the right concept based on limited examples
- if the learner controls the questions, they can design questions (and therefore examples) to eliminate as much of the hypothesis space as possible each time, like in a decision tree (see the sketch after this list)
- however, if the data is controlled by nature (aka randomness, or sampling), then a lot more examples (how many?) will be needed to infer the right concept
- you can also consider an adversarial teacher that deliberately shows you real but unrepresentative examples to mislead you; FAR more data would be required in this case
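To make the learner-controlled case concrete, here's a minimal sketch under made-up assumptions (a toy concept class where hypothesis i is "output the i-th bit"): each query is chosen to split the surviving hypotheses roughly in half, so the right concept is found in about log2(|H|) questions rather than by waiting on random samples.

```python
from itertools import product

# Hypothetical concept class: hypothesis i is "output the i-th bit" of the input.
N_BITS = 4
candidates = list(range(N_BITS))  # indices of hypotheses not yet ruled out
target = 2                        # the hidden target concept c

def imbalance(x):
    # how unevenly input x splits the surviving candidates' predictions
    positives = sum(x[i] for i in candidates)
    return abs(2 * positives - len(candidates))

queries = 0
while len(candidates) > 1:
    query = min(product((0, 1), repeat=N_BITS), key=imbalance)  # best split
    answer = query[target]             # the teacher/nature answers truthfully
    candidates = [i for i in candidates if query[i] == answer]  # eliminate
    queries += 1

print(candidates, queries)  # -> [2] 2, i.e. log2(|H|) = 2 queries
```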
In practice, most training examples are randomly distributed. But it is worth considering how efficiently we can learn from other distributions.
An additional consideration is balance. We learn most from positive examples ("Absence of evidence is not evidence of absence"), so if those are few and far between then we will need a LOT of examples to learn the right concept (aka falsify wrong hypotheses).
## Mistake Bounds
Instead of asking how accurate our learning is to the "real" underlying concept, we can flip the script and try to set an upper bound on the mistakes we make while learning. This is a fascinating idea for linear binary classification models with no noise.
For example, let's say we have a possibility space of K bits (1 or 0). So 10 bits have 1024 (2^10) possibilities. We start by extreme overfitting to the first example we see (so that the first example we saw was the ONLY possible result from our current hypothesis). Then we continually relax the conditions (widening the hypothesis) as we come across more examples. Because each bit only has three possible states in the hypothesis (positive, negative, and absent), new positive examples which conflict with our hypothesis show that the conflicting bits must not matter to the answer, and so we can switch them to absent.
In this way we guarantee that our learning will never make more than K+1 mistakes, for a 2^K possibility space. A cute trick, but it doesn't quite put upper bounds on how many examples we still need, which is the original question.
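Here's a minimal sketch of that scheme (the interface is made up; it assumes a noise-free target that is a conjunction over the K bits). The first mistake pins the hypothesis to the first positive example; every later mistake flips at least one bit to "absent", which caps mistakes at K+1:

```python
ABSENT = None  # "this bit doesn't matter"

def mistake_bound_learner(examples):
    """Online learner over K-bit inputs; makes at most K+1 mistakes.

    examples: iterable of (bits, label) pairs, bits a tuple of 0/1,
    label True iff positive. Assumes a noise-free conjunctive target.
    """
    h = None      # current hypothesis: required value per bit, or ABSENT
    mistakes = 0

    def predict(bits):
        if h is None:
            return False  # before any positive example, say "no"
        return all(req is ABSENT or req == b for req, b in zip(h, bits))

    for bits, label in examples:
        if predict(bits) == label:
            continue
        mistakes += 1
        if h is None:
            h = list(bits)  # overfit exactly to the first positive example
        else:
            # a positive example disagrees on some bits: those bits
            # can't matter, so mark them absent (widen the hypothesis)
            for i, b in enumerate(bits):
                if h[i] is not ABSENT and h[i] != b:
                    h[i] = ABSENT
    return h, mistakes

# e.g. a hidden target "bit0 AND bit2" over 3 bits (hypothetical data):
data = [((1, 0, 1), True), ((1, 1, 1), True), ((0, 1, 1), False)]
print(mistake_bound_learner(data))  # -> ([1, None, 1], 2)
```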
## Questions revisited
We now have fleshed out some competing measures of "big O for Machine Learning":
- Computational Complexity: How much computational effort is needed for a learner to converge? (closest to the classical "big O")
- Sample Complexity: How many training examples in a batch are needed for a learner to create a successful hypothesis?
- Mistake Bounds: How many mistakes can a learner make over an infinite run?
For practical purposes we will ultimately just focus on Sample Complexity, but it is worth considering these other forms.
## Version Spaces
Some definitions:
- The hypothesis space H is the set of all possible hypotheses we are willing to consider
- The true hypothesis c is the "real" one (aka the "target concept")
- The candidate hypothesis h is the one we're currently considering
- The training set S is the example data (e.g. drawn uniformly at random from the population)
- A consistent learner is one that produces a hypothesis consistent with the data it's seen so far
- The version space is the set of all hypotheses that are consistent with the data
Basically, it's "what we haven't ruled out yet". Silly to define, but worth having a term for.
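As a toy illustration (same hypothetical concept class as before, "output the i-th bit"), computing the version space is just filtering H down to the hypotheses consistent with S:

```python
# Toy hypothesis space over 3-bit inputs: hypothesis i outputs bit i.
hypotheses = {i: (lambda x, i=i: x[i]) for i in range(3)}

# Training set S: (input, label) pairs, labels from the hidden target c.
S = [((1, 0, 1), 1), ((0, 0, 1), 1)]

# The version space: every hypothesis consistent with all of S.
version_space = [i for i, h in hypotheses.items()
                 if all(h(x) == y for x, y in S)]
print(version_space)  # -> [2]: only "output bit 2" hasn't been ruled out
```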
## PAC Learning
More definitions:
- The training error is the fraction of training examples misclassified by h. The true hypothesis c would have a training error of 0.
- The true error is the probability that h misclassifies an example drawn at random from the distribution D (not just the fraction of a particular sample it gets wrong).
- C: concept class
- L: learner
- H: hypothesis space
- n: size of the hypothesis space, |H|
- D: distribution over inputs
- epsilon: error goal (less than or equal to 1/2)
- delta: certainty goal (with probability 1 - delta, the algorithm will produce a hypothesis with true error less than or equal to epsilon)
Because we are always learning from samples, we can never reduce our uncertainty to zero, nor our error, and thus our goal is better phrased as looking for a Probably Approximately Correct hypothesis.
Formally: C is said to be PAC-learnable by L using H if and only if L will, with probability 1 - delta, output an h from H such that error(h) < epsilon, in time and samples polynomial in 1/epsilon, 1/delta, and n.
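Restating those definitions in symbols (nothing new here, just the true error and the PAC success criterion):

```latex
\mathrm{error}_D(h) = \Pr_{x \sim D}\big[\, c(x) \neq h(x) \,\big],
\qquad
\Pr\big[\, \mathrm{error}_D(h) \le \epsilon \,\big] \ge 1 - \delta
```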
## Epsilon exhaustion
A version space is epsilon-exhausted iff all remaining hypotheses have true error less than epsilon. Then you can choose any of them and be acceptably fine according to your chosen error rate.
In other words, it is Probably Approximately Correct.
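Or formally, in the same notation:

```latex
VS_S \text{ is } \epsilon\text{-exhausted} \iff \forall h \in VS_S : \mathrm{error}_D(h) < \epsilon
```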
## Haussler's Theorem
This is a theorem for bounding the true error as a function of the number of examples that are drawn. We'll leave the derivation to a video if you're interested:
{% youtube TpQoiUQSPB0 %}
{% youtube KtZTIvuRdss %}
The TL;DR conclusion: the probability that the version space is not epsilon-exhausted after m samples is at most n * e^(-epsilon * m). Requiring that to be at most delta and solving for m gives:
m >= 1/epsilon * (ln n + ln (1/delta))
So we can work out, for example, that with a 10-bit input space (1024 possible inputs) and a hypothesis space of n = 10 (say, one hypothesis per bit), a target epsilon of 0.1 and delta of 0.2 require at least m = 1/0.1 * (ln 10 + ln (1/0.2)) ~= 40 training examples, or less than 4% of the possible inputs.
The theorem also shows what levers to pull and the consequences of those levers. For example, to reduce epsilon down to 1%, we'd then need 400 examples, or 40% of the total possible space.
We are now able to place sample complexity bounds for PAC learning with this equation.
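The bound is easy to compute directly; here's a small helper (the function name is mine) that reproduces the worked examples above:

```python
from math import ceil, log

def pac_sample_bound(h_size, epsilon, delta):
    """Haussler bound: samples needed to epsilon-exhaust a finite
    hypothesis space of size h_size with probability >= 1 - delta."""
    return ceil((1 / epsilon) * (log(h_size) + log(1 / delta)))

print(pac_sample_bound(10, epsilon=0.1, delta=0.2))   # -> 40
print(pac_sample_bound(10, epsilon=0.01, delta=0.2))  # -> 392, i.e. ~400
```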
This is the 8th in a series of class notes as I go through the Georgia Tech/Udacity Machine Learning course. The class textbook is Machine Learning by Tom Mitchell.
## Next in our series
Further notes on this topic:
Hopefully that was a good introduction to Computational Learning Theory. I am planning more primers and would love your feedback and questions on: