swyxio / swyxdotio

This is the repo for swyx's blog - Blog content is created in github issues, then posted on swyx.io as blog pages! Comment/watch to follow along my blog within GitHub
https://swyx.io
MIT License
342 stars 45 forks source link

Supervised Learning: Computational Learning Theory #242

Closed swyxio closed 2 years ago

swyxio commented 2 years ago

source: devto devToUrl: "https://dev.to/swyx/supervised-learning-computational-learning-theory-160h" devToReactions: 15 devToReadingTime: 6 devToPublishedAt: "2019-02-15T07:03:05.570Z" devToViewsCount: 541 title: "Supervised Learning: Computational Learning Theory" published: true category: note description: What's the big O of machine learning? Lets put some formal theory around HOW we learn! tags: Machine Learning, Supervised Learning

This is the 8th in a series of class notes as I go through the Georgia Tech/Udacity Machine Learning course. The class textbook is Machine Learning by Tom Mitchell.

⚠️ This chapter is a LOT more theoretical than the others but set up the theoretical foundations for machine learning - skip or persevere, your choice but be warned

We've learned about various machine learning algorithms - but what do we know about evaluating algorithms and the theory of learning itself? Here, we want to:

The Core Question

What's the Big O of machine learning?

We know about Big O of regular algorithms from Complexity Theory, to estimate how resource consumption scales with the problem size (e.g. for space and time). With ML, those constraints are also true, but the new limited resource is data.

Resource usage in ML

Much like regular algorithms have a Big(O), ML algorithms also have Big(O)s for time and space, but also data consumption.

When we evaluate algorithms, we might want to know things like:

  1. probability of successful training (1 - delta)
  2. number of examples to train on (m or n)
  3. complexity of hypothesis class (complexity of H)
  4. Accuracy we are approximating target concept (Epsilon)
  5. presenting training examples (big batch, or incremental/iterative)
  6. selecting training examples (explained below)

Training Examples aren't always random

Having control over the questions being asked (and therefore the examples to train on) can vary data consumption wildly. For example:

In practice, most training examples are randomly distributed. But it is worth considering how efficiently we can learn from other distributions.

An additional consideration is balance. We learn most from positive examples ("Absence of evidence is not evidence of absence"), so if those are few and far between then we will need a LOT of examples to learn the right concept (aka falsify wrong hypotheses).

Mistake Bounds

Instead of asking how accurate our learning is to the "real" underlying concept, we can flip the script and try to set an upper bound on the mistakes we make while learning. This is a fascinating idea for linear binary classification models with no noise.

For example, let's say we have a possibility space of K bits (1 or 0). So 10 bits have 1024 (2 ^10) possibilities. We start by extreme overfitting to the first example we see (so that the first example we saw was the ONLY possible result from our current hypothesis). Then we continually relax the conditions (widening the hypothesis) as we come across more examples. Because each bit only has three possible states (positive, negative, and absent), new examples which conflict with our hypothesis must not matter to the answer, and so we can switch them to absent.

In this way we guarantee that our learning will never make more than K+1 mistakes, for a 2^K possibility space. A cute trick, but it doesn't quite put upper bounds on how many examples we still need, which is the original question.

Questions revisited

We now have fleshed out some competing measures of "big O for Machine Learning":

For practical purposes we will ultimately just focus on Sample Complexity, but it is worth considering these other forms.

Version Spaces

Some definitions:

Basically, its "what we haven't ruled out yet". Silly to define, but worth having a term for.

PAC Learning

More definitions:

https://slideplayer.com/slide/4983569/16/images/5/Protocol+Given%3A+Learner+observes+sample+S+%3D+%7B+%EF%83%A1+xi%2C+c%28xi%29+%EF%83%B1+%7D.jpg

Because we are always learning from samples, we can never reduce our uncertainty to zero, nor our error, and thus our goal is better phrased as looking for a Probably Approximately Correct hypothesis.

Formally: C is said to be PAC-learnable by L using H if and only if L will, with probability 1-delta, output a h from H such that error(h) < epsilon in time and samples polynomial in 1/epsilon, 1/delta, and n.

Epsilon exhaustion

A version space is epsilon exhausted iff all remaining hypothesis have low error. Then you can choose any of the hypothesis and be acceptably fine according to your chosen error rate.

In other words, it is Probably Approximately Correct.

Haussler's Theorem

This is a theorem for bounding the true error as a function of the number of examples that are drawn. We'll leave the derivation to a video if you're interested:

{% youtube TpQoiUQSPB0 %}

{% youtube KtZTIvuRdss %}

The TL;DR conclusion is that the upper bound for a version space not being epsilon exhausted after m samples is:

m >= 1/epsilon * (ln n + ln (1/delta))

So we can work out, for example, that for an input space of 10 bits (1024 possibilities), target epsilon of 0.1, delta of 0.2, we will need at least m = 1/10 * (ln 10 + ln (1/0.2)) ~= 40 training examples, or less than 4% of the possible space.

The theorem also shows what levers to pull and the consequences of those levers. For example, to reduce epsilon down to 1%, we'd then need 400 examples, or 40% of the total possible space.

We are now able to place sample complexity bounds for PAC learning with this equation.

Next in our series

Further notes on this topic:

Hopefully that was a good introduction to Computational Learning Theory. I am planning more primers and would love your feedback and questions on: