martinapugliese / tales-science-data

WORK UNDER RESTRUCTURING

bias and variance #65

Open martinapugliese opened 7 years ago

martinapugliese commented 7 years ago
martinapugliese commented 6 years ago

Also, learning curves for diagnosing bias and variance: https://gallery.mailchimp.com/dc3a7ef4d750c0abfc19202a3/files/6cba692b-290d-4c7b-93c9-04c3b6cdd96b/Ng_MLY06.pdf

martinapugliese commented 6 years ago

See also the examples of bias and variance in the same document: https://gallery.mailchimp.com/dc3a7ef4d750c0abfc19202a3/files/db5cc9c4-1964-420f-bce6-24835a2aa097/Ng_MLY01_05.pdf

martinapugliese commented 6 years ago

And this good page https://www.dataquest.io/blog/learning-curves-machine-learning/

martinapugliese commented 6 years ago

See Bishop for the decomposition of the error into noise, bias and variance - or Wikipedia.
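
For reference, the standard decomposition for squared loss (my transcription of the textbook/Wikipedia result), with y = f(x) + ε, noise variance σ², and the expectation taken over training sets:

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}\left[\hat{f}(x)\right] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```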

martinapugliese commented 3 years ago

From an old notebook I had on this, here:

From Ng

The error is decomposed into two things. Bias is the error on the training set; variance is the difference between the errors on the test and training sets. The total error is the sum of the two.
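
As a quick numeric illustration of that informal definition (the figures are made up):

```python
# Ng-style informal bookkeeping; the numbers are made up for illustration.
train_error = 0.15                    # error on the training set
dev_error = 0.16                      # error on the dev/test set

bias = train_error                    # 15%
variance = dev_error - train_error    # 1%
total = bias + variance               # 16%, i.e. the dev error
```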

TODO: find a more formal definition.

You can reduce either of them, or both (but reducing both is much harder).

Some bias may be unavoidable: the unavoidable bias is the optimal error rate, also called the Bayes error rate. It can be estimated by having a human perform the task, which is harder if it's a task even a human has no idea how to do.

- Variance can be reduced by having more training data; there is no unavoidable variance.
- Variance is also reduced by regularization, but this might increase bias.
- Bias can be reduced with a more complex model (mind that this can result in increased variance, and it also costs more in computation).
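
As a minimal sketch of the regularization point in the list above (my own, assuming scikit-learn; the polynomial model, data and alpha values are made up for illustration): a stronger L2 penalty shrinks the train/test gap (variance) but raises the training error (bias).

```python
# Sketch: how L2 regularization trades variance for bias (illustrative only).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 80))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.25, size=80)  # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for alpha in [1e-4, 1e-2, 1.0, 100.0]:  # regularization strength
    model = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=alpha))
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))  # training error ~ bias
    te = mean_squared_error(y_te, model.predict(X_te))
    print(f"alpha={alpha:g}: train MSE={tr:.3f}, test MSE={te:.3f}, gap={te - tr:.3f}")
```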

Tradeoff

Can't easily reduce both at the same time.

Reduce (avoidable) bias

Adding training data doesn't help here (more data reduces variance, not bias).

Reduce variance

Learning curves

Plot the test error against the training-set size: it should decrease. If it plateaus, that tells you adding data won't improve things any more. Also plot the training error: it should increase with training size (mislabels, ambiguities, ...). You can get an idea of whether adding more training data would help from the trend of these two curves. Variance is the gap between the two curves: if they're far apart, adding more data may help reduce it.
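
A minimal sketch of this diagnostic, assuming scikit-learn (the `learning_curve` helper, logistic regression and synthetic data are my choices, not from the notes):

```python
# Sketch: training vs validation error as a function of training-set size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

train_err = 1 - train_scores.mean(axis=1)  # should increase with training size
val_err = 1 - val_scores.mean(axis=1)      # should decrease, then plateau

plt.plot(sizes, train_err, "o-", label="training error")
plt.plot(sizes, val_err, "o-", label="validation error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()
# The vertical gap between the curves is a proxy for variance: if it is large
# and the validation curve is still falling, more data may help.
```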

From Scott Fortmann-Roe

Gives a conceptual and graphical explanation (the bulls-eye diagram, very interesting, to reproduce here).

Also gives a mathematical definition.

Gives an example (voting intention) that shows the concept of bias and variance (to reproduce, or create a similar one).

Gives an interactive example with kNN, also used for that page (see the sketch after these notes).

Gives some (fairly rigorous) suggestions, to turn into TODOs, for dealing with the tradeoff.
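
A rough, non-interactive sketch of the kNN idea mentioned above (scikit-learn, a noisy sine and these values of k are my own choices): small k gives low bias and high variance, large k the opposite.

```python
# Sketch: the number of neighbours k controls the bias-variance tradeoff in kNN.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)  # noisy sine

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in [1, 5, 25, 100]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, knn.predict(X_tr))
    te = mean_squared_error(y_te, knn.predict(X_te))
    print(f"k={k:>3}: train MSE={tr:.3f}, test MSE={te:.3f}")
# k=1 memorises the training set (train MSE ~ 0, larger test MSE: high variance);
# very large k underfits (both errors high: high bias).
```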

References

  1. A. Ng, Machine Learning Yearning, draft available here
  2. S. Fortmann-Roe, Understanding the Bias-Variance Tradeoff