oldoc63 / learningDS

Learning DS with Codecademy and Books

Finding the "best" line #467

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

In the last exercise, we tried to eyeball what the best-fit line might look like. In order to actually choose a line, we need to come up with some criteria for what "best" actually means.

Depending on our ultimate goals and data, we might choose different criteria; however, a common choice for linear regression is ordinary least squares (OLS). In simple OLS regression, we assume that the relationship between two variables x and y can be modeled as: $y = mx + b + error$.
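As an illustration of that model (not part of the lesson), here is a minimal sketch that simulates data as a line plus random noise; the parameter values and noise level are made up:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

m_true, b_true = 2.0, 1.0                  # assumed "true" slope and intercept (made up)
x = np.linspace(0, 10, 20)                 # 20 evenly spaced x values
error = rng.normal(0, 1.5, size=x.shape)   # Gaussian noise playing the role of the error term
y = m_true * x + b_true + error            # y = mx + b + error
print(np.column_stack((x, y))[:3])         # first few (x, y) pairs
```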

We define "best" as the line that minimizes the total squared error for all data points. This total squared error is called the loss function in machine learning.
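To make that concrete, here is a small Python sketch (the function names and example data are illustrative, not taken from the lesson) that computes the total squared error for a candidate line $y = mx + b$:

```python
def get_y(m, b, x):
    # Predicted y value on the candidate line for a given x
    return m * x + b

def calculate_loss(m, b, x_values, y_values):
    # Sum of squared vertical distances between each point and the line
    total_loss = 0
    for x, y in zip(x_values, y_values):
        difference = y - get_y(m, b, x)
        total_loss += difference ** 2
    return total_loss

# Example: three made-up points and the candidate line y = 2x + 1
x_values = [1, 2, 3]
y_values = [2, 5, 9]
print(calculate_loss(2, 1, x_values, y_values))  # (-1)^2 + 0^2 + 2^2 = 5
```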

[Figure: a fitted line with two data points, one labeled -1 (one unit below the line) and one labeled 3 (three units above the line)]

oldoc63 commented 1 year ago

In this plot, we see two points on either side of the line. One of the points is one unit below the line (labeled -1). The other point is three units above the line (labeled 3). The total squared error (loss) is: $loss = (-1)^2+(3)^2=1+9=10$.

Notice that we square each individual distance so that points below and above the line contribute equally to loss (when we square a negative number, the result is positive). To find the best fit line, we need to find the slope and intercept of the line that minimizes loss.
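One way to make "find the slope and intercept that minimize loss" concrete is a brute-force grid search over candidate values. This is only a sketch with made-up data and an arbitrary 0.1 step size; OLS also has a closed-form solution, which the search merely approximates:

```python
def calculate_loss(m, b, x_values, y_values):
    # Total squared error of the line y = m*x + b over the data points
    return sum((y - (m * x + b)) ** 2 for x, y in zip(x_values, y_values))

# Made-up data points
x_values = [1, 2, 3, 4, 5]
y_values = [2, 3, 5, 8, 9]

# Candidate slopes and intercepts from -10 to 10 in steps of 0.1
possible_ms = [i / 10 for i in range(-100, 101)]
possible_bs = [i / 10 for i in range(-100, 101)]

best_m, best_b, smallest_loss = 0, 0, float("inf")
for m in possible_ms:
    for b in possible_bs:
        loss = calculate_loss(m, b, x_values, y_values)
        if loss < smallest_loss:
            best_m, best_b, smallest_loss = m, b, loss

print(best_m, best_b, smallest_loss)
```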

oldoc63 commented 1 year ago

https://content.codecademy.com/programs/data-science-path/line-fitter/line-fitter.html

oldoc63 commented 1 year ago

The interactive visualization in the browser lets you try to find the line of best fit for a random set of data points:

You can see the total loss on the right side of the visualization. To get the line of best fit, we want this loss to be as small as possible.

To see whether you found the best line, check the "Plot-Best-Fit" box.

To randomize a new set of points and fit a new line, enter the number of points you want (try 8!) and press Randomize Points.
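If you want to check a hand-tuned line outside the browser, here is a sketch (with made-up data) that compares it against the ordinary least squares fit from `numpy.polyfit`, which is conceptually what the "Plot-Best-Fit" line shows:

```python
import numpy as np

# Made-up data points
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([1.5, 2.1, 3.9, 4.2, 6.1, 5.8, 7.9, 9.0])

def loss(m, b):
    # Total squared error of the line y = m*x + b
    return np.sum((y - (m * x + b)) ** 2)

m_best, b_best = np.polyfit(x, y, 1)   # slope and intercept of the degree-1 least-squares fit
my_m, my_b = 1.0, 0.5                  # a hand-tuned guess

print("best fit:", m_best, b_best, "loss:", loss(m_best, b_best))
print("my line: ", my_m, my_b, "loss:", loss(my_m, my_b))
```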