Just a minor comment explaining why grad = [2 * error * x, 2 * error] is used in the function linear_gradient ("Chapter 8. Gradient Descent", section "Using Gradient Descent to Fit Models"). See the note in the Python comments below. It took me a while to work out why the gradient was computed that way, so I hope someone finds this useful.
from typing import List
Vector = List[float]

def linear_gradient(x: float, y: float, theta: Vector) -> Vector:
    slope, intercept = theta
    predicted = slope * x + intercept  # The prediction of the model.
    error = predicted - y              # error is (predicted - actual).
    squared_error = error ** 2         # We'll minimize the squared error, which depends on
                                       # the current guesses for slope (m) and intercept (n):
    # e_sq(m, n) = error^2 = (y_predicted - y_actual)^2 = (m*x + n - y_actual)^2
    # Its partial derivatives follow from the chain rule, (f^2)' = 2*f*f':
    #   d(e_sq)/dm = 2*error*x  (since d(error)/dm = x)
    #   d(e_sq)/dn = 2*error    (since d(error)/dn = 1)
    grad = [2 * error * x, 2 * error]
    return grad
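To see why this gradient is the right one, here's a minimal batch gradient-descent loop that fits a line using linear_gradient. The synthetic data (y = 20*x + 5), learning rate, and epoch count are illustrative choices, not necessarily the book's exact listing:

```python
from typing import List

Vector = List[float]

def linear_gradient(x: float, y: float, theta: Vector) -> Vector:
    slope, intercept = theta
    predicted = slope * x + intercept
    error = predicted - y
    return [2 * error * x, 2 * error]  # [d(e_sq)/dm, d(e_sq)/dn]

# Synthetic data generated from the "true" line y = 20*x + 5.
inputs = [(x, 20 * x + 5) for x in range(-50, 50)]

theta = [1.0, 1.0]       # start from a deliberately wrong guess
learning_rate = 0.001

for epoch in range(5000):
    # Average the gradient over the whole dataset (batch gradient descent).
    grad = [0.0, 0.0]
    for x, y in inputs:
        g = linear_gradient(x, y, theta)
        grad[0] += g[0] / len(inputs)
        grad[1] += g[1] / len(inputs)
    # Step downhill: subtract the gradient scaled by the learning rate.
    theta = [theta[0] - learning_rate * grad[0],
             theta[1] - learning_rate * grad[1]]

slope, intercept = theta  # should be close to 20 and 5
```

Because each step moves theta against the averaged gradient of the squared error, slope and intercept converge toward the true values 20 and 5, which is exactly why the two components of grad must be the correct partial derivatives.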