karpathy / nn-zero-to-hero

Neural Networks: Zero to Hero

Question on Simple Optimization Step #54

Closed mitnk closed 3 weeks ago

mitnk commented 3 weeks ago

Hello everyone,

Thanks karpathy for the great video!

I have what is maybe a math-noob question. In the video at ~00:52:19, we increase the leaf nodes by a small amount in "the direction of the gradient":

a.data += 0.01 * a.grad
b.data += 0.01 * b.grad
c.data += 0.01 * c.grad
f.data += 0.01 * f.grad

And it was said: "because we nudged all the values in the direction of the gradient, we expect a slightly less negative L" (L goes from -8.0 to -7.x, etc.). (Reminder: L = (a * b + c) * f, and some of the inputs are initialized as negative, e.g. b = -3.0.)

Why is this? Some of the leaf gradients are negative (e.g. b.grad = -4, c.grad = -2), so how does this update guarantee a positive impact on L? Is there some math behind it?

(Link to the Notebook)
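
For concreteness, here is the nudge written out in plain Python. This is a small sketch that assumes the initial values used in the lecture (a=2.0, b=-3.0, c=10.0, f=-2.0), which are consistent with the gradients quoted above:

```python
# The nudge from the video, written out in plain Python, assuming the
# lecture's initial values (a=2.0, b=-3.0, c=10.0, f=-2.0).
a, b, c, f = 2.0, -3.0, 10.0, -2.0

L0 = (a * b + c) * f        # -8.0

# gradients of L with respect to each leaf (chain rule, worked out by hand)
a_grad = b * f              #  6.0
b_grad = a * f              # -4.0
c_grad = f                  # -2.0
f_grad = a * b + c          #  4.0

# nudge every leaf a tiny bit in the direction of its own gradient
a += 0.01 * a_grad
b += 0.01 * b_grad
c += 0.01 * c_grad
f += 0.01 * f_grad

L1 = (a * b + c) * f
print(L0, L1)               # -8.0 -> about -7.29: less negative, i.e. L went up
```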

mitnk commented 3 weeks ago

I'm not sure if there is a more straightforward and intuitive explanation, but I got the answer below from GPT.


To determine the impact on $L$ of increasing each leaf node (a, b, c, f) in the direction of its gradient, we can analyze how these perturbations influence $L$. Let's denote the initial values of the leaf nodes as $a_0, b_0, c_0, \text{ and } f_0$.

Given:

$$ L = (a \cdot b + c) \cdot f $$

The gradients $\text{a.grad}, \text{b.grad}, \text{c.grad}, \text{and f.grad}$ (i.e. the partial derivatives of $L$ with respect to each leaf) are as follows:

$$ \text{a.grad} = \frac{\partial L}{\partial a} = b \cdot f $$

$$ \text{b.grad} = \frac{\partial L}{\partial b} = a \cdot f $$

$$ \text{c.grad} = \frac{\partial L}{\partial c} = f $$

$$ \text{f.grad} = \frac{\partial L}{\partial f} = a \cdot b + c $$
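
As a side check (a sketch, assuming the Value class built in the lecture, which is also available as micrograd.engine.Value if micrograd is installed), these formulas can be confirmed by backprop:

```python
# Confirming the partials via backprop. This assumes the Value class from the
# lecture (e.g. `from micrograd.engine import Value` with micrograd installed).
from micrograd.engine import Value

a, b, c, f = Value(2.0), Value(-3.0), Value(10.0), Value(-2.0)  # lecture values
L = (a * b + c) * f
L.backward()

print(a.grad, b.grad, c.grad, f.grad)  # b*f, a*f, f, a*b+c -> 6.0 -4.0 -2.0 4.0
print(L.data)                          # -8.0
```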

When we increment the leaf nodes slightly in the direction of their gradients:

$$ a_1 = a_0 + 0.01 \cdot \text{a.grad} = a_0 + 0.01 \cdot (b_0 \cdot f_0) $$

$$ b_1 = b_0 + 0.01 \cdot \text{b.grad} = b_0 + 0.01 \cdot (a_0 \cdot f_0) $$

$$ c_1 = c_0 + 0.01 \cdot \text{c.grad} = c_0 + 0.01 \cdot f_0 $$

$$ f_1 = f_0 + 0.01 \cdot \text{f.grad} = f_0 + 0.01 \cdot (a_0 \cdot b_0 + c_0) $$

Let’s compute the new value of $L$:

$$ L_1 = (a_1 \cdot b_1 + c_1) \cdot f_1 $$

To analyze the overall impact, let's look at the difference $\Delta L = L_1 - L_0$:

1. First-order approximation:

The first-order approximation in the Taylor series for a multivariable function is given by:

$$ \Delta L \approx \sum_i \frac{\partial L}{\partial x_i} \Delta x_i $$

where

$$x_i \in \{a, b, c, f\}$$

For small $\Delta x_i$:

$$ \Delta L \approx \text{a.grad} \cdot \Delta a + \text{b.grad} \cdot \Delta b + \text{c.grad} \cdot \Delta c + \text{f.grad} \cdot \Delta f $$

Substituting the increments:

$$ \Delta a = 0.01 \cdot \text{a.grad} = 0.01 \cdot (b_0 \cdot f_0) $$

$$ \Delta b = 0.01 \cdot \text{b.grad} = 0.01 \cdot (a_0 \cdot f_0) $$

$$ \Delta c = 0.01 \cdot \text{c.grad} = 0.01 \cdot f_0 $$

$$ \Delta f = 0.01 \cdot \text{f.grad} = 0.01 \cdot (a_0 \cdot b_0 + c_0) $$

Thus:

$$ \Delta L \approx (b_0 \cdot f_0) \cdot 0.01 \cdot (b_0 \cdot f_0) + (a_0 \cdot f_0) \cdot 0.01 \cdot (a_0 \cdot f_0) + f_0 \cdot 0.01 \cdot f_0 + (a_0 \cdot b_0 + c_0) \cdot 0.01 \cdot (a_0 \cdot b_0 + c_0) $$

$$ \Delta L \approx 0.01 \left[ (b_0 f_0)^2 + (a_0 f_0)^2 + f_0^2 + (a_0 b_0 + c_0)^2 \right] $$

Since each term in the summation is a square, and squares of real numbers are non-negative, the term inside the square brackets is non-negative. Therefore:

$$\Delta L \geq 0$$
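
A quick numeric sanity check of this approximation (again a sketch assuming the lecture's values a=2.0, b=-3.0, c=10.0, f=-2.0):

```python
# Compare the first-order prediction 0.01 * sum(grad^2) against the actual
# change in L, assuming the lecture's values a=2.0, b=-3.0, c=10.0, f=-2.0.
a0, b0, c0, f0 = 2.0, -3.0, 10.0, -2.0

grads = [b0 * f0, a0 * f0, f0, a0 * b0 + c0]        # [6.0, -4.0, -2.0, 4.0]
predicted = 0.01 * sum(g * g for g in grads)        # 0.01 * 72 = 0.72

L0 = (a0 * b0 + c0) * f0                            # -8.0
a1, b1, c1, f1 = (x + 0.01 * g for x, g in zip((a0, b0, c0, f0), grads))
actual = (a1 * b1 + c1) * f1 - L0                   # about 0.714

print(predicted, actual)                            # close, and both positive
```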

Conclusion

Under this first-order approximation, increasing the leaf nodes in the direction of their respective gradients will either increase $L$ or leave it unchanged if all gradients are zero. Thus, $L$ will always increase or stay constant, but never decrease.

mitnk commented 3 days ago

Since the gradients of all the variables are taken with respect to L, nudging each variable in the direction of its own gradient will make L increase. By definition, that is what the derivative tells us.

Think of L as the loss. In this and several follow-up video sessions, we use this basic fact to reduce the loss and get a better model. For L = (a * b + c) * f, the sign of each parameter's gradient is not important: if we move each value in the direction opposite to its gradient, the loss will decrease, as shown in the image below.

image
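
As a small follow-up sketch (plain Python, gradients written out by hand for L = (a * b + c) * f, with the same assumed starting values as above), stepping against the gradient makes L go down at every iteration:

```python
# Minimal gradient-descent sketch for L = (a*b + c) * f, assuming the
# lecture's starting values. Stepping AGAINST the gradient lowers L each step.
a, b, c, f = 2.0, -3.0, 10.0, -2.0

for step in range(5):
    L = (a * b + c) * f
    print(step, L)                                   # L decreases every step
    # partial derivatives of L with respect to each leaf
    a_grad, b_grad, c_grad, f_grad = b * f, a * f, f, a * b + c
    # move each value opposite to its gradient (L plays the role of the loss)
    a -= 0.01 * a_grad
    b -= 0.01 * b_grad
    c -= 0.01 * c_grad
    f -= 0.01 * f_grad
```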