Closed: mitnk closed this 3 weeks ago
Not sure if there is a more straightforward and intuitive explanation, but I got an answer from GPT as below.
To determine the impact of increasing each leaf node (a, b, c, f) in the direction of their respective gradients on $L$, we can analyze how these perturbations influence $L$. Let's denote the initial value of the leaf nodes as $a_0, b_0, c_0, \text{ and } f_0$.
Given:
$$ L = (a \cdot b + c) \cdot f $$
The gradients $\text{a.grad}$, $\text{b.grad}$, $\text{c.grad}$, and $\text{f.grad}$ (the partial derivatives of $L$ with respect to each leaf) are as follows:
$$ \text{a.grad} = \frac{\partial L}{\partial a} = b \cdot f $$
$$ \text{b.grad} = \frac{\partial L}{\partial b} = a \cdot f $$
$$ \text{c.grad} = \frac{\partial L}{\partial c} = f $$
$$ \text{f.grad} = \frac{\partial L}{\partial f} = a \cdot b + c $$
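As a quick sanity check of these formulas, here is a small Python sketch. The concrete values are the ones quoted elsewhere in this thread (`b = -3.0`, `L = -8.0`, `b.grad = -4`, `c.grad = -2`), which pin down `a = 2.0`, `c = 10.0`, `f = -2.0`:

```python
# Example values consistent with the thread: L = -8.0, b.grad = -4, c.grad = -2
a, b, c, f = 2.0, -3.0, 10.0, -2.0
L = (a * b + c) * f        # (2*-3 + 10) * -2 = -8.0

# Analytic gradients derived above
a_grad = b * f             # 6.0
b_grad = a * f             # -4.0
c_grad = f                 # -2.0
f_grad = a * b + c         # 4.0
```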
When we increment the leaf nodes slightly in the direction of their gradients:
$$ a_1 = a_0 + 0.01 \cdot \text{a.grad} = a_0 + 0.01 \cdot (b_0 \cdot f_0) $$
$$ b_1 = b_0 + 0.01 \cdot \text{b.grad} = b_0 + 0.01 \cdot (a_0 \cdot f_0) $$
$$ c_1 = c_0 + 0.01 \cdot \text{c.grad} = c_0 + 0.01 \cdot f_0 $$
$$ f_1 = f_0 + 0.01 \cdot \text{f.grad} = f_0 + 0.01 \cdot (a_0 \cdot b_0 + c_0) $$
Let’s compute the new value of $L$:
$$ L_1 = (a_1 \cdot b_1 + c_1) \cdot f_1 $$
To analyze the overall impact, consider the difference $\Delta L = L_1 - L_0$:
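Numerically, using the values quoted in the thread (`b = -3.0`, starting `L = -8.0`), a sketch of this nudge shows `L` moving from -8.0 to about -7.29, i.e. less negative:

```python
a0, b0, c0, f0 = 2.0, -3.0, 10.0, -2.0
L0 = (a0 * b0 + c0) * f0          # -8.0

h = 0.01
a1 = a0 + h * (b0 * f0)           # nudge by a.grad = b*f
b1 = b0 + h * (a0 * f0)           # nudge by b.grad = a*f
c1 = c0 + h * f0                  # nudge by c.grad = f
f1 = f0 + h * (a0 * b0 + c0)      # nudge by f.grad = a*b + c

L1 = (a1 * b1 + c1) * f1          # about -7.2865, less negative than -8.0
```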
1. First-order approximation:
The first-order approximation in the Taylor series for a multivariable function is given by:
$$ \Delta L \approx \sum_i \frac{\partial L}{\partial x_i} \Delta x_i $$
where
$$x_i \in \{a, b, c, f\}$$
For small $\Delta x_i$:
$$ \Delta L \approx \text{a.grad} \cdot \Delta a + \text{b.grad} \cdot \Delta b + \text{c.grad} \cdot \Delta c + \text{f.grad} \cdot \Delta f $$
Substituting the increments:
$$ \Delta a = 0.01 \cdot \text{a.grad} = 0.01 \cdot (b_0 \cdot f_0) $$
$$ \Delta b = 0.01 \cdot \text{b.grad} = 0.01 \cdot (a_0 \cdot f_0) $$
$$ \Delta c = 0.01 \cdot \text{c.grad} = 0.01 \cdot f_0 $$
$$ \Delta f = 0.01 \cdot \text{f.grad} = 0.01 \cdot (a_0 \cdot b_0 + c_0) $$
Thus:
$$ \Delta L \approx (b_0 \cdot f_0) \cdot 0.01 \cdot (b_0 \cdot f_0) + (a_0 \cdot f_0) \cdot 0.01 \cdot (a_0 \cdot f_0) + f_0 \cdot 0.01 \cdot f_0 + (a_0 \cdot b_0 + c_0) \cdot 0.01 \cdot (a_0 \cdot b_0 + c_0) $$
$$ \Delta L \approx 0.01 \left[ (b_0 f_0)^2 + (a_0 f_0)^2 + f_0^2 + (a_0 b_0 + c_0)^2 \right] $$
Since each term in the summation is a square, and squares of real numbers are non-negative, the term inside the square brackets is non-negative. Therefore:
$$\Delta L \geq 0$$
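A quick numeric check of this sum-of-squares estimate, again with the values quoted in the thread: the first-order prediction is $0.01 \cdot (36 + 16 + 4 + 16) = 0.72$, and the exact change after the nudge is about $0.7135$ — close, and both positive:

```python
a0, b0, c0, f0 = 2.0, -3.0, 10.0, -2.0
L0 = (a0 * b0 + c0) * f0

# First-order estimate: 0.01 * sum of squared gradients
approx = 0.01 * ((b0 * f0) ** 2 + (a0 * f0) ** 2 + f0 ** 2 + (a0 * b0 + c0) ** 2)

# Exact change after the same nudge
h = 0.01
a1, b1 = a0 + h * (b0 * f0), b0 + h * (a0 * f0)
c1, f1 = c0 + h * f0, f0 + h * (a0 * b0 + c0)
exact = (a1 * b1 + c1) * f1 - L0
```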
Conclusion
Under this first-order approximation, increasing the leaf nodes in the direction of their respective gradients will either increase $L$ or leave it unchanged if all gradients are zero. Thus, $L$ will always increase or stay constant, but never decrease.
Since the grads of all the variables are taken with respect to `L`, increasing each of them in the direction of its own grad will make `L` increase, essentially by definition: that is what the derivative means. Think of `L` as the Loss; in this and several follow-up video sessions, we use this basic fact to reduce the loss and get a better model. So for `L = (a * b + c) * f`, the sign of each parameter's grad is not important: if we instead move each value in the opposite direction of its grad, we will reduce the Loss, as shown in the image below.
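A minimal sketch of that descent step, using values consistent with the thread (`b = -3.0`, starting `L = -8.0`) and treating `L` as the loss to minimize:

```python
# Gradient descent: step each parameter *against* its own gradient
a, b, c, f = 2.0, -3.0, 10.0, -2.0
L0 = (a * b + c) * f              # -8.0

lr = 0.01
grads = {"a": b * f, "b": a * f, "c": f, "f": a * b + c}  # computed before updating
a -= lr * grads["a"]
b -= lr * grads["b"]
c -= lr * grads["c"]
f -= lr * grads["f"]

L1 = (a * b + c) * f              # about -8.7263: the "loss" went down
```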
Hello everyone,
Thanks karpathy for the great video!
I have a maybe math-noob question: in the video around 00:52:19, we increase the leaf nodes by a small amount in "the direction of the gradient", and it is said that "because we nudged all the values in the direction of the gradient, we expect a slightly less negative L" (L goes from -8.0 to -7.x, etc.). As a reminder:

`L = (a * b + c) * f`

and some inputs are initialized as negative, e.g. `b = -3.0`. Why is this? Some of the grads of the leaves are negative (e.g. `b.grad = -4`, `c.grad = -2`), so how do such increases guarantee a positive impact on L? Is there some math behind this?
(Link to the Notebook)