zhuyu2015 opened 1 month ago
Line 367 is this one:
float dL_dmd = 2.0f * (T * alpha) * (m_d * final_A - final_D) * dL_dreg;
Hi, here is the formulation for back-propagation of the depth-distortion loss:
$$L= \sum_{i=0}^{N-1}\sum_{j=0}^{N-1}w_iw_j(d_i-d_j)^2$$
Now we specifically analyze the $k$-th item and do some simplification,
$$ L_k = \sum_{j=0}^{k-1}w_kw_j(d_k-d_j)^2 + \sum_{i=k+1}^{N-1}w_iw_k(d_i-d_k)^2 $$
$$ L_k = \sum_{j=0}^{k-1}w_kw_j(d_k^2-2d_kd_j + d_j^2) + \sum_{i=k+1}^{N-1}w_iw_k(d_i^2-2d_id_k+d_k^2)$$
Differentiating with respect to $w_k$ (each sum loses one factor of $w_k$) gives: $$\frac{dL_k}{dw_k} = \sum_{j=0}^{k-1}w_j(d_k^2-2d_kd_j+d_j^2) + \sum_{i=k+1}^{N-1}w_i(d_i^2-2d_id_k+d_k^2)$$
Here $A=\sum_{j}w_j$ and $D=\sum_{j}w_jd_j$ (final_A and final_D in the code), and $D^2$ denotes the accumulated squared depth $\sum_{j}w_jd_j^2$ (not the square of $D$); adding and subtracting $w_kd_k$ to complete the sums:
$$\frac{dL_k}{dw_k} = (D^2-w_kd_k^2) + d_k^2(A-w_k) - 2d_k\Big(\sum_{j=0}^{k-1}w_jd_j + \sum_{i=k+1}^{N-1}w_id_i +w_kd_k -w_kd_k\Big) = D^2 +d_k^2A-2d_kD $$
$$\frac{dL_k}{dd_k} = 2\Big(\sum_{j=0}^{k-1}w_kw_j(d_k-d_j) + \sum_{i=k+1}^{N-1}w_iw_k(d_k-d_i)\Big) = 2w_k\big(d_k(A-w_k) - (D-d_kw_k)\big) = 2w_k(d_kA - D)$$
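The two closed forms above are easy to sanity-check numerically. Below is a small Python sketch (not the repository's code; `A`, `D`, and `D2` play the roles of the accumulated sums, i.e. final_A and final_D in the code, with `D2` as the analogous accumulated squared depth), comparing them against central finite differences on the pairwise loss:

```python
import random

# Pairwise depth-distortion loss as in Eq. (17): sum over i > j of w_i w_j (d_i - d_j)^2
def loss(w, d):
    return sum(w[i] * w[j] * (d[i] - d[j]) ** 2
               for i in range(len(w)) for j in range(i))

random.seed(0)
n, k = 6, 2
w = [random.random() for _ in range(n)]
d = [random.random() for _ in range(n)]

A = sum(w)                                       # stand-in for final_A
D = sum(wi * di for wi, di in zip(w, d))         # stand-in for final_D
D2 = sum(wi * di * di for wi, di in zip(w, d))   # accumulated squared depth

dL_dwk = D2 + d[k] ** 2 * A - 2 * d[k] * D       # closed form for dL/dw_k
dL_ddk = 2 * w[k] * (d[k] * A - D)               # closed form for dL/dd_k

# central finite differences
eps = 1e-6
wp, wm = list(w), list(w)
wp[k] += eps; wm[k] -= eps
fd_w = (loss(wp, d) - loss(wm, d)) / (2 * eps)
dp, dm = list(d), list(d)
dp[k] += eps; dm[k] -= eps
fd_d = (loss(w, dp) - loss(w, dm)) / (2 * eps)

assert abs(dL_dwk - fd_w) < 1e-5
assert abs(dL_ddk - fd_d) < 1e-5
```

Note that the check uses the single-counted loss of Eq. (17) ($j<i$); with the doubly-counted full matrix both gradients simply pick up a factor of 2.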
Hi @hbb1 The above formulation is very helpful in understanding the implementation of the backward function. May I further ask about the formulation of the normal loss?
Another quick question: when calculating the derivative, why not consider the case of $dL_k / dw_m$ (when $m \neq k$)?
Because we will instead compute something like $dL / dw_{k-1}$ so that the algorithm can be run efficiently back-to-front. Note that $\frac{\partial L_k}{\partial w_k} \frac{\partial w_k}{\partial w_{k-1}} = \frac{\partial L_k}{\partial w_{k-1}}$.
last_dL_dT = dL_dweight * alpha + (1 - alpha) * last_dL_dT;
@hbb1 Thanks a lot for your prompt reply. I tried to understand how the backward is implemented these days but still a bit confused. I would be very appreciative if you could explain a bit more.
Based on (17) $$\mathcal{L} = \sum^{N-1}_{i=0} \sum^{i-1}_{j=0} w_i w_j (d_i - d_j)^2$$.
$$w_k = \alpha_k \prod_{i=0}^{k-1}(1-\alpha_i)$$ The derivative of $w_k$ is $$\frac{\partial \mathcal{L}}{\partial w_k} = \frac{\partial \mathcal{L}_k}{\partial w_k} = D^2 + d_k^2 A - 2d_kD $$ As in the answer above, $\mathcal{L}_k$ is the $k$-th term of $$\sum^{N-1}_{i=0} \sum^{N-1}_{j=0} w_i w_j (d_i - d_j)^2 $$, where the second sum runs to $N-1$ rather than to $i-1$ as in (17).
The following code calculates $\frac{\partial \mathcal{L}}{\partial \alpha_k} $
dL_dalpha += dL_dweight - last_dL_dT;
// propagate the current weight W_{i} to next weight W_{i-1}
last_dL_dT = dL_dweight * alpha + (1 - alpha) * last_dL_dT;
My understanding is that
$$\frac{\partial \mathcal{L}}{\partial \alpha_k} = \sum_{i=0}^{N-1} \frac{\partial \mathcal{L}}{\partial w_i} \frac{\partial w_i}{\partial \alpha_k}$$
I know that $\frac{\partial w_i}{\partial \alpha_k}=0$ when $i<k$. Considering a back-to-front process, we iteratively calculate the following derivatives inside the loop `for (int j = 0; !done && j < min(BLOCK_SIZE, toDo); j++)`
$$\frac{\partial \mathcal{L}}{\partial \alpha_{N-1}}=\frac{\partial \mathcal{L}}{\partial w_{N-1}} \frac{\partial w_{N-1}}{\partial \alpha_{N-1}}=\frac{\partial \mathcal{L}}{\partial w_{N-1}} \prod_{i=0}^{N-2}(1-\alpha_i)$$
$$\frac{\partial \mathcal{L}}{\partial \alpha_{N-2}}=\frac{\partial \mathcal{L}}{\partial w_{N-1}} \frac{\partial w_{N-1}}{\partial w_{N-2}}\frac{\partial w_{N-2}}{\partial \alpha_{N-2}}+\frac{\partial \mathcal{L}}{\partial w_{N-2}} \frac{\partial w_{N-2}}{\partial \alpha_{N-2}}=\Big( \frac{\alpha_{N-1}}{\alpha_{N-2}}(1-\alpha_{N-2})\frac{\partial \mathcal{L}}{\partial w_{N-1}} +\frac{\partial \mathcal{L}}{\partial w_{N-2}}\Big)\prod_{i=0}^{N-3}(1-\alpha_i)$$
I was wondering if I made any mistakes. If not, how to get the implementation above based on the formulations. Thanks a lot.
Hi, I found I did not write it clearly. $L$ can be seen as the summation of an $N \times N$ matrix. When we derive the gradient of $w_k$, only some entries are involved; I denoted their sum by $L_k$, so that $\partial L / \partial w_k = \partial L_k / \partial w_k$.
Now let's think about the gradients of $a_k$: they can come both from $L$ through $w_k$ and from the $w_j$ with $j>k$ (occlusion). The $\partial L / \partial a_k$ through $w_k$ is computed as above, and the gradients from the later $w_j$ are computed as
dL_dalpha += dL_dweight - last_dL_dT;
// pass to the next
last_dL_dT = dL_dweight * alpha + (1 - alpha) * last_dL_dT;
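To see why these two lines implement the full chain rule, here is a hypothetical Python mirror of the back-to-front loop (a sketch, not the CUDA kernel itself): it assumes the upstream gradients $\partial L/\partial w_k$ are given as `g[k]`, folds the transmittance factor `T[k]` in explicitly (the kernel applies it elsewhere), and compares the recursion against a direct chain-rule evaluation.

```python
import random

random.seed(1)
n = 5
alpha = [random.uniform(0.1, 0.9) for _ in range(n)]
g = [random.random() for _ in range(n)]          # dL/dw_k, assumed given

# forward pass: transmittance T_k before Gaussian k, weights w_k = alpha_k * T_k
T = [1.0]
for a in alpha:
    T.append(T[-1] * (1 - a))
w = [alpha[k] * T[k] for k in range(n)]

# back-to-front recursion mirroring the kernel's two lines
dL_dalpha = [0.0] * n
last_dL_dT = 0.0
for k in reversed(range(n)):
    dL_dalpha[k] = (g[k] - last_dL_dT) * T[k]
    last_dL_dT = g[k] * alpha[k] + (1 - alpha[k]) * last_dL_dT

# direct chain rule: dw_k/dalpha_k = T_k, and dw_i/dalpha_k = -w_i/(1-alpha_k) for i > k
direct = [g[k] * T[k] - sum(g[i] * w[i] / (1 - alpha[k]) for i in range(k + 1, n))
          for k in range(n)]

for a, b in zip(dL_dalpha, direct):
    assert abs(a - b) < 1e-9
```

The recursion works because `last_dL_dT` at step $k$ equals $\sum_{i>k} g_i \alpha_i \prod_{j=k+1}^{i-1}(1-\alpha_j)$, which is exactly the occlusion term divided by $T_k$.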
@hbb1 Really appreciate your prompt reply. I can understand most of the parts until the last sentence on $dL/d a_k$. Because of occlusion, $\frac{\partial w_i}{\partial a_k}=0$ where $i < k$, therefore $$\frac{\partial L}{\partial a_k} = \sum_{i=k}^{N-1} \frac{\partial L}{\partial w_i} \frac{\partial w_i}{\partial a_k}$$ Also $$w_k = a_k \prod_{i=0}^{k-1}(1-a_i)$$ which means $w_k = \frac{a_k(1-a_{k-1})}{a_{k-1}}w_{k-1}$. Therefore we have $\frac{\partial w_k}{\partial w_{k-1}}=\frac{a_k(1-a_{k-1})}{a_{k-1}}$ and $\frac{\partial w_k}{\partial a_k}=\prod_{i=0}^{k-1}(1-a_i)=T_{k-1}$. Considering a back-to-front process and starting from the final $n$-th Gaussian: $$\frac{\partial L}{\partial a_n}=\frac{\partial L}{\partial w_n} \frac{\partial w_n}{\partial a_n}=\frac{\partial L}{\partial w_n}T_{n-1}$$ $$\frac{\partial L}{\partial a_{n-1}}= \frac{\partial L}{\partial w_{n-1}} \frac{\partial w_{n-1}}{\partial a_{n-1}} + \frac{\partial L}{\partial w_n} \frac{\partial w_n}{\partial w_{n-1}} \frac{\partial w_{n-1}}{\partial a_{n-1}}=\Big(\frac{\partial L}{\partial w_{n-1}} - \frac{a_n(a_{n-1}-1)}{a_{n-1}}\frac{\partial L}{\partial w_{n}} \Big)T_{n-1}$$.
Based on the code
dL_dalpha += dL_dweight - last_dL_dT;
// pass to the next
last_dL_dT = dL_dweight * alpha + (1 - alpha) * last_dL_dT;
Denoting last_dL_dT as $\frac{d L}{d T_{k+1}}$, and ignoring $T_k$, within the loop:
$$\frac{d L}{d T_{n+1}}=0$$
$$\frac{\partial L}{\partial a_n}=\frac{\partial L}{\partial w_n} - \frac{d L}{d T_{n+1}} =\frac{\partial L}{\partial w_n} $$
$$\frac{d L}{d T_{n}}=\frac{\partial L}{\partial w_n}a_n+(1-a_n)\frac{d L}{d T_{n+1}}=a_n\frac{\partial L}{\partial w_n}$$
$$\frac{\partial L}{\partial a_{n-1}}= \frac{\partial L}{\partial w_{n-1}} - \frac{d L}{d T_{n}} = \frac{\partial L}{\partial w_{n-1}} - a_n\frac{\partial L}{\partial w_n}$$.
I find the term $\frac{a_n}{a_{n-1}}\frac{\partial L}{\partial w_{n}}$ is missing.
May I know if I made any mistake? Thanks a lot.
Thanks guys for the detailed equations. It really helped me better understand the computations.
In short, the backward computation here is actually very similar to $\frac{\partial L}{\partial \alpha}$ for the color channels; the only difference is that last_color is replaced by dL_dweight. We don't really need things like $\frac{\partial w_n}{\partial w_{n-1}}$, we just need the explicit form of every term in
$$\sum_{i=k}^{n} \frac{\partial w_i}{\partial \alpha_k}.$$
And this can be implemented recursively if we do it back-to-front.
@YanhaoZhang I think the problem is that in this equation:
$$\frac{\partial L}{\partial a_{n-1}}= \frac{\partial L}{\partial w_{n-1}} \frac{\partial w_{n-1}}{\partial a_{n-1}} + \frac{\partial L}{\partial w_n} \frac{\partial w_n}{\partial w_{n-1}} \frac{\partial w_{n-1}}{\partial a_{n-1}}=\Big(\frac{\partial L}{\partial w_{n-1}} - \frac{a_n(a_{n-1}-1)}{a_{n-1}}\frac{\partial L}{\partial w_{n}} \Big)T_{n-1},$$
this part
$$\frac{\partial L}{\partial w_n} \frac{\partial w_n}{\partial w_{n-1}} \frac{\partial w_{n-1}}{\partial a_{n-1}} $$
is incomplete if we want the gradient of $\alpha_{n-1}$ from $w_n$, because
$$w_n = w_{n-1}\frac{\alpha_n}{\alpha_{n-1}}(1-\alpha_{n-1})=w_{n-1}\Big(\frac{\alpha_n}{\alpha_{n-1}}-\alpha_n\Big).$$
So, actually, the gradient of $\alpha_{n-1}$ from $w_n$ is
$$\frac{\partial L}{\partial w_n} \Big( \frac{\partial w_n}{\partial w_{n-1}} \frac{\partial w_{n-1}}{\partial a_{n-1}} + \frac{\partial w_n}{\partial \big(\frac{\alpha_n}{\alpha_{n-1}}-\alpha_n\big)} \frac{\partial \big(\frac{\alpha_n}{\alpha_{n-1}}-\alpha_n\big)}{\partial a_{n-1}}\Big),$$
which equals to
$$\frac{\partial L}{\partial w_n} \Big( \frac{\alpha_n}{\alpha_{n-1}} (1-\alpha_{n-1}) T_{n-1} + w_{n-1} \Big(-\frac{\alpha_n}{\alpha_{n-1}^2}\Big)\Big),$$
where the first part is what you got. But we still have the second part, after adding which, using $w_{n-1}=\alpha_{n-1}T_{n-1}$, we get
$$-\frac{\partial L}{\partial w_n} \alpha_n T_{n-1}.$$
Then we have that
$$\frac{\partial L}{\partial a_{n-1}} = \Big(\frac{\partial L}{\partial w_{n-1}} - \alpha_n\frac{\partial L}{\partial w_{n}} \Big)T_{n-1},$$
which is consistent with the code, and it is correct.
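The corrected two-term result can also be checked numerically. The sketch below (hypothetical Python, not project code) treats the upstream gradients $\partial L/\partial w_i$ as fixed coefficients `g[i]`, perturbs the second-to-last alpha with a central finite difference, and compares against $(\partial L/\partial w_{n-1} - \alpha_n\,\partial L/\partial w_n)\,T_{n-1}$:

```python
import random

# alpha-compositing weights w_k = alpha_k * prod_{i<k} (1 - alpha_i)
def weights(alpha):
    T, w = 1.0, []
    for a in alpha:
        w.append(a * T)
        T *= 1 - a
    return w

random.seed(2)
n = 4
alpha = [random.uniform(0.1, 0.9) for _ in range(n)]
g = [random.random() for _ in range(n)]      # stand-ins for dL/dw_i
L = lambda al: sum(gi * wi for gi, wi in zip(g, weights(al)))

# transmittance in front of the second-to-last Gaussian (T_{n-1} in the thread's notation)
T_front = 1.0
for a in alpha[:n - 2]:
    T_front *= 1 - a

closed = (g[n - 2] - alpha[n - 1] * g[n - 1]) * T_front   # the corrected formula

eps = 1e-6
ap, am = list(alpha), list(alpha)
ap[n - 2] += eps; am[n - 2] -= eps
fd = (L(ap) - L(am)) / (2 * eps)
assert abs(closed - fd) < 1e-6
```

Since the perturbed Gaussian is second to last, only its own weight and the last weight depend on it, so the two-term closed form is exact here.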
@ranrandy Hi, thanks a lot for your explanation. Now it is clear to me how the derivative is obtained. Appreciate it!
A decent and rigorous mathematical derivation! From my side, I derived it from the perspective of a sequence-to-sequence mapping $\{a_k\} \to \{T_k\} \to \{w_k\}$, using ideas like BPTT (back-propagation through time) from RNNs. @ranrandy makes this even clearer!
Thanks everyone for the involvement and discussion.
In particular, do final_A and final_D on line 367 need to change as the loop iterates?