@pkuCactus
Hi pkuCactus, thanks for sharing your code! I'm really interested in this paper.
But I still have some questions about the implementation of the supervisions (Y_s2d) and (Y_d2s).
For example, in the 'd2s' path, the supervision of the first layer/stage is Y1_d2s = Y - P2 - P3 - P4 - P5, but in the code you compute the loss as L(P1 + P2 + P3 + P4 + P5, Y), where P2~P5 are detach()ed.
Why not build the supervision Y1_d2s directly as "Y - P2 - P3 - P4 - P5", i.e. implement the loss as L(P1, Y - P2 - P3 - P4 - P5)?
My point is that the loss function is not a simple subtraction between prediction and label, so I guess L(P1 + P2 + P3 + P4 + P5, Y) is not equivalent to L(P1, Y - P2 - P3 - P4 - P5). Is that right?
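To illustrate the non-equivalence I mean, here is a minimal numerical sketch. It assumes a sigmoid cross-entropy loss (as is common in edge detection); the function `bce_loss`, the `mse` comparison, and the scalar values standing in for P1 and the detached sum P2+...+P5 are all illustrative and not taken from the repo:

```python
import numpy as np

def bce_loss(logit, target):
    """Sigmoid cross-entropy on a scalar logit (illustrative)."""
    p = 1.0 / (1.0 + np.exp(-logit))
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

def mse(pred, target):
    """Squared error, which depends only on (pred - target)."""
    return (pred - target) ** 2

# Toy scalars: p1 stands in for P1, rest for the detached P2+...+P5.
p1, rest, y = 0.3, 0.9, 1.0

# The two formulations under cross-entropy give DIFFERENT values,
# because BCE is not a function of (pred - target) alone.
loss_sum   = bce_loss(p1 + rest, y)   # L(P1 + P2..P5, Y)
loss_shift = bce_loss(p1, y - rest)   # L(P1, Y - P2..P5)
print(loss_sum, loss_shift)

# Under squared error they coincide, since (p1 + rest - y)
# equals (p1 - (y - rest)).
assert np.isclose(mse(p1 + rest, y), mse(p1, y - rest))
```

So for a loss like MSE, which depends only on the difference between prediction and label, the two forms agree, but for cross-entropy they do not.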
Thanks for your answer!