luckyxiaoqi opened this issue 3 years ago
VTN is less generalizable to deeper cascades because, due to the similarity loss, every cascade struggles to make its current warped image (rather than the final warped image) as similar as possible to the fixed image. This is indeed a better strategy than training each cascade one by one, but it is still less cooperative than optimizing a single final objective.
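The contrast between the two objectives can be sketched in a few lines of PyTorch. This is only a minimal illustration, not the actual VTN or RCN code: `cascades`, `warp`, `moving`, and `fixed` are hypothetical placeholders, MSE stands in for the correlation-based similarity used in the papers, and flow composition and regularization terms are omitted.

```python
import torch.nn.functional as F

def per_cascade_loss(cascades, warp, moving, fixed):
    """VTN-style objective: every cascade's output is pulled toward the fixed image."""
    warped, loss = moving, 0.0
    for cascade in cascades:
        flow = cascade(warped, fixed)              # each cascade sees the current warped image and the fixed image
        warped = warp(warped, flow)
        loss = loss + F.mse_loss(warped, fixed)    # similarity term on every intermediate warped image
    return loss

def final_only_loss(cascades, warp, moving, fixed):
    """RCN-style objective: only the final warped image is compared to the fixed image."""
    warped = moving
    for cascade in cascades:
        flow = cascade(warped, fixed)              # cascades still take the current warped image as input
        warped = warp(warped, flow)
    return F.mse_loss(warped, fixed)               # a single objective shared by all cascades
```

With the final-only objective, every cascade receives its gradient from the same final similarity term, so intermediate cascades are free to produce partial alignments that merely help the end result; with a per-cascade term, each cascade is pushed to fully align its own output to the fixed image.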
In your paper, you wrote:

"DLIR [19] and VTN [37] also stack their networks, though both limited to a small number of cascades. DLIR trains each cascade one by one, i.e., after fixing the weights of previous cascades. VTN jointly trains the cascades, while all successively warped images are measured by the similarity compared to the fixed image. Neither training method allows intermediate cascades to progressively register a pair of images. Those noncooperative cascades learn their own objectives regardless of the existence of others, and thus further improvement can hardly be achieved even if more cascades are conducted. They may realize that network cascading possibly solves this problem, but there is no effective way of training deep network cascades for progressive alignments."

and:

"The difference between our architecture and existing cascading methods is that each of our cascades commonly takes the current warped image and the fixed image as inputs (in contrast to [30, 45]) and the similarity is only measured on the final warped image (in contrast to [19, 37]), enabling all cascades to learn progressive alignments cooperatively."
I can't understand "Neither training method allows intermediate cascades to progressively register a pair of images." VTN's input is also two images, so why can't it progressively register a pair of images? Is it just because it adds a similarity loss on every intermediate warped image?
VTN's subnetworks can also cooperate, so why did you write "Those noncooperative cascades learn their own objectives regardless of the existence of others"?