tchittesh / lzu

Code for Learning to Zoom and Unzoom (CVPR 2023)
https://tchittesh.github.io/lzu/
MIT License

Questions about lzu vs fovea #1

Closed ShenZheng2000 closed 1 year ago

ShenZheng2000 commented 1 year ago

Hello, authors!

I noted that the main difference between lzu and fovea is that lzu applies zoom-unzoom early, during feature extraction, while fovea applies the unwarping later, after the predictions.

I understand this design allows generalization to more architectures (e.g., those with RPNs), but I'm curious why lzu performs better than fovea on object detection despite using the same saliency map.

Thanks!

tchittesh commented 1 year ago

Hi! LZU also uses a higher attraction fwhm (10 vs 4), which leads to stronger magnification. It's possible that FOVEA would match or outperform LZU on object detection with better tuning.
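For context, fwhm here is just a way of specifying the width of the Gaussian attraction kernel. Here's a minimal sketch of the conversion, assuming an isotropic Gaussian parameterized by its full width at half maximum (the kernel size and names are illustrative, not the repo's actual code):

```python
import math
import torch

def gaussian_attraction_kernel(fwhm: float, kernel_size: int = 31) -> torch.Tensor:
    """Normalized 2D Gaussian kernel whose width is specified as a FWHM."""
    # Standard conversion from full width at half maximum to standard deviation.
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    coords = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2.0
    gauss_1d = torch.exp(-coords ** 2 / (2.0 * sigma ** 2))
    kernel_2d = torch.outer(gauss_1d, gauss_1d)
    return kernel_2d / kernel_2d.sum()

# A larger fwhm (10 vs 4) spreads the attraction over a wider neighborhood,
# which in saliency-guided warping translates into stronger magnification
# around the salient regions.
```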

My intuition is that the "unzoom" operation is actually quite lossy / destructive to the features. I think there's a lot of future room for improvement here, e.g. using anti-aliased resampling or attention-based warping.
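To make the lossiness concrete, here's a minimal sketch of what an unzoom step looks like if it's implemented as bilinear resampling with grid_sample (the names and grid convention are illustrative, not necessarily how the repo does it):

```python
import torch
import torch.nn.functional as F

def unzoom_features(warped_feats: torch.Tensor, forward_grid: torch.Tensor) -> torch.Tensor:
    """Resample features computed on the zoomed image back onto a uniform grid.

    warped_feats: (B, C, H, W) feature map from the backbone on the warped input.
    forward_grid: (B, H_out, W_out, 2) normalized coords in [-1, 1]; for each
        unwarped output location, where it falls in the warped feature map.
    """
    # Plain bilinear sampling: regions that were magnified by the zoom get
    # downsampled here without any anti-aliasing, and regions that were shrunk
    # are reconstructed from very few feature samples -- both are lossy.
    return F.grid_sample(warped_feats, forward_grid, mode="bilinear", align_corners=False)
```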

ShenZheng2000 commented 1 year ago

Thank you! I have an additional query about the "unzoom" concept. In this code, it appears that the feature being unzoomed is the last one, i.e., the output of the backbone network's final layer.

However, in Fig. 1 of the paper, the unzoom features seem to originate from early stages, as they visually resemble an image. I'm uncertain whether this is solely for visualization purposes or if you've also attempted to extract early features.

tchittesh commented 1 year ago

Oh that was just for diagram purposes. Unfortunately I don’t remember what layer I actually pulled it from or how I collapsed the channel dimension.

ShenZheng2000 commented 1 year ago

Thanks for the explanation!

I revisited the fovea paper and saw they experimented with Faster R-CNN, a two-stage detector. I'm therefore puzzled why LZU says "when there are intermediate losses, as is the case with two-stage detectors containing region proposal networks (RPNs) [21], this requires more complex modifications to the usual delta loss formulation".

Regardless of it being a one-stage or two-stage detector, can't we just unwarp the final predicted bounding boxes?

[attached image: WhatsApp Image 2023-08-14 at 00 49 05]


If I simplify the comparison between FOVEA and LZU:

LZU = feature extraction on warped images + unwarp features + bbox preds on unwarped features
FOVEA = feature extraction on warped images + bbox preds on warped features + unwarp bboxes
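Put schematically (the function names below are just placeholders, not the actual APIs of either codebase):

```python
# LZU: unwarp the feature map, then predict boxes in unwarped space.
def lzu_pipeline(image, warp, backbone, det_head):
    warped_image = warp.zoom(image)         # magnify salient regions
    warped_feats = backbone(warped_image)   # feature extraction on the warped input
    feats = warp.unzoom(warped_feats)       # resample features back to a uniform grid
    return det_head(feats)                  # RPN/RoI heads see unwarped features

# FOVEA: predict boxes in warped space, then unwarp the boxes themselves.
def fovea_pipeline(image, warp, backbone, det_head):
    warped_image = warp.zoom(image)
    warped_feats = backbone(warped_image)
    warped_boxes = det_head(warped_feats)   # detection head sees warped features
    return warp.unzoom_boxes(warped_boxes)  # map box coordinates back to the original image
```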

I suspect LZU's primary strength is that it makes bbox predictions on unwarped features. Tools like RPNs and RoIs, which are commonly used for bbox prediction, were originally designed for unwarped images. That might be why it works better.

What do you think? Looking forward to your reply! Thanks!

tchittesh commented 1 year ago

Regardless of it being a one-stage or two-stage detector, can't we just unwarp the final predicted bounding boxes?

The difficulty is with the delta loss formulation in the RPN head. Such RPN heads regress the height/width deltas between a given anchor and the ground truth bbox. Then, they apply the predicted height/width deltas before extracting the RoI features.

The issue is that this requires transforming bboxes (either predictions or ground truth) from the original image to the warped image, which requires the inverse warp. Here's the relevant excerpt from FOVEA:

[image: excerpt from the FOVEA paper]
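To spell out where the warp enters, here is a minimal sketch of the standard anchor-to-ground-truth delta encoding; the warp_boxes helper at the end is hypothetical and just marks where boxes would have to be transformed into warped-image coordinates:

```python
import torch

def encode_deltas(anchors: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """Standard Faster R-CNN style delta encoding from anchors to ground truth.

    anchors, gt_boxes: (N, 4) boxes in (x1, y1, x2, y2) format.
    Returns (N, 4) regression targets (dx, dy, dw, dh).
    """
    aw, ah = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    ax, ay = anchors[:, 0] + 0.5 * aw, anchors[:, 1] + 0.5 * ah
    gw, gh = gt_boxes[:, 2] - gt_boxes[:, 0], gt_boxes[:, 3] - gt_boxes[:, 1]
    gx, gy = gt_boxes[:, 0] + 0.5 * gw, gt_boxes[:, 1] + 0.5 * gh
    dx, dy = (gx - ax) / aw, (gy - ay) / ah
    dw, dh = torch.log(gw / aw), torch.log(gh / ah)
    return torch.stack([dx, dy, dw, dh], dim=1)

# When the RPN operates on warped features (FOVEA-style), its anchors live in
# warped-image coordinates, so the ground-truth boxes have to be transformed
# into that space before the deltas can be computed, e.g.:
#
#   gt_boxes_warped = warp_boxes(gt_boxes, warp)   # hypothetical helper
#   targets = encode_deltas(anchors, gt_boxes_warped)
#
# A nonlinear warp doesn't keep boxes axis-aligned, so warp_boxes has to
# approximate (e.g. warp corner points and re-fit an axis-aligned box), which
# is where the extra modifications to the delta loss formulation come from.
```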

Tools like RPNs and ROIs, which are commonly used for bbox predictions, were initially designed for unwarped images.

I'm not really seeing any assumptions (or inductive biases) in RPNs or ROIs that would make them less effective for warped images. Happy to be proven or convinced otherwise though!