hekj / FDA

Official Implementation of Frequency-enhanced Data Augmentation for Vision-and-Language Navigation (NeurIPS 2023)

Some Questions About Visual-Textual Matching #2

Open zhangpingrui opened 7 months ago

zhangpingrui commented 7 months ago

Hi, I have read the paper and the FDA code. It's really a novel work with a nice perspective.

In the paper, I found that this work claims that using high-frequency information helps visual-textual matching, but it is not clear to me which experiment in the paper supports this viewpoint.

I am also really interested in how to improve visual-textual matching capability in the VLN task.

shonnon-zxs commented 7 months ago

The method described in this article involves enhancing original images by adding high-frequency information from other random images, resulting in augmented images that serve as new samples. These augmented images are then trained alternately with the original samples. The underlying motivation is that VLN models are sensitive to high-frequency information. This method enables VLN models to focus more accurately on the original high-frequency information. Essentially, it allows the model to overcome the interference of high-frequency noise from random images and to recognize the correct high-frequency information in the original images, which is how it supports visual-textual matching. (This is my guess.)
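For concreteness, the mixing operation described above could be sketched roughly as follows. This is only my guess at the kind of Fourier-domain operation involved, not the official FDA code: the function names, the square low-frequency cutoff, the mixing `ratio`, and the choice to mix only the amplitude spectrum (keeping the original phase) are all assumptions.

```python
import numpy as np

def high_freq_mask(h, w, cutoff=0.25):
    """Boolean mask that is True outside a centered low-frequency square.

    `cutoff` is an assumed hyperparameter: the half-width of the preserved
    low-frequency region, as a fraction of the image size.
    """
    mask = np.ones((h, w), dtype=bool)
    ch, cw = int(h * cutoff), int(w * cutoff)
    mask[h // 2 - ch : h // 2 + ch, w // 2 - cw : w // 2 + cw] = False
    return mask

def mix_high_freq(original, random_img, cutoff=0.25, ratio=1.0):
    """Blend the high-frequency amplitude spectrum of `original` with that
    of `random_img`, keep the original phase, and invert the FFT.

    Works on 2-D (grayscale) float arrays of the same shape.
    """
    # Shift the spectra so low frequencies sit at the center.
    f_orig = np.fft.fftshift(np.fft.fft2(original))
    f_rand = np.fft.fftshift(np.fft.fft2(random_img))

    amp_orig, phase_orig = np.abs(f_orig), np.angle(f_orig)
    amp_rand = np.abs(f_rand)

    # Mix amplitudes only in the high-frequency region.
    mask = high_freq_mask(*original.shape, cutoff=cutoff)
    amp_mixed = np.where(mask,
                         (1.0 - ratio) * amp_orig + ratio * amp_rand,
                         amp_orig)

    # Recombine with the original phase and invert.
    mixed = amp_mixed * np.exp(1j * phase_orig)
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))
```

With `ratio=0` the image is recovered unchanged (up to FFT round-off), so the augmentation strength can be tuned continuously; the augmented output would then be paired with the *original* instruction during training, which is where the "recognize the correct high-frequency information despite noise" intuition comes in.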

zhangpingrui commented 7 months ago

> The method described in this article involves enhancing original images by adding high-frequency information from other random images, resulting in augmented images that serve as new samples. These augmented images are then alternately trained with the original samples. The underlying motivation is that VLN models are sensitive to high-frequency information. This method enables VLN models to focus more accurately on the original high-frequency information. Essentially, it allows the model to overcome the interference of high-frequency noise from random images and to recognize the correct high-frequency information in the original images, which thus achieves visual-textual matching. (I guessed it)

Yep. But I think the high-frequency mixing method improves performance because it makes the model generalize better. Mixing high-frequency components is a common technique in the domain generalization area, as in A Fourier-based Framework for Domain Generalization. So I think the main effect of the mixing method is an improved generalization ability.

So I still don't clearly see how high-frequency information helps the model perform visual-textual matching.