pouyaAB / Pay-Attention


Is the pretrained teacher module available? #1

Open · st2yang opened this issue 4 years ago

st2yang commented 4 years ago

Hi Pooya,

Thanks for releasing your code.

Though the processed attention maps are released, I am wondering if the teacher module is also available? I'd like to apply it in my project, and I'd appreciate it if you're willing to share it.

Also, I'm curious how the teacher module would react if there were a red bowl and a blue box in the image while the command is "Pick up red bowl". Would the teacher module be able to highlight the red bowl while ignoring the distractor (the blue box here)?

Thanks, Yang

pouyaAB commented 4 years ago

Hi Yang,

Regarding your question about the teacher network's behavior, I should refer you to our latest paper and its video. There we discuss how we used a data augmentation technique so that the attention module can do what you described. The teacher network in this repository is not able to do that, because here our definition of clutter doesn't include objects from the training set; in our latest work it does. So please check out our latest work. I will release its repository soon. The attention module architecture in the new paper is closely related to the teacher network in this work. I will update this thread as soon as I release the code for my latest work, probably in two weeks.

st2yang commented 4 years ago

Hi Pooya,

Your new paper is very interesting, and the attention module closely matches the features I need. I'm looking forward to the code release. Thank you in advance!

Best, Yang

pouyaAB commented 4 years ago

Please take a look at my new repository. I will update the Readme soon, but you can already check out the code. Take a look at the sample outputs in the sample folder. You can download pre-trained models from here.

st2yang commented 4 years ago

Hi Pooya,

Thanks a lot for sharing the code. I appreciate it!

I'll check out the code carefully. Can you give me a quick pointer to where the attention map is generated? I am particularly interested in the attention <- f_v(O,T) part.

BTW, I am doing a toy experiment that encodes text and visual input to generate the attention map of the target in clutter, i.e., attention <- f_v(O,T). Do you think it is possible to use the target mask as the supervision to achieve this? I am a little stuck on this...

Thanks a lot! Yang

pouyaAB commented 4 years ago

Take a look at the class Encoder_text_tower inside the file autoencoders/reduction_att.py. This class contains the implementation of the attention module described in the paper. It receives batches of images and their corresponding one-hot vectors describing the target object as input, and it outputs the attention map together with the result of classifying the image based on the features pooled by the attention map. The classification error forces the network to put attention on the target object. What you described should be achievable with this approach.
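For illustration only, here is a minimal PyTorch sketch of a text-conditioned spatial attention module of this kind. It is not the actual Encoder_text_tower implementation; all class, layer, and parameter names (ToyAttentionModule, n_shapes, n_colors, feat_ch) are made up for the example.

```python
# Minimal sketch (NOT the repository's code): a CNN feature map is mixed with an
# encoding of the target's shape/color one-hot vectors to produce a spatial
# attention map, which is then used to pool features for classification.
import torch
import torch.nn as nn

class ToyAttentionModule(nn.Module):
    def __init__(self, n_shapes=5, n_colors=5, feat_ch=64):
        super().__init__()
        # Small CNN producing an intermediate feature map from the image.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Encode the two one-hot vectors (target shape and color) into a feature vector.
        self.text_fc = nn.Linear(n_shapes + n_colors, feat_ch)
        # 1x1 conv mixing image and text features into a single-channel attention map.
        self.att_conv = nn.Conv2d(feat_ch * 2, 1, kernel_size=1)
        # Classification heads over the attention-pooled features.
        self.shape_head = nn.Linear(feat_ch, n_shapes)
        self.color_head = nn.Linear(feat_ch, n_colors)

    def forward(self, image, shape_onehot, color_onehot):
        fmap = self.cnn(image)                                   # (B, C, H, W)
        txt = self.text_fc(torch.cat([shape_onehot, color_onehot], dim=1))
        txt = txt[:, :, None, None].expand_as(fmap)              # broadcast over space
        att = torch.sigmoid(self.att_conv(torch.cat([fmap, txt], dim=1)))  # (B, 1, H, W)
        # Attention-weighted average pooling of the feature map.
        pooled = (fmap * att).flatten(2).sum(-1) / (att.flatten(2).sum(-1) + 1e-6)
        return att, self.shape_head(pooled), self.color_head(pooled)
```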

st2yang commented 4 years ago

Just to confirm: I can take your attention module (which takes a text command and an image containing clutter) and have it output the attention map of the target object, framed as a classification problem, while the ground-truth target mask is used as the supervision. Correct?

pouyaAB commented 4 years ago

Well, not exactly. Our model doesn't need a ground-truth target mask. The attention module extracts features from both the image and the two one-hot vectors describing the target object's shape and color (e.g. red bowl). By mixing these features we extract the attention map. We then use the attention map to pool features from an intermediate feature map of the CNN. If the attention focuses on the part of the image related to the target object, the network should be able to use these pooled features to classify the image based on the target object's shape and color. Read the new paper for details. The loss function is the classification loss between the prediction and the ground-truth shape and color of the target object, plus the l1-norm of the attention map to incentivize the network to select only a few regions in the image.
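A rough sketch of this objective, reusing the toy module from the earlier snippet (the weight l1_weight is an illustrative hyperparameter, not a value from the paper):

```python
# Classification loss on the target's shape and color plus an L1 penalty on the
# attention map, as described above. Sketch only, not the repository's loss code.
import torch.nn.functional as F

def attention_loss(att, shape_logits, color_logits,
                   shape_target, color_target, l1_weight=1e-3):
    cls_loss = F.cross_entropy(shape_logits, shape_target) + \
               F.cross_entropy(color_logits, color_target)
    sparsity = att.abs().mean()   # l1-norm term: keeps the attention map sparse
    return cls_loss + l1_weight * sparsity
```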

st2yang commented 4 years ago

Hi Pooya,

I read your two papers in detail. And I will definitely try it out.

But before that, I would like to adapt your attention module in a toy experiment, i.e., training it under the supervision of the target mask, since I guess the target mask should be a pretty strong supervision signal. This is what I was describing in the last two comments. Is it possible?

Thanks, Yang

pouyaAB commented 4 years ago

Yeah, if you have ground truth for the attention mask, you most probably don't even need the classification part. So yes, it should work.
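For the toy experiment discussed here, mask supervision could look roughly like the sketch below: the predicted attention map is compared directly against the ground-truth target mask, resized to the attention map's resolution. This is an assumption about how one might set it up, not code from either repository.

```python
# Sketch of mask-supervised training: supervise the attention map with a
# ground-truth binary target mask instead of (or in addition to) classification.
import torch.nn.functional as F

def mask_supervised_loss(att, gt_mask):
    # att: (B, 1, H, W) attention map in [0, 1]
    # gt_mask: (B, 1, H_img, W_img) binary ground-truth target mask
    gt_small = F.interpolate(gt_mask.float(), size=att.shape[-2:], mode="nearest")
    return F.binary_cross_entropy(att, gt_small)
```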

st2yang commented 4 years ago

Hi Pooya,

I read and compared your two papers one more time. Here is my understanding of the "clutter" part: (1) Building on the CVPR paper, the new paper uses data augmentation to create a cluttered dataset, and uses reconstruction (DA->O, EAC->EA) to force the attention module to focus on the target. (2) The data augmentation and reconstruction trick stem from the fact that "we don't have to manually collect and label cluttered images". But if we can create a cluttered dataset and get ground-truth target masks in simulation, then with the attention module we can achieve the same effect, i.e., generate the attention map for the target in clutter.

Is my understanding correct?

Also, I notice you dropped the LSTM for text encoding in the new paper and use an fc layer instead. Can I ask why? Is it because the LSTM is not necessary, or because the fc layer is more tractable?

Thanks, Yang

pouyaAB commented 4 years ago

Well, correct. If you gather a cluttered dataset and ground-truth target masks, you should be able to achieve the same effect. However, one advantage of using data augmentation (as you mentioned) is the reconstruction error between the (DA->D, EAC->EA) pairs. The reconstruction loss was suitable for our problem since we were trying to completely block the flow of information from the unrelated parts of the image while still keeping the robot's position: the reconstruction loss forces the attention module to include the robot in the attention map. Completely blocking the flow of information was not possible in the CVPR paper, because we still used the original image as the encoder's input in order to have the robot's position. This is another reason the new paper shows a significant jump in the success rate even in benign conditions. But this might not matter for your problem if the target object alone is enough for your application. Or you might be able to generate the corresponding pairs (DA->D, EAC->EA) of images in simulation.
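As a rough illustration of how such a reconstruction term could be added to the earlier toy loss (the decoder, the pairing of augmented and clean images, and all weights here are assumptions for the sketch, not the paper's actual setup):

```python
# Sketch: classification + L1 sparsity + reconstruction of the clean (un-cluttered)
# paired image from a decoder driven by the attention-masked features of the
# augmented image. Illustrative only; see the paper/repository for the real pairing.
import torch.nn.functional as F

def combined_loss(decoder_out, clean_image, att, shape_logits, color_logits,
                  shape_target, color_target, recon_weight=1.0, l1_weight=1e-3):
    recon = F.mse_loss(decoder_out, clean_image)        # reconstruction of the clean pair
    cls = F.cross_entropy(shape_logits, shape_target) + \
          F.cross_entropy(color_logits, color_target)
    return cls + recon_weight * recon + l1_weight * att.abs().mean()
```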

Regarding the LSTM: since we don't have many diverse sentences in our dataset (at most 13) and each sentence has at most 5 words, using an LSTM to process them was overkill. An LSTM might be a better choice when there are longer, more diverse sentences.