Hello, the goal of $A_j$ is to meaningfully translate the feature embeddings. We wanted all the trainable blocks, such as the RPN and the final bbox classifier and regressors, to use these translated features.
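For intuition, here is a minimal sketch of what "translating the features before the trainable heads" could look like; the module name `FeatureTranslator`, the 1x1-conv parameterization, and the tensor shapes are illustrative assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

class FeatureTranslator(nn.Module):
    """Hypothetical stand-in for A_j: a small learnable block that translates
    backbone feature maps from the source style towards a target style."""
    def __init__(self, channels: int = 1024):
        super().__init__()
        # A lightweight 1x1 conv keeps the spatial layout intact and only
        # re-mixes channels; the real A_j may be parameterized differently.
        self.translate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.translate(feats)

# Usage sketch: the translated features (not the raw backbone output) are
# what the RPN and the final bbox classifier/regressor consume.
backbone_feats = torch.randn(2, 1024, 38, 50)   # N x C x H x W (illustrative)
a_j = FeatureTranslator(channels=1024)
translated = a_j(backbone_feats)                # same shape, shifted style
# rpn_out = rpn(translated); the cls/reg heads also take `translated`
```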
We used a single A100 GPU with 40 GB of VRAM for all experiments.
Does that mean we cannot do the augmentation like this?
I would like to know: what is the difference between directly subtracting the embeddings of the words 'rain' and 'sunny' as the augmentation, versus learning the translation $A_j$? I appreciate your assistance on this matter.
The output of $\mathcal{V}^b$ corresponds to the final 512-dimensional CLIP embedding, whereas $\mathcal{V}^a$ produces an $h \times w \times 1024$ feature map, so it is not straightforward to map those features to the CLIP embedding space there.
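To make the shape mismatch concrete, here is a small sketch using the OpenAI `clip` package; the prompt wording and the intermediate feature-map shape are illustrative assumptions:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN101", device=device)

# The text direction 'rain' - 'sunny' lives in the final 512-d CLIP space,
# i.e. the space that the output of V^b maps into.
tokens = clip.tokenize(["a photo on a rainy day",
                        "a photo on a sunny day"]).to(device)
with torch.no_grad():
    text_feats = model.encode_text(tokens)        # shape: (2, 512)
rain_minus_sunny = text_feats[0] - text_feats[1]  # 512-d direction

# The output of V^a, by contrast, is an intermediate feature map of roughly
# 1024 channels over an h x w grid; it does not live in the 512-d CLIP
# space, so the text direction cannot simply be subtracted from it.
intermediate_feats = torch.randn(1, 1024, 38, 50)  # illustrative shape only
print(rain_minus_sunny.shape, intermediate_feats.shape)
```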
Thank you for your explanation; I completely understand now! One last question: if I use the ViT-B/16 pre-trained CLIP model as the text encoder and an ImageNet-pre-trained RN101 model as the image encoder, would this approach still be effective? Or is the method only applicable when the image encoder and text encoder come from the same source?
From my understanding, the underlying CLIP text encoder architecture (i.e., GPT-2) is the same whether the image encoder is ViT-B/16 or RN101. In any case, I think consistency might be needed.
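As a quick sanity check (again using the OpenAI `clip` package; the checkpoint names are the standard ones), you can verify that both checkpoints expose a text encoder with the same 512-dimensional output, even though each one was trained jointly with a different image encoder:

```python
import clip
import torch

tokens = clip.tokenize(["a photo taken on a rainy day"])
for name in ["ViT-B/16", "RN101"]:
    model, _ = clip.load(name, device="cpu")
    with torch.no_grad():
        text_emb = model.encode_text(tokens)
    print(name, text_emb.shape)   # both report torch.Size([1, 512])

# The text encoder architectures match, but each text encoder was trained
# jointly with its own image encoder, so mixing checkpoints (or swapping in
# an ImageNet-pretrained image encoder) may break the alignment between the
# image and text embedding spaces.
```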
Thank you for the detailed explanation. Your insights are very helpful and enlightening!
I'm confused: if we can consider it as the target image embedding, why do we need to train $A_j$? Why not just add it in the training step? By the way, how much memory is needed to run this experiment in your setting? I would greatly appreciate any assistance you can provide.