vidit09 / domaingen

CLIP the Gap CVPR 2023

Question about Semantic Augmentation #19

Closed luoluo123123123123 closed 6 months ago

luoluo123123123123 commented 6 months ago

I'm confused: if we can consider the image as the target image embedding, why do we need to train $A_j$?

Why not just add the image to the training step? By the way, how much memory is needed to run this experiment in your setting? I would greatly appreciate any assistance you can provide.

vidit09 commented 6 months ago

Hello, the goal of $A_j$ is to meaningfully translate the feature embeddings. We wanted all the trainable blocks, like the RPN and the final bbox classifier and regressor, to use these translated features.
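As a minimal sketch of the idea (not the repo's actual code; shapes and the per-channel form of $A_j$ are assumptions for illustration), a learned translation shifts the backbone feature map so the downstream heads train on the augmented features:

```python
import numpy as np

# Hypothetical sketch: A_j is a learned translation applied to backbone
# features for target "style" j, so trainable heads (RPN, box classifier,
# regressor) see the translated features during training.
rng = np.random.default_rng(0)

feats = rng.standard_normal((2, 1024, 7, 7))  # backbone feature map (N, C, H, W)
A_j = rng.standard_normal(1024) * 0.01        # per-channel translation (assumed form)

# broadcast the translation over batch and spatial dimensions
augmented = feats + A_j[None, :, None, None]
print(augmented.shape)  # (2, 1024, 7, 7)
```

The point is that $A_j$ lives in the same space as the intermediate features, so the shifted features can be fed directly to the detection heads.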

We used 1 A100 GPU with 40GB VRAM in all the experiments.

luoluo123123123123 commented 6 months ago

[image] Does that mean we cannot do augmentation like this?

luoluo123123123123 commented 6 months ago

I would like to know: what is the difference between directly subtracting the embeddings of the words 'rain' and 'sunny' as augmentation, and learning the translation $A_j$? I appreciate your assistance on this matter.

vidit09 commented 6 months ago

The output of $\mathcal{V}^b$ corresponds to the final 512-dimensional CLIP embedding, whereas $\mathcal{V}^a$ produces an $h \times w \times 1024$ feature map, so it is not straightforward to map those intermediate features into the CLIP embedding space.

luoluo123123123123 commented 6 months ago

Thank you for your explanation; I completely understand now! One last question: if I use the ViT-B/16 pre-trained CLIP model as the text encoder and the ImageNet-pretrained RN101 model as the image encoder, would this approach still be effective? Or is the method only applicable when the image encoder and text encoder come from the same source?

vidit09 commented 6 months ago

From my understanding, the underlying CLIP text encoder architecture (a GPT-2-style transformer) is the same whether the paired image encoder is ViT-B/16 or RN101. In any case, I think consistency between the two encoders might be needed.
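A small sketch of why consistency matters (the dimensions below are factual for CLIP and ImageNet ResNet-101, but the check itself is an illustrative toy, not part of the repo): text-derived augmentation directions are only usable if the image features live in the same joint embedding space, which CLIP's contrastive training guarantees and ImageNet pre-training does not.

```python
# Toy compatibility check: a text-space direction can only be applied to
# image features of the same dimensionality, in the same learned space.
def compatible(text_dim: int, image_dim: int) -> bool:
    return text_dim == image_dim

# CLIP ViT-B/16 and CLIP RN101 both project into a 512-d joint space;
# an ImageNet-pretrained RN101 backbone outputs 2048-d pooled features
# with no projection into CLIP's space.
print(compatible(512, 512))   # True: CLIP text encoder + CLIP image encoder
print(compatible(512, 2048))  # False: CLIP text encoder + ImageNet RN101
```

Note that matching dimensionality is necessary but not sufficient: even two 512-d spaces are only interchangeable if the encoders were trained jointly, as in CLIP.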

luoluo123123123123 commented 6 months ago

Thank you for the detailed explanation. Your insights are very helpful and enlightening!