jialuli-luka / PanoGen

Code and Data for Paper: PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation

The logits in the CLIP feature #7

Closed honghd16 closed 9 months ago

honghd16 commented 9 months ago

Hi Jialu,

Thanks for your great work. I am curious about how to get the 1000-dim logits when extracting the CLIP features for the views. The original HAMT/DUET uses a ViT pretrained on ImageNet-1K, which has a prediction head for the 1000-class classification, so it concatenates a 1000-dim logit vector with the 768-dim feature. However, since CLIP does not have such a prediction head, how do I get the same 1000 logits? I ask because the logits are needed for MRC during pretraining, and they are also present in the CLIP features you provided.

Best regards

honghd16 commented 9 months ago

Btw, I looked through the extraction code in EnvEdit and CLIP-ViL but did not find this part.

jialuli-luka commented 9 months ago

I use the 1000-dim logits from the HAMT/DUET features.
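
A minimal sketch of what this reply implies, not the actual PanoGen extraction code: the 1000-dim logits are copied from the existing HAMT/DUET view features and appended to the 768-dim CLIP features. The function name `attach_logits`, the array shapes, and the assumed layout (768 feature dimensions first, 1000 logits last) are illustrative assumptions; check them against the feature files you actually use.

```python
import numpy as np

FEAT_DIM = 768    # per-view image feature size mentioned in the thread
LOGIT_DIM = 1000  # ImageNet-1K class logits


def attach_logits(clip_feats, vit_feats):
    """Append the logits stored in vit_feats to clip_feats (hypothetical helper).

    Assumes vit_feats rows are laid out as [768 features | 1000 logits];
    this ordering is an assumption, not confirmed by the thread.
    """
    if clip_feats.shape != (vit_feats.shape[0], FEAT_DIM):
        raise ValueError("clip_feats must have shape (num_views, 768)")
    if vit_feats.shape[1] != FEAT_DIM + LOGIT_DIM:
        raise ValueError("vit_feats must have shape (num_views, 1768)")
    logits = vit_feats[:, FEAT_DIM:]  # keep only the 1000 logits
    return np.concatenate([clip_feats, logits], axis=1)


# Toy data standing in for the 36 views of one viewpoint.
rng = np.random.default_rng(0)
clip_feats = rng.standard_normal((36, FEAT_DIM)).astype(np.float32)
vit_feats = rng.standard_normal((36, FEAT_DIM + LOGIT_DIM)).astype(np.float32)

combined = attach_logits(clip_feats, vit_feats)
print(combined.shape)  # (36, 1768)
```

With this layout, the combined array keeps the CLIP feature in the first 768 columns, and the MRC objective can read its classification targets from the last 1000 columns, as it would with the original features.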

honghd16 commented 9 months ago

Thanks for your quick reply, cheers!