jialuli-luka / EnvEdit

PyTorch Code and Data for EnvEdit: Environment Editing for Vision-and-Language Navigation (CVPR 2022)

Question about precomputing the CLIP features #3

Closed MarSaKi closed 2 years ago

MarSaKi commented 2 years ago

Hi Jialu,

Thanks for your great work! I have some questions about how you precomputed your CLIP features. Could you please give me some hints? I'd like to know which version of CLIP you used: is it this one? If so, I found that my computed features differ from your CLIP-ViT-B-16 features. Both are 512-dimensional, but there are some numerical differences. For example:

Your CLIP-ViT-B-16:
array([ 0.63720703, -0.49316406, 0.14416504, 0.00839233, -0.6166992 , 0.07427979, 0.09985352, 0.32763672, 0.20617676, 0.08319092], dtype=float32)

My CLIP-ViT-B-16:
array([ 0.69677734, -0.6723633 , 0.14343262, 0.04043579, -0.52441406, 0.15881348, 0.0838623 , 0.24719238, 0.19128418, 0.0085144 ], dtype=float32)

Do you have any idea what could cause this?
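For reference, here is a minimal sketch of how I computed my features, using the pip-installed openai/CLIP package (the image path is just a placeholder):

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the ViT-B/16 checkpoint from the official openai/CLIP package
model, preprocess = clip.load("ViT-B/16", device=device)
model.eval()

# Placeholder path to a single panorama view image
image = preprocess(Image.open("view_0.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    features = model.encode_image(image)  # shape: (1, 512) for ViT-B/16

print(features[0, :10].cpu().numpy())
```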

Besides, did you use this script in CLIP-ViL to precompute the features?

Thanks for your attention to this matter!

jialuli-luka commented 2 years ago

Hi,

Thanks for your interest in EnvEdit! I just updated the git repo with the CLIP feature extraction code. We didn't specify the CLIP version, so it should be the default version from the CLIP repository. The code we use to extract CLIP features is almost the same as in CLIP-ViL-VLN, except that we load the images from files directly instead of using the simulator.

Best,
Jialu
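To make that concrete, here is a minimal sketch of the file-based extraction loop described above; the directory layout, file naming, and the choice of ViT-B/16 are illustrative assumptions, not the exact code from the repo:

```python
import os

import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-B/16 assumed here, matching the released CLIP-ViT-B-16 feature files
model, preprocess = clip.load("ViT-B/16", device=device)
model.eval()

def extract_viewpoint_features(view_dir):
    """Encode the 36 discretized views of one panorama, loaded from image
    files (hypothetical naming: <view_dir>/<view_index>.jpg) instead of
    rendering them with the simulator."""
    views = []
    for ix in range(36):
        img = preprocess(Image.open(os.path.join(view_dir, f"{ix}.jpg")))
        views.append(img)
    batch = torch.stack(views).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)  # shape: (36, 512)
    return feats.cpu().numpy().astype(np.float32)
```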

MarSaKi commented 2 years ago

Many thanks for your quick reply!