Closed wz0919 closed 2 years ago
Hi Zun,
I'm sorry for the confusion. We use the features extracted with the visual backbone in the HAMT pre-trained model, which were extracted with datasets/R2R/trained_models/vitbase-6tasks-pretrain-e2e/model_step_22000.pt. I just updated the repo and added the code that we use for HAMT.
Please let me know if you have further questions.
Jialu
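For anyone reading along, the step described above (taking only the visual backbone out of the full HAMT checkpoint so its features can be extracted separately) might look roughly like this. This is a minimal sketch of my own, not code from the repo; the "vln_bert.vis_encoder." key prefix is hypothetical, so print the checkpoint's keys first and adjust it to whatever the released model actually uses.

```python
# Sketch (not from the repo): keep only the visual-backbone weights from the
# full HAMT checkpoint, stripping their prefix so the result matches a bare
# ViT's state_dict. The prefix below is an assumption -- inspect the
# checkpoint's keys and change it as needed.

def filter_backbone(state_dict, prefix="vln_bert.vis_encoder."):
    """Return only the entries under `prefix`, with the prefix removed."""
    return {k[len(prefix):]: v for k, v in state_dict.items()
            if k.startswith(prefix)}

# Possible usage (path from this thread; requires torch):
#   ckpt = torch.load(
#       "datasets/R2R/trained_models/vitbase-6tasks-pretrain-e2e/model_step_22000.pt",
#       map_location="cpu")
#   vit_state = filter_backbone(ckpt.get("model", ckpt))
#   vit.load_state_dict(vit_state)
```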
Many thanks for your quick reply! I'll try the code and features!
Hi Jialu,
Thanks for your great work!
I'm reproducing your HAMT results on the R2R dataset (Table 2 in the paper), and I'm wondering which visual encoder you used. In the README you said "On Room-to-Room dataset, the features for HAMT are extracted with the visual backbone in the pre-trained model (not fine-tuned on VLN task) released in HAMT", so it seems the encoder should be the ViT fine-tuned in stage 2 of HAMT's pretraining process (I think it's from datasets/R2R/trained_models/vitbase-6tasks-pretrain-e2e/model_step_22000.pt released in HAMT, which is the ImageNet-pretrained ViT tuned during pretraining, since row 1 in Table 2 is the IL+RL fine-tuned result using this encoder). However, in your paper you said you extracted features from CLIP ViT-B/16 as well as using the released HAMT features in this repo, so it's quite confusing to me which encoder you used. Could you give some implementation details and suggestions for reproducing the results in Table 2?
Many thanks!
Zun