Closed wz0919 closed 2 years ago
Hi Zun,
I'm sorry for the confusion. We use the features extracted with the visual backbone in the HAMT pre-trained model, which were extracted with datasets/R2R/trained_models/vitbase-6tasks-pretrain-e2e/model_step_22000.pt. I just updated the repo and added the code that we use for HAMT.
Please let me know if you have further questions.
Jialu
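For anyone reading along, the step described above (taking only the visual backbone out of the full HAMT checkpoint so its features can be extracted separately) might look roughly like this. This is a minimal sketch of my own, not code from the repo; the "vln_bert.vis_encoder." key prefix is hypothetical, so print the checkpoint's keys first and adjust it to whatever the released model actually uses.

```python
# Sketch (not from the repo): keep only the visual-backbone weights from the
# full HAMT checkpoint, stripping their prefix so the result matches a bare
# ViT's state_dict. The prefix below is an assumption -- inspect the
# checkpoint's keys and change it as needed.

def filter_backbone(state_dict, prefix="vln_bert.vis_encoder."):
    """Return only the entries under `prefix`, with the prefix removed."""
    return {k[len(prefix):]: v for k, v in state_dict.items()
            if k.startswith(prefix)}

# Possible usage (path from this thread; requires torch):
#   ckpt = torch.load(
#       "datasets/R2R/trained_models/vitbase-6tasks-pretrain-e2e/model_step_22000.pt",
#       map_location="cpu")
#   vit_state = filter_backbone(ckpt.get("model", ckpt))
#   vit.load_state_dict(vit_state)
```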
Many thanks for your quick reply! I'll try the code and features!
Hi Jialu,
Thanks for your great work!
I'm reproducing your HAMT results on the R2R dataset (Table 2 in the paper), and I'm wondering which visual encoder you used. In the README you said "On Room-to-Room dataset, the features for HAMT are extracted with the visual backbone in the pre-trained model (not fine-tuned on VLN task) released in HAMT", so it seems the encoder should be the ViT fine-tuned in stage 2 of HAMT's pretraining process (I think it's from datasets/R2R/trained_models/vitbase-6tasks-pretrain-e2e/model_step_22000.pt released in HAMT, which is the ImageNet-pretrained ViT tuned during pretraining, since row 1 in Table 2 is the IL+RL fine-tuned result using this encoder). However, in your paper you said you extracted features from CLIP ViT-B/16 as well as using the released HAMT features in this repo, so it's quite confusing to me which encoder you used. Could you give some implementation details and suggestions for reproducing the results in Table 2?
Many thanks!
Zun