Thanks for sharing the code. However, I'm quite confused by the QGM code, as its naming differs a little from the original paper (if I understand it correctly...).
I think the code for that module is defined in the function lang_tf_enc in model/transformer_model.py.
As Figure 4 suggests, the input vision features should be the raw vision features extracted from the vision backbone network. Yet the input to this function is Fm_query, which is a fusion of vision and language features (see the function make_multitask_braches in model/vlt_model.py).
Can you tell me if I got it wrong? Thanks for your great patience.