yiranyyu opened this issue 2 years ago
I downloaded the model weights pre-trained on VG&COCO and the pre-processed features following the instructions in the README. Then I tested the zero-shot grounding performance of VL-T5 on the RefCOCOg dataset following the guidance. However, the performance on both the val and test splits is zero, which really confuses me.
Then I tested the few-shot performance of VL-T5 and got a reasonable result (44.53% accuracy on the val split with four samples). I was wondering whether the weights that were not used when initializing the RefCOCO model from the pre-trained checkpoint (see the log below) are what causes such a big gap between the zero-shot and few-shot performance.
Command to Reproduce the Results
```shell
cd VL-T5/

# modify scripts/RefCOCOg_VLT5.sh to set the `lr` param to 0 and `epochs` to 1
vim scripts/RefCOCOg_VLT5.sh

# modify line 304 of src/refcoco.py from `>` to `>=` to save the zero-acc checkpoint for testing
vim src/refcoco.py

# run the training script
bash scripts/RefCOCOg_VLT5.sh 4
```
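For context, the `>` → `>=` change above just relaxes the checkpoint-saving comparison so that a zero-accuracy epoch still produces a checkpoint. A minimal sketch of what that amounts to (the function and variable names here are my own illustration, not the actual code in `src/refcoco.py`):

```python
# Hypothetical sketch of the checkpoint-saving condition in src/refcoco.py.
# Names are assumed for illustration, not copied from the repository.

def should_save(valid_score: float, best_so_far: float) -> bool:
    # Original condition: valid_score > best_so_far
    #   -> with best_so_far initialized to 0, a 0.0 accuracy never saves.
    # Changed condition:  valid_score >= best_so_far
    #   -> the first (zero-accuracy) epoch still writes a checkpoint
    #      that can then be evaluated on the test split.
    return valid_score >= best_so_far

print(should_save(0.0, 0.0))  # True: zero accuracy still triggers a save
```

With the original strict `>`, `should_save(0.0, 0.0)` would be `False`, so no checkpoint would exist for the zero-shot test run.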
Logs and Other Information
Log
```
Building Model at GPU 0
Building Model at GPU 3
Building Model at GPU 1
Building Model at GPU 2
Some weights of VLT5RefCOCO were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.visual_embedding.feat_embedding.0.weight', 'encoder.visual_embedding.feat_embedding.0.bias', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.bias', 'encoder.visual_embedding.obj_order_embedding.weight', 'encoder.visual_embedding.img_order_embedding.weight', 'encoder.visual_embedding.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model Launching at GPU 3
Model Launching at GPU 1
Model Launching at GPU 2
Model loaded from snap/pretrain/VLT5/Epoch30.pth
_IncompatibleKeys(missing_keys=[], unexpected_keys=['encoder.visual_embedding.feat_embedding.1.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.1.weight'])
```
Script
Content of `scripts/RefCOCOg_VLT5.sh` (only the `lr` and `epochs` params changed):

```shell
# The name of experiment
name=VLT5

output=snap/refcocog/$name

PYTHONPATH=$PYTHONPATH:./src \
python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    src/refcoco.py \
    --distributed --multiGPU \
    --train train \
    --valid val \
    --test test \
    --optim adamw \
    --warmup_ratio 0.1 \
    --clip_grad_norm 5 \
    --lr 0e-5 \
    --epochs 1 \
    --num_workers 4 \
    --backbone 't5-base' \
    --output $output ${@:2} \
    --load snap/pretrain/VLT5/Epoch30 \
    --batch_size 90 \
```
Platform
OS: Ubuntu GPU: A100
Update:
It seems the unexpected_keys warning is not the reason for the low performance. The unexpected_keys message disappears when I use the model further pre-trained on VCR, but the val and test performance is still low (roughly 0.6% on both splits). We then tried constraining the decoding to generate only `vis_extra_id_*` tokens, which results in about 1% accuracy on test.
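For reference, the constrained decoding we tried boils down to masking the output distribution at each step so that only `vis_extra_id_*` token ids can be selected. A minimal self-contained sketch of that masking step (the vocabulary size and the allowed id range are assumptions for illustration, not VL-T5's actual vocab layout):

```python
# Hypothetical sketch of constraining generation to vis_extra_id_* tokens.
# The allowed id set below is illustrative, not VL-T5's real vocabulary layout.

def mask_to_visual_tokens(logits, allowed_ids):
    """Set every logit outside allowed_ids to -inf so that argmax (or
    sampling) can only ever pick an allowed vis_extra_id_* token."""
    neg_inf = float("-inf")
    return [
        score if idx in allowed_ids else neg_inf
        for idx, score in enumerate(logits)
    ]

# Example: vocabulary of 6 tokens; pretend ids {3, 4, 5} are vis_extra_id_*.
allowed = {3, 4, 5}
logits = [2.0, 5.0, 1.0, 0.5, 3.0, -1.0]
masked = mask_to_visual_tokens(logits, allowed)
best = max(range(len(masked)), key=masked.__getitem__)
print(best)  # -> 4: the highest-scoring *allowed* token, not id 1
```

In a HuggingFace-style generation loop the same effect can be achieved via the `prefix_allowed_tokens_fn` argument of `generate`, which returns the allowed id list at each decoding step.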