j-min / VL-T5

PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)
https://arxiv.org/abs/2102.02779
MIT License

Extremely low zero-shot performance (0% acc on both val and test) on RefCOCOg #30

Open yiranyyu opened 2 years ago

yiranyyu commented 2 years ago

I downloaded the model weights pre-trained on VG&COCO and the pre-processed features following the instructions in the README. Then I tested the zero-shot grounding performance of VL-T5 on the RefCOCOg dataset following the guidance. However, the performance on both the val and test splits is zero, which really confuses me.

Then I tested the few-shot performance of VL-T5 and got a reasonable result (44.53% acc on the val split with four samples). I was wondering whether the weights that are not used when initializing VLT5RefCOCO from the pre-trained checkpoint (see the log below) could cause such a large gap between the zero-shot and few-shot performance.
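For context, the warning in the log comes from PyTorch's `load_state_dict(strict=False)`, which reports the set difference between the model's parameter names and the checkpoint's. A minimal plain-Python sketch of that comparison (key names shortened from the log above):

```python
# Sketch: reproduce the missing/unexpected key diagnosis that
# load_state_dict(strict=False) reports via _IncompatibleKeys.
def diff_state_dicts(model_keys, ckpt_keys):
    missing = sorted(set(model_keys) - set(ckpt_keys))      # in model, not in checkpoint
    unexpected = sorted(set(ckpt_keys) - set(model_keys))   # in checkpoint, not in model
    return missing, unexpected

model_keys = {"encoder.visual_embedding.feat_embedding.0.weight",
              "encoder.visual_embedding.layer_norm.weight"}
ckpt_keys = {"encoder.visual_embedding.feat_embedding.0.weight",
             "encoder.visual_embedding.feat_embedding.1.weight"}

missing, unexpected = diff_state_dicts(model_keys, ckpt_keys)
# missing    -> ['encoder.visual_embedding.layer_norm.weight']
# unexpected -> ['encoder.visual_embedding.feat_embedding.1.weight']
```

A key that is "unexpected" is silently dropped rather than loaded, so the corresponding module keeps its fresh initialization.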

Command to Reproduce the Results

cd VL-T5/

# modify scripts/RefCOCOg_VLT5.sh to set the `lr` param to 0 and `epochs` to 1
vim scripts/RefCOCOg_VLT5.sh

# modify line 304 of src/refcoco.py from `>` to `>=` so the zero-acc checkpoint is saved for testing
vim src/refcoco.py

# run the training script (from VL-T5/)
bash scripts/RefCOCOg_VLT5.sh 4

Logs and Other Information

Log

Building Model at GPU 0
Building Model at GPU 3
Building Model at GPU 1
Building Model at GPU 2
Some weights of VLT5RefCOCO were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.visual_embedding.feat_embedding.0.weight', 'encoder.visual_embedding.feat_embedding.0.bias', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.bias', 'encoder.visual_embedding.obj_order_embedding.weight', 'encoder.visual_embedding.img_order_embedding.weight', 'encoder.visual_embedding.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model Launching at GPU 3
Model Launching at GPU 1
Model Launching at GPU 2
Model loaded from  snap/pretrain/VLT5/Epoch30.pth
_IncompatibleKeys(missing_keys=[], unexpected_keys=['encoder.visual_embedding.feat_embedding.1.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.1.weight'])

[Screenshot: Xnip2022-10-26_20-22-44]

Script

Content of scripts/RefCOCOg_VLT5.sh (only the lr and epochs params changed):

# The name of experiment
name=VLT5

output=snap/refcocog/$name

PYTHONPATH=$PYTHONPATH:./src \
python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    src/refcoco.py \
        --distributed --multiGPU \
        --train train \
        --valid val \
        --test test \
        --optim adamw \
        --warmup_ratio 0.1 \
        --clip_grad_norm 5 \
        --lr 0e-5 \
        --epochs 1 \
        --num_workers 4 \
        --backbone 't5-base' \
        --output $output ${@:2} \
        --load snap/pretrain/VLT5/Epoch30 \
        --batch_size 90 \

Platform

OS: Ubuntu, GPU: A100

yiranyyu commented 2 years ago

Update:

It seems the unexpected_keys warning is not the cause of the low performance. The unexpected_keys message disappears when I use the model further pre-trained on VCR, but the val and test performance is still low (roughly 0.6% on both splits). We then tried constraining the decoding to generate only vis_extra_id_ tokens, which yields about 1% accuracy on test.
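For anyone trying to reproduce the constrained decoding above, here is a hedged sketch. With HuggingFace transformers this kind of constraint is usually wired in via `generate(prefix_allowed_tokens_fn=...)`; the vis_extra_id_ token-id range below is made up for illustration and is not VL-T5's real id range:

```python
# Hypothetical sketch: restrict generation to <vis_extra_id_*> tokens only.
# The id range is an assumption for illustration, not VL-T5's actual ids.
VIS_EXTRA_IDS = list(range(32100, 32200))

def allow_only_vis_tokens(batch_id, input_ids):
    # HuggingFace-style prefix_allowed_tokens_fn: returns the token ids
    # the decoder is allowed to emit at the next generation step.
    return VIS_EXTRA_IDS

# usage (untested):
# outputs = model.generate(input_ids,
#                          prefix_allowed_tokens_fn=allow_only_vis_tokens)
```

Since the allowed set is the same at every step, the model can never emit ordinary text tokens, so any remaining errors would come from picking the wrong region token rather than from malformed output.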