An unofficial PyTorch implementation of "TransVG: End-to-End Visual Grounding with Transformers".
paper: https://arxiv.org/abs/2104.08541
Because of some implementation details, I cannot guarantee that this code reproduces the performance reported in the paper.
If you have any questions about the code, please feel free to ask.
There was a bug involving the image mask in the transformer encoder. I fixed this bug and re-trained the model.

| Dataset | Acc@0.5 | URL |
|---|---|---|
| ReferItGame | val: 68.07 | Google drive |
| | test: 66.97 | Baidu drive [tbuq] |
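
For readers unfamiliar with the Acc@0.5 column, the sketch below shows one common way such a metric is computed: a predicted box counts as correct when its IoU with the ground-truth box exceeds 0.5. This is an illustration only and is not taken from this repository. The function names `box_iou` and `accuracy_at_iou`, the `(x1, y1, x2, y2)` box format, and the strict `>` comparison are all assumptions.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format (assumed format)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds threshold."""
    if len(pred_boxes) != len(gt_boxes):
        raise ValueError("pred_boxes and gt_boxes must have the same length")
    if not pred_boxes:
        return 0.0
    # Strict comparison is an assumption; some codebases use >= instead.
    hits = sum(box_iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(pred_boxes)
```

With one perfectly matching box and one non-overlapping box, `accuracy_at_iou` returns 0.5.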
Create the conda environment from the `environment.yaml` file:
conda env create -f environment.yaml
Activate the environment with:
conda activate transvg
Prepare the dataset directory and the pretrained DETR weights at the following paths:
./data
./saved_models/detr-r50-e632da11.pth
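
Before starting a long training run, it can help to confirm that both paths exist. The snippet below is a minimal sketch of such a check. The helper `missing_paths` and the constant `REQUIRED_PATHS` are hypothetical names introduced here and are not part of this repository.

```python
from pathlib import Path

# The two paths listed above; adjust them if your layout differs.
REQUIRED_PATHS = ("./data", "./saved_models/detr-r50-e632da11.pth")


def missing_paths(root=".", required=REQUIRED_PATHS):
    """Return the entries of `required` that do not exist under `root`."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]
```

Calling `missing_paths()` from the repository root returns an empty list when both the data directory and the weights file are in place.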
Train the model using the following command:
python train.py --data_root XXX --dataset {dataset_name} --gpu {gpu_id}
Evaluate the model using the following command:
python train.py --test --resume {saved_model_path} --data_root XXX --dataset {dataset_name} --gpu {gpu_id}
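
If you want to launch training and evaluation from a script, the two commands above can be assembled programmatically. The sketch below uses only the flags shown above (`--data_root`, `--dataset`, `--gpu`, `--test`, `--resume`). The helper `build_command` is a hypothetical name and is not part of this repository.

```python
def build_command(data_root, dataset, gpu, resume=None):
    """Build the argument list for train.py.

    With `resume` set, the evaluation form (--test --resume ...) is produced;
    otherwise the training form is produced.
    """
    cmd = ["python", "train.py"]
    if resume is not None:
        cmd += ["--test", "--resume", str(resume)]
    cmd += ["--data_root", str(data_root), "--dataset", str(dataset), "--gpu", str(gpu)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.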
Thanks to the authors of DETR and ReSC. My code is based on their implementations.