Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds

By Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki.

Official implementation of "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", accepted by ECCV 2022.

Note:

This is the code for the 3D BUTD-DETR. For the 2D version check the branch bdetr2d.

Install

Requirements

We showcase the installation for CUDA 11.1 and torch==1.10.2, which is what we used for our experiments. If you need to use a different version, you can try to modify environment.yml accordingly.

Install environment: conda env create -f environment.yml --name bdetr3d
Activate environment: conda activate bdetr3d
Install torch: pip install -U torch==1.10.2 torchvision==0.11.3 --extra-index-url https://download.pytorch.org/whl/cu111
Compile the CUDA layers for PointNet++, which we used in the backbone network: sh init.sh

Data preparation

Download ScanNet v2 data HERE. Let DATA_ROOT be the path to the folder that contains the downloaded annotations. Under DATA_ROOT there should be a folder scans. Under scans there should be folders with names like scene0001_01. We provide a script to download only the annotations relevant to our task: run python scripts/download_scannet_files.py. Note that the original ScanNet script is written for python2.
Download ReferIt3D annotations following the instructions HERE. Place all .csv files under DATA_ROOT/refer_it_3d/.
Download ScanRefer annotations following the instructions HERE. Place all files under DATA_ROOT/scanrefer/.
Download the object detector's outputs and unzip them inside DATA_ROOT. Here is the Group-Free checkpoint we used to get these boxes, in case you need it.
Download the span predictor's outputs inside DATA_ROOT: ScanRefer_train, ScanRefer_val, SR3D, NR3D.
(Optional) Download the PointNet++ checkpoint into DATA_ROOT.
Run python prepare_data.py --data_root DATA_ROOT specifying your DATA_ROOT. This will create two .pkl files and only needs to run once.
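Before running prepare_data.py, it can help to confirm that DATA_ROOT has the layout described in the steps above. The following is a minimal sketch, not part of the repository: the function name check_data_root is hypothetical, and it only checks the folders named in this section (scans with scene folders, refer_it_3d with .csv files, scanrefer with some files). It does not validate the detector or span predictor outputs, whose file names are not listed here.

```python
import re
import tempfile
from pathlib import Path


def check_data_root(data_root):
    """Return a list of problems found under data_root (empty if none).

    Hypothetical helper: it only checks the folder layout described in
    the data preparation steps, not the contents of the files.
    """
    root = Path(data_root)
    problems = []

    # DATA_ROOT/scans should contain folders named like scene0001_01.
    scans = root / "scans"
    if not scans.is_dir():
        problems.append("missing folder: scans")
    else:
        scenes = [
            p for p in scans.iterdir()
            if p.is_dir() and re.fullmatch(r"scene\d{4}_\d{2}", p.name)
        ]
        if not scenes:
            problems.append("no scene folders (e.g. scene0001_01) under scans")

    # ReferIt3D annotations: all .csv files go under DATA_ROOT/refer_it_3d/.
    refer = root / "refer_it_3d"
    if not refer.is_dir() or not any(refer.glob("*.csv")):
        problems.append("no .csv files under refer_it_3d")

    # ScanRefer annotations: all files go under DATA_ROOT/scanrefer/.
    scanrefer = root / "scanrefer"
    if not scanrefer.is_dir() or not any(scanrefer.iterdir()):
        problems.append("no files under scanrefer")

    return problems


# Demo on an empty temporary folder: every check should report a problem.
with tempfile.TemporaryDirectory() as tmp:
    for problem in check_data_root(tmp):
        print(problem)
```

Calling check_data_root on your own DATA_ROOT and getting an empty list means the three annotation folders are in place.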

Usage

sh scripts/train_test_det.sh to train/test BUTD-DETR. You need to modify the script by providing DATA_ROOT.
sh scripts/train_test_cls.sh to train/test BUTD-DETR with ground-truth boxes (not classes). Again, you need to modify the script by providing DATA_ROOT.

The above scripts will run training and evaluation on SR3D. You can edit the following to customize training:

Use TRAIN_DATASET (can be sr3d, nr3d, scanrefer, scannet, sr3d+) to change the training dataset.
Use TEST_DATASET (does not have to be the same as TRAIN_DATASET) to change the validation dataset.
Add --eval to skip training and just evaluate.
To train on multiple datasets, e.g. on SR3D and NR3D simultaneously, set --TRAIN_DATASET sr3d nr3d.
On NR3D and ScanRefer, many more training epochs are needed to converge. It is better to monitor the validation accuracy and decrease the learning rate accordingly. For example, in the det setup, we decrease the learning rate at epochs 80 and 90 for NR3D and at epoch 65 for ScanRefer. To disable automatic learning rate decay, remove --lr_decay_epochs from the train script and manually decrease the learning rate when the validation accuracy converges. Be sure to add the --reduce_lr flag when decreasing the learning rate and continuing from a checkpoint, so that the optimizers are loaded correctly.
(Optional) To train a span predictor, cd src and run python text_cls.py --dataset DATASET.
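The step-wise learning rate decay mentioned above (epochs 80 and 90 for NR3D in the det setup) can be pictured with a small sketch. This is an illustration under assumptions, not the repository's scheduler: the function name lr_at_epoch, the decay factor gamma=0.1, the base learning rate of 1e-4, and the convention that the decay takes effect from the milestone epoch onward are all hypothetical.

```python
def lr_at_epoch(base_lr, epoch, decay_epochs, gamma=0.1):
    """Learning rate at a given epoch under step decay.

    The rate is multiplied by gamma once for every milestone in
    decay_epochs that has already been reached. gamma=0.1 is an
    assumed value, not one taken from the training scripts.
    """
    reached = sum(1 for milestone in decay_epochs if epoch >= milestone)
    return base_lr * gamma ** reached


# NR3D-style milestones from the text; base_lr is a placeholder value.
for epoch in (0, 79, 80, 89, 90):
    print(epoch, lr_at_epoch(1e-4, epoch, decay_epochs=[80, 90]))
```

With ScanRefer-style settings the call would use decay_epochs=[65] instead.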

Pre-trained checkpoints

Download our checkpoints for SR3D_det, NR3D_det, ScanRefer_det, SR3D_cls, NR3D_cls. Add --checkpoint_path CKPT_NAME to the above scripts in order to utilize the stored checkpoints.

Note that these checkpoints were stored while using DistributedDataParallel. To use them without DistributedDataParallel, take a look here.
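One common way to load weights saved from a DistributedDataParallel-wrapped model into a plain model is to strip the "module." prefix that the wrapper adds to every parameter name. The sketch below assumes the released checkpoints follow that pattern; the helper name strip_ddp_prefix is hypothetical, and the key under which the weights are stored inside the checkpoint file is not documented here, so the "model" key in the usage comments is an assumption as well.

```python
def strip_ddp_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from keys.

    Keys that do not start with the prefix are kept unchanged, so the
    function is safe to call on an already-clean state dict.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy example with plain values standing in for tensors.
wrapped = {"module.backbone.weight": 1, "module.head.bias": 2}
print(strip_ddp_prefix(wrapped))

# Hypothetical usage with PyTorch (checkpoint layout is an assumption):
#   checkpoint = torch.load(CKPT_NAME, map_location="cpu")
#   model.load_state_dict(strip_ddp_prefix(checkpoint["model"]))
```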

Lastly, we also release the checkpoints for span prediction (ScanRefer, SR3D, NR3D).

How does the evaluation work?

For each object query, we compute per-token confidence scores and regress bounding boxes.
Given a target span, we keep the most confident query for it. This is our model's best guess.
We compute the IoU of the corresponding box and the ground-truth box.
We check whether this IoU is greater than the thresholds (0.25, 0.5).
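The four steps above can be sketched in a few lines of Python. This is a simplified illustration rather than the repository's evaluation code. It assumes axis-aligned boxes in (xmin, ymin, zmin, xmax, ymax, zmax) format and scores a query for a span by summing its per-token confidences over the span's tokens; both choices, as well as the function names box_iou_3d and evaluate_target, are assumptions made for the example.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        inter *= high - low
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def evaluate_target(token_scores, boxes, span, gt_box, thresholds=(0.25, 0.5)):
    """Pick the most confident query for a span and test its box against gt.

    token_scores[q][t] is the confidence of query q for token t, boxes[q]
    is the box regressed by query q, and span lists the token indices of
    the target span. Summing scores over the span is an assumed way of
    turning per-token confidences into a per-span confidence.
    """
    span_conf = [sum(row[t] for t in span) for row in token_scores]
    best = max(range(len(boxes)), key=span_conf.__getitem__)
    iou = box_iou_3d(boxes[best], gt_box)
    return best, iou, {thr: iou > thr for thr in thresholds}


# Two queries, three tokens; the target span covers tokens 1 and 2.
scores = [[0.1, 0.9, 0.8], [0.7, 0.2, 0.1]]
boxes = [(0, 0, 0, 1, 1, 0.5), (5, 5, 5, 6, 6, 6)]
print(evaluate_target(scores, boxes, span=[1, 2], gt_box=(0, 0, 0, 1, 1, 1)))
```

A prediction counts as correct at a given threshold only when the IoU is strictly greater than it, matching the wording of the last step.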

Acknowledgements

Parts of this code were based on the codebase of Group-Free. The loss implementation (Hungarian matching and criterion class) is based on the codebase of MDETR.

Citing BUTD-DETR

If you find BUTD-DETR useful in your research, please consider citing:

@misc{https://doi.org/10.48550/arxiv.2112.08879,
        doi = {10.48550/ARXIV.2112.08879},
        url = {https://arxiv.org/abs/2112.08879},
        author = {Jain, Ayush and Gkanatsios, Nikolaos and Mediratta, Ishita and Fragkiadaki, Katerina},
        keywords = {Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL), FOS: Computer and information sciences},
        title = {Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds},
        publisher = {arXiv},
        year = {2021},
        copyright = {Creative Commons Attribution 4.0 International}
}

License

The majority of BUTD-DETR code is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: MDETR is licensed under the Apache 2.0 license, and Group-Free is licensed under the MIT license.
