nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds

By Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki.

Official implementation of "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", accepted by ECCV 2022.


Note:

This is the code for 3D BUTD-DETR. For the 2D version, see the bdetr2d branch.

Install

Requirements

We showcase the installation for CUDA 11.1 and torch==1.10.2, the versions we used for our experiments. If you need different versions, try modifying environment.yml accordingly.
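As a quick sanity check after setting up the environment, a small script along these lines can report whether the installed versions match the ones stated above. This is only a sketch: the helper name `version_mismatches` is hypothetical and not part of this repository; it relies on `torch.__version__` and `torch.version.cuda`, which PyTorch exposes.

```python
# Versions stated in this README.
EXPECTED_TORCH = "1.10.2"
EXPECTED_CUDA = "11.1"


def version_mismatches(torch_version, cuda_version,
                       expected_torch=EXPECTED_TORCH,
                       expected_cuda=EXPECTED_CUDA):
    """Return a list of human-readable mismatches (empty if all match).

    Hypothetical helper for illustration; not part of the BUTD-DETR codebase.
    """
    problems = []
    # torch.__version__ may carry a local build suffix such as "+cu111".
    base_version = torch_version.split("+")[0]
    if base_version != expected_torch:
        problems.append(f"torch {base_version} != {expected_torch}")
    # torch.version.cuda is None for CPU-only builds.
    if cuda_version != expected_cuda:
        problems.append(f"CUDA {cuda_version} != {expected_cuda}")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        issues = version_mismatches(torch.__version__, torch.version.cuda)
        for line in issues or ["Versions match the README."]:
            print(line)
```

A mismatch does not necessarily mean the code will fail; it only indicates that the environment differs from the one used for the reported experiments.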

Data preparation

Usage

The above scripts will run training and evaluation on SR3D. You can edit the following to customize training:

Pre-trained checkpoints

Download our checkpoints for SR3D_det, NR3D_det, ScanRefer_det, SR3D_cls, NR3D_cls. Add --checkpoint_path CKPT_NAME to the above scripts to use the stored checkpoints.

Note that these checkpoints were stored while using DistributedDataParallel. To load them without DistributedDataParallel, take a look here.
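For context, models wrapped in DistributedDataParallel usually save their parameters with a `module.` prefix on every key, so a common workaround is to strip that prefix before loading into an unwrapped model. The sketch below illustrates the idea; the helper name `strip_ddp_prefix` and the `"model"` checkpoint key in the usage comment are assumptions, not confirmed details of these checkpoints.

```python
from collections import OrderedDict


def strip_ddp_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading DDP prefix removed from keys.

    Hypothetical helper for illustration; keys without the prefix are kept
    unchanged, and the input mapping is not modified.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Possible usage with PyTorch (the "model" key is an assumption about the
# checkpoint layout; inspect checkpoint.keys() to confirm):
# import torch
# checkpoint = torch.load("CKPT_NAME", map_location="cpu")
# model.load_state_dict(strip_ddp_prefix(checkpoint["model"]))
```

Stripping only a leading prefix, rather than replacing `module.` everywhere, avoids accidentally renaming submodules whose own names happen to contain that string.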

Lastly, we also release the checkpoints for span prediction (ScanRefer, SR3D, NR3D).

How does the evaluation work?

Acknowledgements

Parts of this code are based on the codebase of Group-Free. The loss implementation (Hungarian matching and criterion class) is based on the codebase of MDETR.

Citing BUTD-DETR

If you find BUTD-DETR useful in your research, please consider citing:

@misc{https://doi.org/10.48550/arxiv.2112.08879,
        doi = {10.48550/ARXIV.2112.08879},
        url = {https://arxiv.org/abs/2112.08879},
        author = {Jain, Ayush and Gkanatsios, Nikolaos and Mediratta, Ishita and Fragkiadaki, Katerina},
        keywords = {Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL), FOS: Computer and information sciences},
        title = {Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds},
        publisher = {arXiv},
        year = {2021},
        copyright = {Creative Commons Attribution 4.0 International}
}

License

The majority of the BUTD-DETR code is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: MDETR is licensed under the Apache 2.0 license, and Group-Free is licensed under the MIT license.