UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

Project page: https://dachuanshi.com/UPop-Project/ | License: BSD 3-Clause

🧐 A Quick Look

🥳 What's New

🏃 Installation

The code is tested with PyTorch==1.11.0, CUDA==11.3.1, and Python==3.8.13. Dependencies can be installed with:

```bash
conda env create -f environment.yml
```

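Then activate the environment before running any of the scripts below. The environment name `upop` in this sketch is an assumption; the actual name is whatever the `name:` field in environment.yml defines:

```bash
# The environment name comes from the "name:" field in environment.yml;
# "upop" here is an assumed placeholder.
conda activate upop
```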

🚀 Visual Reasoning on the NLVR2 Dataset
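As an illustration of how a run is launched, here is a minimal sketch for compressing on NLVR2. Only the script name compress_nlvr.py and the folder names are taken from this repository; the distributed launch follows standard PyTorch usage, and every flag after the script name is a hypothetical placeholder (check the script's argparse for the real ones):

```bash
# Hypothetical flags; verify with `python compress_nlvr.py --help`.
# Paths follow the expected folder structure shown later in this README.
python -m torch.distributed.run --nproc_per_node=8 compress_nlvr.py \
    --pretrained pretrained/model_base_nlvr.pth \
    --config configs/nlvr.yaml \
    --output_dir output/nlvr_compression
```

The remaining 🚀 sections follow the same pattern with their respective compress_*.py scripts (compress_caption.py, compress_deit.py, and so on).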

🚀 Image Captioning on the COCO Caption Dataset

🚀 Visual Question Answering on the VQAv2 Dataset

🚀 Image-Text and Text-Image Retrieval on the COCO Dataset

🚀 Image-Text and Text-Image Retrieval on the Flickr30K Dataset

🚀 Image-Text and Text-Image Retrieval on the COCO Dataset with CLIP

🚀 Image-Text and Text-Image Retrieval on the Flickr30K Dataset with CLIP

🚀 Image Classification on the ImageNet Dataset

🚀 Image Segmentation on the ADE20K Dataset

📑 Common Issues

1. Evaluation with a single GPU

2. Compression with a single GPU

3. Out of memory during evaluation

4. Out of memory during compression

(A single-GPU and reduced-memory launch sketch follows this list.)
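The sketch below assumes the scripts use standard torch.distributed launching; the flags after the script name and the config key name are hypothetical placeholders:

```bash
# Issues 1-2: run evaluation or compression with a single process/GPU.
# Flags after the script name are hypothetical; check `--help`.
python -m torch.distributed.run --nproc_per_node=1 compress_nlvr.py \
    --config configs/nlvr.yaml --output_dir output/nlvr_single_gpu

# Issues 3-4: lower the batch size in the task config before launching.
# The exact key name is an assumption; inspect the YAML files in configs/.
sed -i 's/^batch_size:.*/batch_size: 8/' configs/nlvr.yaml
```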

🌲 Expected Folder Structures

```text
├── annotation
│   ├── answer_list.json
│   ├── coco_gt
│   │   ├── coco_karpathy_test_gt.json
│   │   └── coco_karpathy_val_gt.json
│   ├── ...
├── clip
├── compress_caption.py
├── compress_deit.py
├── compress_nlvr.py
├── compress ...
├── configs
├── data
├── datasets
│   └── vision
│       ├── coco
│       ├── flickr
│       ├── NLVR2
│       ├── ...
├── deit
├── log
├── models
├── output
├── pretrained
│   ├── bert-base-uncased
│   ├── clip_large_retrieval_coco.pth
│   ├── clip_large_retrieval_flickr.pth
│   ├── ...
├── segm
├── transform
└── utils.py
```
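To match this layout, downloaded datasets can be placed, or symlinked, under datasets/vision/. A minimal sketch, assuming the data already lives at hypothetical paths elsewhere on disk:

```bash
mkdir -p datasets/vision pretrained
# Source paths are hypothetical placeholders; point them at your local copies.
ln -s /path/to/coco datasets/vision/coco
ln -s /path/to/flickr30k datasets/vision/flickr
ln -s /path/to/NLVR2 datasets/vision/NLVR2
```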

💬 Acknowledgments

This code is built upon BLIP, CLIP, DeiT, Segmenter, and timm. Thanks to the authors of these awesome open-source projects!

✨ Citation

If you find our work or this code useful, please consider citing the corresponding paper:

```bibtex
@InProceedings{pmlr-v202-shi23e,
  title = {{UP}op: Unified and Progressive Pruning for Compressing Vision-Language Transformers},
  author = {Shi, Dachuan and Tao, Chaofan and Jin, Ying and Yang, Zhendong and Yuan, Chun and Wang, Jiaqi},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {31292--31311},
  year = {2023},
  volume = {202},
  publisher = {PMLR}
}
```