snap-research / Panda-70M

[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
https://snap-research.github.io/Panda-70M/
528 stars 19 forks source link

🐼 Panda-70M

This is the offical Github repository of Panda-70M.

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov
Computer Vision and Pattern Recognition (CVPR) 2024

arXiv Project Page YouTube

Introduction

Panda-70M is a large-scale dataset with 70M high-quality video-caption pairs. This repository have three sections:

🔥 Updates (Oct 2024)

To enhance the training of video generation models, which are intereted at single-shot videos with meaningful motion and aesthetically pleasing scenes, we introduce two additional annotations:

desirable (80.5%) 0_low_desirable_score (5.28%) 1_still_foreground_image (6.82%)
2_tiny_camera_movement (1.20%) 3_screen_in_screen (5.03%) 4_computer_screen_recording (1.13%)
**We will remove the video samples from our dataset / Github / project webpage / technical presentation as long as you need it. Please contact tsaishienchen at gmail dot com for the request. ## Dataset ### Collection Pipeline

### Download | Split | Download | # Source Videos | # Samples | Video Duration | Storage Space| |-----------------|----------|-----------------|-----------|----------------|--------------| | Training (full) | [link](https://drive.google.com/file/d/1pbh8W3qgst9CD7nlPhsH9wmUSWjQlGdW/view?usp=sharing) (2.73 GB) | 3,779,763 | 70,723,513 | 167 khrs | ~36 TB | | Training (10M) | [link](https://drive.google.com/file/d/1LLOFeYw9nZzjT5aA1Wj4oGi5yHUzwSk5/view?usp=sharing) (504 MB) | 3,755,240 | 10,473,922 | 37.0 khrs | ~8.0 TB | | Training (2M) | [link](https://drive.google.com/file/d/1k7NzU6wVNZYl6NxOhLXE7Hz7OrpzNLgB/view?usp=sharing) (118 MB) | 800,000 | 2,400,000 | 7.56 khrs | ~1.6 TB | | Validation | [link](https://drive.google.com/file/d/1uHR5iXS3Sftzw6AwEhyZ9RefipNzBAzt/view?usp=sharing) (1.2 MB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB | | Testing | [link](https://drive.google.com/file/d/1BZ9L-157Au1TwmkwlJV8nZQvSRLIiFhq/view?usp=sharing) (1.2 MB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB | More details can be found in [Dataset Dataloading](./dataset_dataloading) section. ## Demonstration ### Video-Caption Pairs in Panda-70M
A rhino and a lion are fighting in the dirt. A person is holding a long haired dachshund in their arms. A rocket launches into space on the launch pad.
A person is kneading dough and putting jam on it. A little boy is playing with a basketball in the city. A 3d rendering of a zoo with animals and a train.
A person in blue gloves is connecting an electrical supply to an injector. There is a beach with waves and rocks in the foreground, and a city skyline in the background. It is a rally car driving on a dirt road in the countryside, with people watching from the side of the road.
**We will remove the video samples from our dataset / Github / project webpage / technical presentation as long as you need it. Please contact tsaishienchen at gmail dot com for the request. Please check [here](https://snap-research.github.io/Panda-70M/more_samples) for more samples. ### Long Video Splitting and Captioning https://github.com/snap-research/Panda-70M/assets/3857997/8144cf3d-c20c-4c18-a4bd-011451da9f9b https://github.com/snap-research/Panda-70M/assets/3857997/b262128e-2152-41e8-873e-db2dc275c40f ## License of Panda-70M See [license](https://github.com/snap-research/Panda-70M/blob/main/LICENSE). The video samples are collected from a publicly available dataset. Users must follow [the related license](https://raw.githubusercontent.com/microsoft/XPretrain/main/hd-vila-100m/LICENSE) to use these video samples. ## Citation If you find this project useful for your research, please cite our paper. :blush: ```bibtex @inproceedings{chen2024panda70m, title = {Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers}, author = {Chen, Tsai-Shien and Siarohin, Aliaksandr and Menapace, Willi and Deyneka, Ekaterina and Chao, Hsiang-wei and Jeon, Byung Eun and Fang, Yuwei and Lee, Hsin-Ying and Ren, Jian and Yang, Ming-Hsuan and Tulyakov, Sergey}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, year = {2024} } ``` ## Contact Information **Tsai-Shien Chen**: [tsaishienchen@gmail.com](mailto:tsaishienchen@gmail.com)