rajveerb / ml-pipeline-benchmark


Benchmarking videos #2

Open rajveerb opened 10 months ago

rajveerb commented 10 months ago

Similar to the benchmarks created for image classification, we would like to create a benchmark for video datasets.

A few tasks for tackling video datasets:

  1. Identify ML training tasks related to video datasets, analogous to image classification for image datasets.
  2. Identify video datasets commonly used for the tasks in 1.
  3. Identify ML models used for the above tasks; ideally, find smaller models.
  4. Identify preprocessing operations applied to video datasets; for instance, images use decoding, RandomResizedCrop, RandomHorizontalFlip, ToTensor, and Normalize. For images, @rajveerb referred to MLPerf's example. Maybe look at research papers as well.
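
For concreteness, the image-side pipeline in 4 might look like this in torchvision (a minimal sketch; the file name and the ImageNet normalization constants are placeholders, not values from our benchmark):

```python
# Sketch of the image preprocessing chain from item 4, using torchvision.
from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.RandomResizedCrop(224),    # random crop, resized to 224x224
    T.RandomHorizontalFlip(),    # flip with probability 0.5
    T.ToTensor(),                # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet stats
                std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg").convert("RGB")  # decoding step
tensor = preprocess(img)                        # (3, 224, 224) float tensor
```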
harshithlanka3 commented 9 months ago

My research logs regarding the above issues so far: https://docs.google.com/document/d/10rrR1QECqUdl7mIzCgBa5y--bY1zZwlMnJRYzyXAVkw/edit?usp=sharing

rajveerb commented 9 months ago

@harshithlanka3

Great! Can you add them here, please?

harshithlanka3 commented 9 months ago

Should I just add them as comments directly here?

rajveerb commented 9 months ago

@harshithlanka3

yes

harshithlanka3 commented 9 months ago
  1. Identify ML training tasks related to video datasets, analogous to image classification for image datasets.

Video Classification: attributing a label to a given video, e.g. the sport being played.

Action/Gesture Recognition: recognizing human actions/gestures within a given video.

Event Detection: detecting specific events in a video, e.g. accidents on a road or unusual activity in a crowd.

Scene Classification: categorizing videos based on the scene/environment in the video: indoor scenes, outdoor scenes, urban landscapes, etc.

Object Detection/Tracking: locating objects of interest in a video and tracking those objects.

Overall, there are a large number of possible training tasks related to video datasets.

  2. Identify video datasets commonly used for the tasks in 1.

Found this GitHub page listing relevant datasets for different kinds of video-based machine learning models:

https://github.com/xiaobai1217/Awesome-Video-Datasets

Other large/well-known video datasets:

UCF101: 13,320 videos and 101 action classes; good for action recognition.

HMDB51: 6,849 videos and 51 action classes.

Kinetics: 400,000 videos and 600 action classes.

YouTube-8M: YouTube video URLs and 4,716 vocabulary classes; general classification tasks.

Sports-1M: 1 million videos from 487 sports classes; good for video classification.
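
As a side note, UCF101 ships with torchvision, so a loader is cheap to try (a sketch; the paths below are placeholders, and the videos plus the official split files have to be downloaded separately):

```python
# Minimal sketch of loading UCF101 clips with torchvision.
from torchvision.datasets import UCF101

dataset = UCF101(
    root="data/UCF-101",                      # extracted .avi files
    annotation_path="data/ucfTrainTestlist",  # official train/test split files
    frames_per_clip=16,                       # clip length in frames
    step_between_clips=16,                    # non-overlapping clips
    train=True,
)

video, audio, label = dataset[0]  # video: (T, H, W, C) uint8 tensor
print(video.shape, label)
```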

  3. Identify ML models used for the above tasks; ideally, find smaller models.

Video Classification: CNN-LSTM models use Convolutional Neural Networks (CNNs) for spatial features and Long Short-Term Memory (LSTM) networks for temporal features (might be too big for our use case). 3D CNNs: models like C3D (Convolutional 3D) or R(2+1)D.

Action/Gesture Recognition: I3D (Inflated 3D ConvNet) for action recognition in videos; smaller variants of I3D can be used for faster inference. Temporal Convolutional Networks (TCNs) are lightweight and can be used for gesture recognition.

Event Detection: Two-Stream Networks combine two CNN streams (one spatial, one for optical flow) and fuse their features to detect events; smaller versions of the CNNs can be used here. Single Shot MultiBox Detector (SSD): for detecting events like accidents.

Object Detection/Tracking: YOLO (You Only Look Once) models such as YOLOv3-tiny or YOLOv4-tiny. SORT (Simple Online and Realtime Tracking) is a simple yet effective choice for tracking, but it needs to be paired with a detector model.
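
Of the smaller options above, R(2+1)D-18 is the easiest to try since torchvision ships it with Kinetics-400 pretrained weights (a minimal sketch; the random clip is a stand-in for real data):

```python
import torch
from torchvision.models.video import r2plus1d_18

model = r2plus1d_18(pretrained=True).eval()  # Kinetics-400 weights

# Dummy clip: batch of 1, 3 channels, 16 frames, 112x112 spatial size.
clip = torch.randn(1, 3, 16, 112, 112)
with torch.no_grad():
    logits = model(clip)  # (1, 400) Kinetics-400 class logits
print(logits.argmax(dim=1))
```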

Identify preprocessing operations applied to video datasets; for instance, images use decoding, RandomResizedCrop, RandomHorizontalFlip, ToTensor, and Normalize. For images, @rajveerb referred to MLPerf's example. Maybe look at research papers as well.

Some papers that I need to read over/get to:

Learning Spatiotemporal Features with 3D Convolutional Networks

Beyond Short Snippets: Deep Networks for Video Classification

Unsupervised Learning of Video Representations using LSTMs

Simple Online and Realtime Tracking

In general, what I have noticed for preprocessing so far: video loading, frame resizing, temporal sampling (for certain tasks, not all), data augmentation (flipping, rotating, etc.), normalization, and batching.
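
Strung together, those steps might look roughly like this with torchvision (a sketch; the file name, sampling stride, and normalization stats are assumptions):

```python
import torch
import torchvision.transforms as T
from torchvision.io import read_video

# Video loading (decode to frames).
frames, _, _ = read_video("video.mp4", pts_unit="sec")  # (T, H, W, C) uint8

# Temporal sampling: keep every 4th frame.
frames = frames[::4]

# Frame resizing, augmentation, and normalization.
frames = frames.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W)
transform = T.Compose([
    T.Resize((128, 171)),
    T.RandomHorizontalFlip(),  # same flip applied to the whole clip
    T.Normalize(mean=[0.45, 0.45, 0.45], std=[0.225, 0.225, 0.225]),
])
frames = transform(frames)

# Batching: (N, C, T, H, W) layout for 3D CNNs.
batch = frames.permute(1, 0, 2, 3).unsqueeze(0)
print(batch.shape)
```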

rajveerb commented 9 months ago

From Harshith's research logs:

Preprocessing on C3D:
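
Roughly, per the C3D paper, frames are resized to 128x171, 16-frame clips are sampled, and random 112x112 spatial crops are taken. A sketch (assuming `frames` is a (T, C, H, W) float tensor):

```python
import torch
import torchvision.transforms as T

def c3d_clip(frames: torch.Tensor, clip_len: int = 16) -> torch.Tensor:
    frames = T.Resize((128, 171))(frames)  # per-paper frame resize
    # Temporal crop: random 16-frame window.
    start = torch.randint(0, frames.shape[0] - clip_len + 1, (1,)).item()
    clip = frames[start:start + clip_len]
    # Random 112x112 spatial crop.
    i = torch.randint(0, 128 - 112 + 1, (1,)).item()
    j = torch.randint(0, 171 - 112 + 1, (1,)).item()
    clip = clip[:, :, i:i + 112, j:j + 112]
    return clip.permute(1, 0, 2, 3)  # (C, T, H, W) for 3D convolutions
```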

Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark:

harshithlanka3 commented 8 months ago

I could not really find scene classification datasets for video, as they were all applications of image scene recognition models. I shifted my focus from that to action recognition instead, as I was able to find more relevant papers.

Here are my findings so far on two papers that I read:

HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences

Rolling Rotations for Recognizing Human Actions from 3D Skeletal Data

Code provided: http://ravitejav.weebly.com/rolling.html

Other datasets that should be looked at:

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Sports-1M: Large-scale Video Classification with Convolutional Neural Networks

Kinetics

Papers to read: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

rajveerb commented 8 months ago

The papers mentioned above are related to action recognition, but they are a bit old. I looked at papers citing them and found some recent work on this task with good datasets.

(Dated: 2021) LAEO-Net++: revisiting people Looking At Each Other in videos

This paper describes its preprocessing as data augmentation. It also covers a good set of video datasets.

kexinrong commented 8 months ago

http://activity-net.org/

harshithlanka3 commented 8 months ago

From ActivityNet: found an interactive tool to find papers from CVPR 2022: https://public.tableau.com/views/CVPR2022/Dashboard1?:showVizHome=no

Datasets I saw most often in CVPR 2022: ActivityNet, THUMOS14, UCF101, Kinetics.

This paper seemed to be cited a lot: https://ieeexplore.ieee.org/document/8454294. Augmentations used: random cropping, horizontal flipping, corner cropping, and scale jittering.
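
A sketch of what corner cropping plus scale jittering look like on a single frame (the scale set {1, 0.875, 0.75, 0.66} is the one commonly used in TSN implementations, an assumption here):

```python
import random
import torchvision.transforms.functional as F

SCALES = [1.0, 0.875, 0.75, 0.66]

def tsn_augment(img, out_size=224):
    """Corner cropping + scale jittering + horizontal flip on a PIL image."""
    w, h = img.size
    crop = int(min(w, h) * random.choice(SCALES))  # scale jittering
    # Corner cropping: one of the four corners or the center.
    positions = [(0, 0), (w - crop, 0), (0, h - crop),
                 (w - crop, h - crop), ((w - crop) // 2, (h - crop) // 2)]
    left, top = random.choice(positions)
    img = F.crop(img, top, left, crop, crop)
    if random.random() < 0.5:  # random horizontal flipping
        img = F.hflip(img)
    return F.resize(img, [out_size, out_size])
```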

harshithlanka3 commented 8 months ago

Paper I want to use: https://ieeexplore.ieee.org/document/8454294

Dataset I want to use: UCF101

Implementation found? (still struggling): https://github.com/yjxiong/temporal-segment-networks/tree/master

Found this as an update prior/during Friday meeting: https://github.com/yjxiong/tsn-pytorch/tree/master

harshithlanka3 commented 8 months ago

Notes:

harshithlanka3 commented 8 months ago

Updated GitHub code from the paper we found during our quick meeting on 10-20-2023: https://github.com/SilvioGiancola/SoccerNetv2-DevKit

For reference, here is the paper on CALF (Context Aware Loss Function): https://openaccess.thecvf.com/content_CVPR_2020/papers/Cioppa_A_Context-Aware_Loss_Function_for_Action_Spotting_in_Soccer_Videos_CVPR_2020_paper.pdf

Dataset used: SoccerNet-v2 (https://arxiv.org/pdf/2011.13367.pdf)

Data augmentation used: re-encoding twice on chunks of 2 minutes while looking for 5 actions at a time.


I don't really understand how the dataset is being used: https://github.com/SilvioGiancola/SoccerNetv2-DevKit/tree/main/Download

Source code for the entire model based on just this paper: https://github.com/SilvioGiancola/SoccerNetv2-DevKit/tree/main/Task1-ActionSpotting/CALF

There are newer variants of CALF from 2021 that are also available, but I believe we should decide on a model and just move forward with it. The models I found earlier, which use more traditional approaches to data augmentation, are still an option.

harshithlanka3 commented 8 months ago

Next steps: figure out where decoding is happening; figure out if decoding happens on the CPU or the GPU; actually figure out what the dataset is doing; schedule a meeting for Friday.

harshithlanka3 commented 7 months ago

[pipeline diagram: Stuff-106]

Above is a small diagram of what the pipeline looks like in general, specifying CPU and GPU utilization. The dataset is 500 games, around 90 minutes each, uncut. Video decoding happens during feature extraction and reduction, with a resize and/or crop.

Next steps: install the dataset and run the code (how?); decide whether we want to do feature extraction and reduction every single time; figure out how to make sure certain parts run on the CPU vs the GPU.

rajveerb commented 7 months ago

@harshithlanka3

In the above pipeline, which preprocessing stage's output does the GitHub code for the paper use?

harshithlanka3 commented 7 months ago

It uses the features after the PCA-512 step in the paper.
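
For intuition, that reduction step amounts to something like this (a sketch with random placeholder data; 2048 here is an assumed per-frame ResNet feature size):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder per-frame features: frames x 2048.
features = np.random.randn(10000, 2048).astype(np.float32)

pca = PCA(n_components=512)
reduced = pca.fit_transform(features)  # frames x 512
print(reduced.shape)  # (10000, 512)
```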

harshithlanka3 commented 7 months ago

Successfully downloaded the SoccerNetV2 features dataset: 2 fps, reduced with PCA to 512 dimensions.
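
For anyone reproducing this, the download goes roughly like this with the SoccerNet pip package (sketched from memory of its README, so treat the file names as assumptions; the local directory is a placeholder):

```python
from SoccerNet.Downloader import SoccerNetDownloader

downloader = SoccerNetDownloader(LocalDirectory="data/SoccerNet")

# PCA512-reduced ResNet features at 2 fps for both halves of each game,
# plus the action-spotting labels.
downloader.downloadGames(
    files=["1_ResNET_TF2_PCA512.npy", "2_ResNET_TF2_PCA512.npy",
           "Labels-v2.json"],
    split=["train", "valid", "test"],
)
```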

rajveerb commented 7 months ago

@harshithlanka3

It's great that you got it to work.

Can you figure out which stages of the pipeline are performed on the GPU and which ones on the CPU?

rajveerb commented 7 months ago

@harshithlanka3

Link to form for SoccerNet video - https://docs.google.com/forms/d/e/1FAIpQLSfYFqjZNm4IgwGnyJXDPk2Ko_lZcbVtYX73w5lf6din5nxfmA/viewform

If this does not work, let Rajveer know so he can contact Pramod from the DB group.

harshithlanka3 commented 7 months ago

Updated pipeline diagram with CPU vs GPU usage: [pipeline diagram: Stuff-110]

rajveerb commented 7 months ago

@harshithlanka3

All we need now is the time spent by each of these boxes and the compute resource used, i.e., CPU or GPU.
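
One way to get both numbers at once is to wrap each box in a torch.profiler record_function (a sketch; the stage names and stand-in ops are placeholders for the real pipeline):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("decode"):
        frames = torch.randn(16, 3, 224, 224)  # stand-in for video decoding
    with record_function("feature_extraction"):
        feats = frames.cuda().flatten(1)       # stand-in for the CNN (GPU)
    with record_function("pca512"):
        reduced = feats[:, :512]               # stand-in for PCA reduction

# Per-stage CPU and CUDA time in one table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```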