rajveerb / ml-pipeline-benchmark


Benchmarking videos #2

Open rajveerb opened 10 months ago

rajveerb commented 10 months ago

Similar to the benchmarks created for image classification, we would like to create a benchmark for video datasets.

A few tasks for tackling video datasets:

  1. Identify ML training tasks related to video datasets, analogous to image classification for image datasets.
  2. Identify video datasets commonly used for the tasks in 1.
  3. Identify ML models used for the above tasks; ideally, find smaller models.
  4. Identify preprocessing operations applied to video datasets; for instance, images use decoding, RandomResizedCrop, RandomHorizontalFlip, ToTensor, and Normalize. For images, @rajveerb referred to MLPerf's example. Maybe look at research papers as well.
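
For concreteness, the image-side pipeline in 4 might look like this in torchvision (a minimal sketch; the file name and the ImageNet normalization constants are placeholders, not values from our benchmark):

```python
# Sketch of the image preprocessing chain from item 4, using torchvision.
from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.RandomResizedCrop(224),    # random crop, resized to 224x224
    T.RandomHorizontalFlip(),    # flip with probability 0.5
    T.ToTensor(),                # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet stats
                std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg").convert("RGB")  # decoding step
tensor = preprocess(img)                        # (3, 224, 224) float tensor
```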
harshithlanka3 commented 9 months ago

My research logs regarding the above issues so far: https://docs.google.com/document/d/10rrR1QECqUdl7mIzCgBa5y--bY1zZwlMnJRYzyXAVkw/edit?usp=sharing

rajveerb commented 9 months ago

@harshithlanka3

Great! Can you add them here, please?

harshithlanka3 commented 9 months ago

Should I just add them as comments directly here?

rajveerb commented 9 months ago

@harshithlanka3

yes

harshithlanka3 commented 9 months ago
  1. Identify ML training tasks related to video datasets, analogous to image classification for image datasets.

Video Classification: attributing a label to a given video, e.g. the sport being played.

Action/Gesture Recognition: recognizing human actions/gestures within a given video.

Event Detection: detecting specific events in a video, e.g. accidents on a road or unusual activity in a crowd.

Scene Classification: categorizing videos based on the scene/environment in the video: indoor scenes, outdoor scenes, urban landscapes, etc.

Object Detection/Tracking: locating objects of interest in a video and tracking those objects.

Overall, there are a large number of possible training tasks related to video datasets.

  2. Identify video datasets commonly used for the tasks in 1.

Found this GitHub page listing relevant datasets for different kinds of video-based machine learning models:

https://github.com/xiaobai1217/Awesome-Video-Datasets

Other large/well-known video datasets:

UCF101: 13,320 videos and 101 action classes; good for action recognition.

HMDB51: 6,849 videos and 51 action classes.

Kinetics: 400,000 videos and 600 action classes.

YouTube-8M: YouTube video URLs and 4,716 vocabulary classes; general classification tasks.

Sports-1M: 1 million videos from 487 sports classes; good for video classification.
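
As a side note, UCF101 ships with torchvision, so a loader is cheap to try (a sketch; the paths below are placeholders, and the videos plus the official split files have to be downloaded separately):

```python
# Minimal sketch of loading UCF101 clips with torchvision.
from torchvision.datasets import UCF101

dataset = UCF101(
    root="data/UCF-101",                      # extracted .avi files
    annotation_path="data/ucfTrainTestlist",  # official train/test split files
    frames_per_clip=16,                       # clip length in frames
    step_between_clips=16,                    # non-overlapping clips
    train=True,
)

video, audio, label = dataset[0]  # video: (T, H, W, C) uint8 tensor
print(video.shape, label)
```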

  3. Identify ML models used for the above tasks; ideally, find smaller models.

Video Classification: CNN-LSTM models use Convolutional Neural Networks (CNNs) for spatial features and Long Short-Term Memory (LSTM) networks for temporal features (might be too big for our use case). 3D CNNs: models like C3D (Convolutional 3D) or R(2+1)D.

Action/Gesture Recognition: I3D (Inflated 3D ConvNet) for action recognition in videos; smaller variants of I3D can be used for faster inference. Temporal Convolutional Networks (TCNs) are lightweight and can be used for gesture recognition.

Event Detection: Two-Stream Networks combine two CNN streams (one spatial, one for optical flow) and fuse their features to detect events; smaller versions of the CNNs can be used here. Single Shot MultiBox Detector (SSD): for detecting events like accidents.

Object Detection/Tracking: YOLO (You Only Look Once) models such as YOLOv3-tiny or YOLOv4-tiny. SORT (Simple Online and Realtime Tracking) is a simple yet effective choice for tracking, but it needs to be paired with a detector model.
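
Of the smaller options above, R(2+1)D-18 is the easiest to try since torchvision ships it with Kinetics-400 pretrained weights (a minimal sketch; the random clip is a stand-in for real data):

```python
import torch
from torchvision.models.video import r2plus1d_18

model = r2plus1d_18(pretrained=True).eval()  # Kinetics-400 weights

# Dummy clip: batch of 1, 3 channels, 16 frames, 112x112 spatial size.
clip = torch.randn(1, 3, 16, 112, 112)
with torch.no_grad():
    logits = model(clip)  # (1, 400) Kinetics-400 class logits
print(logits.argmax(dim=1))
```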

Identify preprocessing operations applied to video datasets; for instance, images use decoding, RandomResizedCrop, RandomHorizontalFlip, ToTensor, and Normalize. For images, @rajveerb referred to MLPerf's example. Maybe look at research papers as well.

Some papers that I need to read over/get to:

Learning Spatiotemporal Features with 3D Convolutional Networks

Beyond Short Snippets: Deep Networks for Video Classification

Unsupervised Learning of Video Representations using LSTMs

Simple Online and Realtime Tracking

In general, what I have noticed for preprocessing so far: video loading, frame resizing, temporal sampling (for certain tasks, not all), data augmentation (flipping, rotating, etc.), normalization, and batching.
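
Strung together, those steps might look roughly like this with torchvision (a sketch; the file name, sampling stride, and normalization stats are assumptions):

```python
import torch
import torchvision.transforms as T
from torchvision.io import read_video

# Video loading (decode to frames).
frames, _, _ = read_video("video.mp4", pts_unit="sec")  # (T, H, W, C) uint8

# Temporal sampling: keep every 4th frame.
frames = frames[::4]

# Frame resizing, augmentation, and normalization.
frames = frames.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W)
transform = T.Compose([
    T.Resize((128, 171)),
    T.RandomHorizontalFlip(),  # same flip applied to the whole clip
    T.Normalize(mean=[0.45, 0.45, 0.45], std=[0.225, 0.225, 0.225]),
])
frames = transform(frames)

# Batching: (N, C, T, H, W) layout for 3D CNNs.
batch = frames.permute(1, 0, 2, 3).unsqueeze(0)
print(batch.shape)
```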

rajveerb commented 9 months ago

From Harshith's research logs:

Preprocessing on C3D:
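
Roughly, per the C3D paper, frames are resized to 128x171, 16-frame clips are sampled, and random 112x112 spatial crops are taken. A sketch (assuming `frames` is a (T, C, H, W) float tensor):

```python
import torch
import torchvision.transforms as T

def c3d_clip(frames: torch.Tensor, clip_len: int = 16) -> torch.Tensor:
    frames = T.Resize((128, 171))(frames)  # per-paper frame resize
    # Temporal crop: random 16-frame window.
    start = torch.randint(0, frames.shape[0] - clip_len + 1, (1,)).item()
    clip = frames[start:start + clip_len]
    # Random 112x112 spatial crop.
    i = torch.randint(0, 128 - 112 + 1, (1,)).item()
    j = torch.randint(0, 171 - 112 + 1, (1,)).item()
    clip = clip[:, :, i:i + 112, j:j + 112]
    return clip.permute(1, 0, 2, 3)  # (C, T, H, W) for 3D convolutions
```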

Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark:

harshithlanka3 commented 8 months ago

I could not really find scene classification datasets for video, as they were all applications of image scene recognition models. I shifted my focus from that to action recognition instead, as I was able to find more relevant papers.

Here are my findings so far on two papers that I read:

HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences

Rolling Rotations for Recognizing Human Actions from 3D Skeletal Data

Code provided: http://ravitejav.weebly.com/rolling.html

Other datasets that should be looked at:

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Sports-1M: Large-scale Video Classification with Convolutional Neural Networks

Kinetics

Papers to read: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

rajveerb commented 8 months ago

The papers mentioned above are related to action recognition, but they are a bit old. I looked at papers citing them and found some recent work on this task with good datasets.

(Dated: 2021) LAEO-Net++: revisiting people Looking At Each Other in videos

This paper describes its preprocessing as data augmentation. It also covers a good set of video datasets.

kexinrong commented 8 months ago

http://activity-net.org/

harshithlanka3 commented 8 months ago

From ActivityNet: found an interactive tool to find papers from CVPR 2022: https://public.tableau.com/views/CVPR2022/Dashboard1?:showVizHome=no

Datasets I saw most often in CVPR 2022: ActivityNet, THUMOS14, UCF101, Kinetics.

This paper seemed to be cited a lot: https://ieeexplore.ieee.org/document/8454294. Augmentations used: random cropping, horizontal flipping, corner cropping, and scale jittering.
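
A sketch of what corner cropping plus scale jittering look like on a single frame (the scale set {1, 0.875, 0.75, 0.66} is the one commonly used in TSN implementations, an assumption here):

```python
import random
import torchvision.transforms.functional as F

SCALES = [1.0, 0.875, 0.75, 0.66]

def tsn_augment(img, out_size=224):
    """Corner cropping + scale jittering + horizontal flip on a PIL image."""
    w, h = img.size
    crop = int(min(w, h) * random.choice(SCALES))  # scale jittering
    # Corner cropping: one of the four corners or the center.
    positions = [(0, 0), (w - crop, 0), (0, h - crop),
                 (w - crop, h - crop), ((w - crop) // 2, (h - crop) // 2)]
    left, top = random.choice(positions)
    img = F.crop(img, top, left, crop, crop)
    if random.random() < 0.5:  # random horizontal flipping
        img = F.hflip(img)
    return F.resize(img, [out_size, out_size])
```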

harshithlanka3 commented 8 months ago

Paper I want to use: https://ieeexplore.ieee.org/document/8454294

Dataset I want to use: UCF101

Implementation found? (still struggling): https://github.com/yjxiong/temporal-segment-networks/tree/master

Found this as an update prior/during Friday meeting: https://github.com/yjxiong/tsn-pytorch/tree/master

harshithlanka3 commented 8 months ago

Notes:

harshithlanka3 commented 8 months ago

Updated GitHub code from the paper we found during our quick meeting on 10-20-2023: https://github.com/SilvioGiancola/SoccerNetv2-DevKit

For reference, here is the paper on CALF (Context Aware Loss Function): https://openaccess.thecvf.com/content_CVPR_2020/papers/Cioppa_A_Context-Aware_Loss_Function_for_Action_Spotting_in_Soccer_Videos_CVPR_2020_paper.pdf

Dataset used: SoccerNet-v2 (https://arxiv.org/pdf/2011.13367.pdf)

Data augmentation used: re-encoding twice on chunks of 2 minutes while looking for 5 actions at a time.


I don't really understand how the dataset is being used: https://github.com/SilvioGiancola/SoccerNetv2-DevKit/tree/main/Download

Source code for the entire model based on just this paper: https://github.com/SilvioGiancola/SoccerNetv2-DevKit/tree/main/Task1-ActionSpotting/CALF

There are newer variants of CALF from 2021 that are also available, but I believe we should decide on a model and just move forward with it. The models I found earlier, which use more traditional approaches to data augmentation, are still an option.

harshithlanka3 commented 8 months ago

Next steps: figure out where decoding is happening; figure out if decoding happens on the CPU or the GPU; actually figure out what the dataset is doing; schedule a meeting for Friday.

harshithlanka3 commented 7 months ago

[pipeline diagram: Stuff-106]

Above is a small diagram of what the pipeline looks like in general, specifying CPU and GPU utilization. The dataset is 500 games, around 90 minutes each, uncut. Video decoding happens during feature extraction and reduction, with a resize and/or crop.

Next steps: install the dataset and run the code (how?); decide whether we want to do feature extraction and reduction every single time; figure out how to make sure certain parts run on the CPU vs the GPU.

rajveerb commented 7 months ago

@harshithlanka3

In the above pipeline, which preprocessing stage's output does the GitHub code for the paper use?

harshithlanka3 commented 7 months ago

It uses the features after the PCA-512 step in the paper.
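
For intuition, that reduction step amounts to something like this (a sketch with random placeholder data; 2048 here is an assumed per-frame ResNet feature size):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder per-frame features: frames x 2048.
features = np.random.randn(10000, 2048).astype(np.float32)

pca = PCA(n_components=512)
reduced = pca.fit_transform(features)  # frames x 512
print(reduced.shape)  # (10000, 512)
```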

harshithlanka3 commented 7 months ago

Successfully downloaded the SoccerNetV2 features dataset: 2 fps, reduced with PCA to 512 dimensions.
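
For anyone reproducing this, the download goes roughly like this with the SoccerNet pip package (sketched from memory of its README, so treat the file names as assumptions; the local directory is a placeholder):

```python
from SoccerNet.Downloader import SoccerNetDownloader

downloader = SoccerNetDownloader(LocalDirectory="data/SoccerNet")

# PCA512-reduced ResNet features at 2 fps for both halves of each game,
# plus the action-spotting labels.
downloader.downloadGames(
    files=["1_ResNET_TF2_PCA512.npy", "2_ResNET_TF2_PCA512.npy",
           "Labels-v2.json"],
    split=["train", "valid", "test"],
)
```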

rajveerb commented 7 months ago

@harshithlanka3

It's great that you got it to work.

Can you figure out which stages of the pipeline are performed on the GPU and which ones on the CPU?

rajveerb commented 7 months ago

@harshithlanka3

Link to form for SoccerNet video - https://docs.google.com/forms/d/e/1FAIpQLSfYFqjZNm4IgwGnyJXDPk2Ko_lZcbVtYX73w5lf6din5nxfmA/viewform

If this does not work, let Rajveer know so he can contact Pramod from the DB group.

harshithlanka3 commented 7 months ago

Updated pipeline diagram with CPU vs GPU usage: [pipeline diagram: Stuff-110]

rajveerb commented 7 months ago

@harshithlanka3

All we need now is the time spent by each of these boxes and the compute resource used, i.e., CPU or GPU.
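
One way to get both numbers at once is to wrap each box in a torch.profiler record_function (a sketch; the stage names and stand-in ops are placeholders for the real pipeline):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("decode"):
        frames = torch.randn(16, 3, 224, 224)  # stand-in for video decoding
    with record_function("feature_extraction"):
        feats = frames.cuda().flatten(1)       # stand-in for the CNN (GPU)
    with record_function("pca512"):
        reduced = feats[:, :512]               # stand-in for PCA reduction

# Per-stage CPU and CUDA time in one table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```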