rajveerb opened this issue 1 year ago
My research logs regarding the above issues so far: https://docs.google.com/document/d/10rrR1QECqUdl7mIzCgBa5y--bY1zZwlMnJRYzyXAVkw/edit?usp=sharing
@harshithlanka3
Great! Can you add them here, please?
Should I just add them as comments directly here?
@harshithlanka3
yes
- Video Classification: attributing a label to a given video (e.g., sports)
- Action/Gesture Recognition: recognizing human actions/gestures within a given video
- Event Detection: detecting specific events in a video (e.g., accidents on a road, unusual activity in a crowd)
- Scene Classification: categorizing videos based on the scene/environment in the video (indoor scenes, outdoor scenes, urban landscapes, etc.)
- Object Detection/Tracking: locating objects of interest in a video and tracking those objects
There are a large number of possible training tasks related to video datasets.
Found this GitHub page listing relevant datasets for different kinds of video-based machine learning models:
https://github.com/xiaobai1217/Awesome-Video-Datasets
Other large/famous video datasets:
- UCF101: 13,320 videos, 101 action classes; good for action recognition
- HMDB51: 6,849 videos, 51 action classes
- Kinetics: 400,000 videos, 600 action classes
- YouTube-8M: YouTube video URLs, 4,716 vocabulary classes; general classification tasks
- Sports-1M: 1 million videos from 487 sports classes; good for video classification
Video Classification:
- CNN-LSTM model: Convolutional Neural Networks (CNNs) for spatial features and Long Short-Term Memory (LSTM) networks for temporal features (might be too big for our use case)
- 3D CNNs: models like C3D (Convolutional 3D) or R(2+1)D

Action/Gesture Recognition:
- I3D (Inflated 3D ConvNet): action recognition in videos; smaller variants of I3D can be used for faster inference
- Temporal Convolutional Networks (TCNs): TCNs are lightweight and can be used for gesture recognition

Event Detection:
- Two-Stream Networks: combine two CNN streams (one spatial, one for optical flow) and fuse their features to detect events; smaller versions of the CNNs can be used here
- Single Shot MultiBox Detector (SSD): for detecting events like accidents

Object Detection/Tracking:
- YOLO (You Only Look Once): lightweight YOLO variants such as YOLOv3-tiny or YOLOv4-tiny
- SORT (Simple Online and Realtime Tracking): a simple yet effective choice for object tracking; needs to be paired with a detector model
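As a concrete starting point, here is a minimal sketch of forwarding a dummy clip through one of the smaller candidates above, R(2+1)D, via torchvision; the clip size and `weights=None` choice are illustrative assumptions, not a recommendation.

```python
# Minimal sketch: run one dummy clip through torchvision's R(2+1)D-18.
# The 16-frame 112x112 clip and weights=None are illustrative assumptions.
import torch
from torchvision.models.video import r2plus1d_18

model = r2plus1d_18(weights=None)        # pretrained Kinetics weights optional
model.eval()

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, H, W)
with torch.no_grad():
    logits = model(clip)                 # (1, 400) Kinetics-400 class scores
print(logits.shape)
```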
Identify preprocessing operations applied to video datasets. For instance, image pipelines use decoding, RandomResizedCrop, RandomHorizontalFlip, ToTensor, and Normalize; for images, @rajveerb referred to MLPerf's example. Maybe look at research papers as well.
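For reference, a minimal sketch of that image-style pipeline in torchvision; the 224 crop size and the ImageNet mean/std values are the usual defaults, assumed here.

```python
# Sketch of the image preprocessing ops named above, using torchvision.
# The 224 crop and ImageNet mean/std are common defaults, assumed here.
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224),        # random crop, resized to 224x224
    T.RandomHorizontalFlip(),        # flip with probability 0.5
    T.ToTensor(),                    # PIL/uint8 HWC -> float CHW in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```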
Some papers that I need to read over/get to:
- Learning Spatiotemporal Features with 3D Convolutional Networks
- Beyond Short Snippets: Deep Networks for Video Classification
- Unsupervised Learning of Video Representations using LSTMs
- Simple Online and Realtime Tracking
In general, what I have noticed for preprocessing so far (sketched below):
- Video loading
- Frame resizing
- Temporal sampling (for certain tasks, not all)
- Data augmentation (flipping, rotating, etc.)
- Normalization
- Batching
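A minimal sketch of those steps, assuming torchvision for decoding; the frame count, crop size, and normalization values are illustrative assumptions.

```python
# Sketch of the generic video preprocessing steps listed above.
# num_frames, size, and the mean/std values are illustrative assumptions.
import random
import torch
import torchvision.io as io
import torchvision.transforms as T

def preprocess_clip(path, num_frames=16, size=112):
    # Video loading: decode frames into a (T, H, W, C) uint8 tensor
    frames, _, _ = io.read_video(path, pts_unit="sec")
    # Temporal sampling: num_frames evenly spaced frames across the video
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).long()
    clip = frames[idx].permute(0, 3, 1, 2).float() / 255.0   # (T, C, H, W)
    # Frame resizing + normalization, applied to every frame
    per_frame = T.Compose([
        T.Resize(size),
        T.CenterCrop(size),
        T.Normalize(mean=[0.45, 0.45, 0.45], std=[0.225, 0.225, 0.225]),
    ])
    clip = torch.stack([per_frame(f) for f in clip])
    # Data augmentation: flip the whole clip consistently, not per frame
    if random.random() < 0.5:
        clip = torch.flip(clip, dims=[3])                    # horizontal flip
    return clip          # batching would then happen in a DataLoader
```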
From Harshith's research logs:
Preprocessing on C3D:
Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark:
I could not really find scene classification datasets specific to video, as they were all applications of image scene recognition models. I shifted my focus to action recognition instead, as I was able to find more relevant papers:
Here are my findings so far on two papers that I read:
HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
Datasets used:
Processing used:
Rolling Rotations for Recognizing Human Actions from 3D Skeletal Data
Code provided: http://ravitejav.weebly.com/rolling.html
Other datasets that should be looked at:
- UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
- Sports-1M: Large-scale Video Classification with Convolutional Neural Networks
- Kinetics
Papers to read:
- Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
- Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
The papers mentioned above relate to action recognition, but they are a bit old. I looked at papers citing them and found some recent work on this task with good datasets.
(Dated: 2021) LAEO-Net++: revisiting people Looking At Each Other in videos
This paper mentions data augmentation as its preprocessing. It also references a good set of video datasets.
From ActivityNet: found an interactive tool to find papers from CVPR 2022: https://public.tableau.com/views/CVPR2022/Dashboard1?:showVizHome=no
Datasets I saw most often in CVPR 2022: ActivityNet, THUMOS14, UCF101, Kinetics
This paper seemed to be cited a lot: https://ieeexplore.ieee.org/document/8454294. Preprocessing used: random cropping, horizontal flipping, corner cropping, scale jittering (sketched below).
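A minimal sketch of the corner-cropping and scale-jittering augmentations named above, assuming PIL; the scale ratios and the five crop positions follow the paper's general description, not the repo's exact code.

```python
# Sketch of corner cropping + scale jittering + horizontal flipping.
# SCALES and the five crop positions are assumptions based on the paper.
import random
from PIL import Image

SCALES = [1.0, 0.875, 0.75, 0.66]                     # scale-jittering ratios

def corner_scale_crop(img: Image.Image, out_size: int = 224) -> Image.Image:
    w, h = img.size
    crop = int(min(w, h) * random.choice(SCALES))     # scale jittering
    corners = [(0, 0), (w - crop, 0), (0, h - crop), (w - crop, h - crop),
               ((w - crop) // 2, (h - crop) // 2)]    # 4 corners + center
    x, y = random.choice(corners)                     # corner cropping
    patch = img.crop((x, y, x + crop, y + crop))
    if random.random() < 0.5:                         # horizontal flipping
        patch = patch.transpose(Image.FLIP_LEFT_RIGHT)
    return patch.resize((out_size, out_size), Image.BILINEAR)
```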
Paper I want to use: https://ieeexplore.ieee.org/document/8454294
Dataset I want to use: UCF101
Implementation found? (Still struggling): https://github.com/yjxiong/temporal-segment-networks/tree/master
Found this as an update prior to/during the Friday meeting: https://github.com/yjxiong/tsn-pytorch/tree/master
Notes:
Updated GitHub code: https://github.com/SilvioGiancola/SoccerNetv2-DevKit, from the paper we found in our quick meeting on 10-20-2023:
For reference, here is the paper on CALF (Context Aware Loss Function): https://openaccess.thecvf.com/content_CVPR_2020/papers/Cioppa_A_Context-Aware_Loss_Function_for_Action_Spotting_in_Soccer_Videos_CVPR_2020_paper.pdf
Dataset used: SoccerNet-v2 (https://arxiv.org/pdf/2011.13367.pdf)
Data augmentation used: re-encoding twice on chunks of 2 minutes while looking for 5 actions at a time (a rough sketch of the chunking idea follows).
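A minimal sketch of what "chunks of 2 minutes" could look like over per-frame features; the fps value and the random-chunk sampling are assumptions for illustration, not the CALF repo's exact logic.

```python
# Sketch: sampling a 2-minute chunk from a game's per-frame feature matrix.
# fps=2 and the random sampling are assumptions, not the repo's code.
import numpy as np

rng = np.random.default_rng(0)
fps = 2                                   # features per second (assumed)
chunk_len = 120 * fps                     # 2 minutes of features = 240 rows
features = rng.standard_normal((90 * 60 * fps, 512))  # one ~90-minute game

start = rng.integers(0, len(features) - chunk_len)    # random chunk start
chunk = features[start:start + chunk_len]
print(chunk.shape)                        # (240, 512)
```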
I don't really understand how the dataset is being used: https://github.com/SilvioGiancola/SoccerNetv2-DevKit/tree/main/Download
Source code for the entire model based on just this paper: https://github.com/SilvioGiancola/SoccerNetv2-DevKit/tree/main/Task1-ActionSpotting/CALF
There are newer variants of CALF from 2021 that are also available, but I believe we should decide on a model and just move forward with it. The models I found earlier, which use more traditional approaches to data augmentation, are still an option.
Next steps:
- Figure out where decoding is happening
- Figure out if decoding is happening on CPU or GPU
- Actually figure out what the dataset is doing
- Schedule meeting for Friday
Above is a small diagram of what the pipeline looks like in general, specifying CPU and GPU utilization.
- The dataset is 500 games, around 90 minutes each, uncut.
- Video decoding happens in feature extraction and reduction, with resize and/or crop.
Next steps:
- Install dataset and run code: how?
- Do we want to do feature extraction and reduction every single time?
- How can I make sure certain parts run on CPU vs GPU? (see the sketch below)
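One way to answer the CPU-vs-GPU question, as a minimal PyTorch sketch; the tensors and the linear layer here are placeholders, not the repo's actual data or model.

```python
# Sketch: checking where tensors and model parameters live in PyTorch.
import torch

x = torch.randn(4, 512)                 # placeholder feature batch
print(x.device)                         # cpu

if torch.cuda.is_available():
    x = x.to("cuda")                    # explicitly move work to the GPU
    print(x.device)                     # cuda:0

# For a model, inspect its parameters' device the same way:
model = torch.nn.Linear(512, 17)        # placeholder layer
print(next(model.parameters()).device)  # cpu until model.to("cuda") is called
```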
@harshithlanka3
In the above pipeline, which stage's output of the preprocessing does the GitHub code for the paper use?
It uses the features after the PCA-512 step in the paper.
Successfully downloaded the SoccerNetV2 features dataset (2 fps, reduced with PCA to 512 dimensions):
To run the 'pre-processing' part of the pipeline, we need access to the actual videos, which does require signing the NDA as far as I know.
Command used:
```
python src/main.py --SoccerNet_path=/nethome/hlanka3/SoccerNetv2-Features \
    --features=ResNET_TF2_PCA512.npy \
    --num_features=512 \
    --model_name=CALF_v2 \
    --batch_size 32 \
    --evaluation_frequency 20 \
    --chunks_per_epoch 18000
```
The .npy files are in fact just the reduced features for every single frame.
The temporal aspect/video layer is maintained by manually keeping track of the time that has passed, based on the fps of the original video. Essentially, training on the dataset works by associating each action's timestamp with the frame number it belongs to. The time for each action is given in Labels-v2.json.
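A minimal sketch of that timestamp-to-frame-index mapping; the JSON field names follow the SoccerNet-v2 label format as I understand it (an assumption), and FEATURE_FPS=2 matches the downloaded PCA-512 features.

```python
# Sketch: map action timestamps from Labels-v2.json onto rows of the .npy
# feature matrix. Field names are assumed from the SoccerNet-v2 label format.
import json
import numpy as np

FEATURE_FPS = 2                                  # downloaded features are 2 fps

features = np.load("1_ResNET_TF2_PCA512.npy")    # shape: (num_frames, 512)
with open("Labels-v2.json") as f:
    labels = json.load(f)

for ann in labels["annotations"]:
    # gameTime looks like "1 - 12:34" (half number, then MM:SS)
    half, clock = ann["gameTime"].split(" - ")
    minutes, seconds = map(int, clock.split(":"))
    frame_idx = (minutes * 60 + seconds) * FEATURE_FPS  # feature row index
    if frame_idx < len(features):
        print(ann["label"], half, frame_idx)
```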
Had to use a newer version of CUDA compared to what the GitHub repo suggested.
Had to import charset as an additional module for some reason.
@harshithlanka3
It's great that you got it to work.
Can you figure out which stages of the pipeline are performed on the GPU and which ones on the CPU?
@harshithlanka3
Link to form for SoccerNet video - https://docs.google.com/forms/d/e/1FAIpQLSfYFqjZNm4IgwGnyJXDPk2Ko_lZcbVtYX73w5lf6din5nxfmA/viewform
If this does not work, let Rajveer know so he can contact Pramod from the DB group.
Updated pipeline diagram with CPU vs GPU usage
@harshithlanka3
All we need now is the time spent by each of these boxes and the compute resource used, i.e., CPU or GPU.
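A minimal sketch of how one might time each box; `stage_fn` is a placeholder for whichever stage of the pipeline is being measured, not a function in the repo, and the CUDA synchronization is needed so GPU work is actually counted.

```python
# Sketch: wall-clock timing of a pipeline stage, with GPU synchronization
# so asynchronous CUDA kernels are included in the measurement.
import time
import torch

def timed(name, stage_fn, *args, **kwargs):
    if torch.cuda.is_available():
        torch.cuda.synchronize()        # flush pending GPU work before timing
    start = time.perf_counter()
    out = stage_fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()        # wait for kernels launched by stage_fn
    print(f"{name}: {time.perf_counter() - start:.3f} s")
    return out
```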
Similar to the benchmarks created for image classification, we would like to create a benchmark for video datasets.
A few tasks for tackling video datasets: