Object detection as features #6

pakoromilas commented 3 years ago

Object detection:

- Detects objects of 80 categories or 12 super-categories by using NVIDIA's SSD model (https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Detection/SSD).
- Smooths the confidences across a specified number of frames and keeps the objects that fulfil the confidence criteria. The output bounding box for each object is the smallest possible.
- Extracts object-related statistical features, such as label frequency, average confidence and area ratio. The output features are then stored in a CSV file. The features can be extracted for:
  1. the 80 categories
  2. the 12 super-categories
  3. both categories and super-categories.
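
The smoothing and statistics steps listed above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the function names, the trailing moving average, the tuple layout of a detection, and the definition of label frequency as detections per frame are all assumptions made for the example.

```python
from collections import defaultdict


def smooth_confidences(confidences, window=3):
    """Trailing moving average of one object's per-frame confidences.

    Assumption: "smoothing across a specified number of frames" is
    modelled here as a plain moving average over the last `window` frames.
    """
    smoothed = []
    for i in range(len(confidences)):
        recent = confidences[max(0, i - window + 1):i + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed


def object_statistics(detections_per_frame, frame_area):
    """Aggregate per-label statistics over a whole clip.

    detections_per_frame: one list per frame, each holding
        (label, confidence, (x1, y1, x2, y2)) tuples (hypothetical layout).
    Returns {label: {"frequency", "avg_confidence", "avg_area_ratio"}}.
    """
    n_frames = len(detections_per_frame)
    counts = defaultdict(int)
    conf_sums = defaultdict(float)
    area_sums = defaultdict(float)

    for detections in detections_per_frame:
        for label, conf, (x1, y1, x2, y2) in detections:
            counts[label] += 1
            conf_sums[label] += conf
            # Box area as a fraction of the frame area.
            area_sums[label] += (x2 - x1) * (y2 - y1) / frame_area

    return {
        label: {
            # Assumed definition: detections of this label per frame.
            "frequency": counts[label] / n_frames,
            "avg_confidence": conf_sums[label] / counts[label],
            "avg_area_ratio": area_sums[label] / counts[label],
        }
        for label in counts
    }
```

Thresholding the smoothed confidences (rather than the raw ones) is one way to keep only objects that are detected consistently over several frames.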

Code Refactoring:

Many functions for frame analysis were moved to a new file (utils.py).
It is now possible to run the code without displaying any windows and videos, by using the online_display flag.

tyiannak commented 3 years ago

Getting this error

raise RuntimeError('Attempting to deserialize object on a CUDA '

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
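
The fix suggested by the error message can be sketched as below. This is a minimal sketch, assuming the checkpoint is loaded through a direct `torch.load` call somewhere in the project; the helper name `load_checkpoint` and the path argument are hypothetical and not taken from this repository.

```python
import torch


def load_checkpoint(path):
    """Load a checkpoint onto the GPU when one is available, else the CPU.

    Passing map_location remaps storages that were saved from a CUDA
    device, which avoids the deserialization error on CPU-only machines.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.load(path, map_location=device)
```

If the weights are fetched indirectly (for example through a hub helper), the equivalent `map_location` argument would have to be passed wherever that helper calls `torch.load`.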

tyiannak commented 3 years ago

why is the frame area fixed (frame_area = 300 * 300 in detection_utils)?
also please add the object-related features to both the final feature matrix and the feature statistics (so we need to keep both the frame-level object features and the final statistics)
please add the feature names to process_video (as a fourth returned variable). This should be a list of strings of the same length as the feature arrays

pakoromilas commented 3 years ago

1. why is the frame area fixed (frame_area = 300 * 300 in detection_utils)?

2. also please add the object-related features to both the final feature matrix and the feature statistics (so we need to keep both the frame-level object features and the final statistics)

3. please add the feature names to process_video (as a fourth returned variable). This should be a list of strings of the same length as the feature arrays

The NVIDIA SSD model works only with frames of this size (300 x 300). Every time I work with it, I transform the frame and do the necessary calculations.
I'm a bit confused here. The object features are represented through statistics across frames, so it is reasonable to include them in the feature_stats vector. The feature matrix, on the other hand, holds features for every frame. To include object features in the feature matrix, I will probably have to introduce a new calculation for every frame. For example: at the fifth frame there are 2 persons with an average confidence of 0.9 and a box area ratio of 0.6. Is this the right way to add these features to the feature matrix?
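
On the fixed frame area: a box's area ratio does not change when the frame is resized, because the box area and the frame area are scaled by the same factor. The sketch below illustrates this with hypothetical helper names (`box_area_ratio`, `rescale_box`); it assumes boxes are given as pixel corners `(x1, y1, x2, y2)` and is not the code in detection_utils.

```python
SSD_INPUT_SIZE = 300  # the model's input is 300 x 300, as noted above


def box_area_ratio(box, frame_width=SSD_INPUT_SIZE, frame_height=SSD_INPUT_SIZE):
    """Fraction of the frame covered by a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1) / (frame_width * frame_height)


def rescale_box(box, src_size, dst_size):
    """Map a box from a (width, height) frame to another frame size."""
    x1, y1, x2, y2 = box
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    return (
        x1 * dst_w / src_w,
        y1 * dst_h / src_h,
        x2 * dst_w / src_w,
        y2 * dst_h / src_h,
    )


# A box on the 300 x 300 model input and the same box on a 1280 x 720 frame.
box_300 = (30, 60, 180, 210)
box_hd = rescale_box(box_300, (300, 300), (1280, 720))
ratio_300 = box_area_ratio(box_300)
ratio_hd = box_area_ratio(box_hd, 1280, 720)
```

Both ratios come out equal, so computing the ratio against 300 * 300 is consistent as long as the boxes are also expressed in the resized frame's coordinates.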

tyiannak commented 3 years ago

ok
yes, let's add a new per-frame calculation as you described it. For the time being let's keep it aggregated per frame (i.e. if there are two faces, you aggregate the confidences and the areas as their per-frame averages; the count is still a number, of course).
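
The per-frame aggregation agreed on above, together with the requested feature names, could look roughly like this. It is only a sketch under assumptions: the function name `frame_object_features`, the detection tuple layout, and the feature name suffixes are invented for illustration and are not the merged implementation.

```python
def frame_object_features(detections, labels, frame_area):
    """Build one frame's object features and the matching feature names.

    detections: list of (label, confidence, (x1, y1, x2, y2)) tuples
        for a single frame (hypothetical layout).
    labels: the categories to report, in a fixed order.
    Returns (features, feature_names), two lists of equal length.
    """
    features, names = [], []
    for label in labels:
        matches = [(conf, box) for lab, conf, box in detections if lab == label]
        count = len(matches)
        if count:
            # Several objects of the same label are averaged per frame.
            avg_conf = sum(conf for conf, _ in matches) / count
            avg_area = sum(
                (x2 - x1) * (y2 - y1) / frame_area
                for _, (x1, y1, x2, y2) in matches
            ) / count
        else:
            avg_conf = 0.0
            avg_area = 0.0
        features += [float(count), avg_conf, avg_area]
        names += [
            f"{label}_count",
            f"{label}_avg_confidence",
            f"{label}_avg_area_ratio",
        ]
    return features, names


# The example from the discussion: two persons in one frame.
frame = [
    ("person", 0.95, (0, 0, 300, 180)),
    ("person", 0.85, (0, 0, 180, 300)),
]
features, names = frame_object_features(frame, ["person", "car"], 300 * 300)
```

Keeping a fixed label order means every frame yields a vector of the same length, so the rows can be stacked into the feature matrix and the names list stays aligned with its columns.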

pakoromilas commented 3 years ago

I made the proposed changes. Please check if everything works fine.

tyiannak commented 3 years ago

great, approving and merging @lobracost

tyiannak / multimodal_movie_analysis
