tyiannak / multimodal_movie_analysis

A Python Library for Multimodal Analysis of Movies and Content-based Movie Recommendation

Object Detection As Features #6

Closed tyiannak closed 3 years ago

tyiannak commented 3 years ago

Description:

tyiannak commented 3 years ago

@apoman38 @theopsall please comment your ideas

tyiannak commented 3 years ago

More detailed description after today's call with @lobracost

  1. Create an image-based function that detects the bounding boxes of objects, with their labels and confidences, per image (already done, I think).
  2. Use 1 at the frame level in the main Python file and display the bounding boxes and object labels in a new OpenCV window. Note: the object detection image has to be at a predefined size (300x300?), which can differ from the general image size used in the main processing function. So we need a postprocessing step that maps the bounding boxes from the 300x300 image to the standard image size used in the rest of the code.
  3. Create a new function (probably in a new .py file called post_processing or aggregate_results) that takes as input a sequence of [(bounding_box_coordinates_and_sizes, frames, object_labels, confidence)] and generates a new, "smoothed" sequence (e.g. smooth the confidences per object and then discard low confidences or cases where an object appears for less than, say, 2 seconds).
  4. Run the code on a set of images, collect the results in a spreadsheet, and decide on an object taxonomy (some data investigation is needed here by @tyiannak as well). This needs 1 and 2; 3 is not necessary but could be done as well.
  5. Replace objects with object "topics" as defined in 4.
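
The postprocessing step mentioned in item 2 can be sketched as below. This is only an illustration: the function name `rescale_boxes`, the `(x_min, y_min, x_max, y_max)` pixel layout, and the example frame size are assumptions, not the repository's code.

```python
def rescale_boxes(boxes, det_size=(300, 300), frame_size=(640, 360)):
    """Map boxes from detector-input pixels to frame pixels.

    boxes: iterable of (x_min, y_min, x_max, y_max) in detector pixels
           (assumed layout).
    det_size, frame_size: (width, height) tuples.
    """
    scale_x = frame_size[0] / det_size[0]
    scale_y = frame_size[1] / det_size[1]
    return [
        (
            round(x_min * scale_x),
            round(y_min * scale_y),
            round(x_max * scale_x),
            round(y_max * scale_y),
        )
        for x_min, y_min, x_max, y_max in boxes
    ]


# One box detected on the 300x300 input, mapped to a 600x450 frame.
boxes_frame = rescale_boxes([(30, 60, 150, 240)], frame_size=(600, 450))
```

The horizontal and vertical scale factors are kept separate because resizing a frame to a square detector input usually changes its aspect ratio.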

@lobracost let me know if I forgot something in this draft plan

pakoromilas commented 3 years ago

Introduced a new folder named object_detection, which will contain everything related to object detection. Created a class whose instances are our SSD models, and wrote methods for detection and plotting. The plots are compatible with OpenCV.

pakoromilas commented 3 years ago

No progress today. I tried to implement online object detection but ran into some problems. Most of them are solved, but it seems that the neural net can't handle some black frames that occur at shot changes (especially at the beginning or the end of the videos).

tyiannak commented 3 years ago

> No progress today. I tried to implement online object detection but ran into some problems. Most of them are solved, but it seems that the neural net can't handle some black frames that occur at shot changes (especially at the beginning or the end of the videos).

Do you mean it crashes, or that it does not find objects during shot transitions? The latter would not be much of an issue...

pakoromilas commented 3 years ago

>> No progress today. I tried to implement online object detection but ran into some problems. Most of them are solved, but it seems that the neural net can't handle some black frames that occur at shot changes (especially at the beginning or the end of the videos).
>
> Do you mean it crashes, or that it does not find objects during shot transitions? The latter would not be much of an issue...

It crashes, but I'll try to find the reason and fix it.

pakoromilas commented 3 years ago

Done today:

Online object detection added to video processing.

The problem was that I was using NVIDIA's SSD model from Torch Hub rather than from the git repo. The Torch Hub model wasn't up to date and threw an error when nothing was detected. I solved it by modifying one of the model files (a single if statement was needed) on its first download. Whenever the model is downloaded for the first time, our code modifies this specific file.
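
The kind of guard described here, an early return when no detection survives, can be sketched generically. This does not reproduce the actual change to the SSD model files; the function name `filter_detections`, its arguments, and the threshold value are hypothetical.

```python
def filter_detections(boxes, labels, scores, threshold=0.4):
    """Keep detections with score >= threshold; tolerate empty results."""
    kept = [
        (box, label, score)
        for box, label, score in zip(boxes, labels, scores)
        if score >= threshold
    ]
    if not kept:
        # Nothing detected (e.g. a black frame at a shot change): return
        # empty lists instead of unpacking an empty result, which would raise.
        return [], [], []
    kept_boxes, kept_labels, kept_scores = (list(col) for col in zip(*kept))
    return kept_boxes, kept_labels, kept_scores
```

Without the `if not kept` branch, the final unpacking raises a `ValueError` whenever the frame yields no detections above the threshold.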

For the time being, I don't save the output of the object detection to the feature vector. Do you want to just save the categories and the bboxes to the feature vector, or do you have something else in mind?

tyiannak commented 3 years ago

> Done today:
>
> Online object detection added to video processing.
>
> The problem was that I was using NVIDIA's SSD model from Torch Hub rather than from the git repo. The Torch Hub model wasn't up to date and threw an error when nothing was detected. I solved it by modifying one of the model files (a single if statement was needed) on its first download. Whenever the model is downloaded for the first time, our code modifies this specific file.

That's great

> For the time being, I don't save the output of the object detection to the feature vector. Do you want to just save the categories and the bboxes to the feature vector, or do you have something else in mind?

Do you mean the final feature vector we have in visual_analysis? I would say no, not the bboxes directly. Let's add (in this task or in a new one, whatever you prefer) a new function, say get_object_features_from_objects(), that takes a list of detected objects (bboxes + labels) and returns a set of features. These features will be added to the final vector by calling that function. For a very rough initial version, let's add (a) the number of objects in a set of categories and (b) their average normalized area. This "set of categories" can be hard-coded at the beginning, e.g. vehicle, car, motorbike, and so on. Then we can incrementally add more "groups" as defined in the next task, but for the time being let's add just these two dummy features.
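
A rough sketch of the two dummy features proposed above, per-category object counts and average normalized area. The function name comes from the comment; the signature, the `(label, box)` input layout, and the hard-coded groups are assumptions for illustration.

```python
# Hard-coded groups for a first draft (assumed contents).
CATEGORY_GROUPS = {
    "vehicle": {"car", "motorcycle", "bus", "truck"},
    "person": {"person"},
}


def get_object_features_from_objects(objects, frame_width, frame_height,
                                     groups=CATEGORY_GROUPS):
    """Return count and mean normalized box area for each category group.

    objects: list of (label, (x_min, y_min, x_max, y_max)) in frame pixels.
    """
    frame_area = float(frame_width * frame_height)
    features = {}
    for group, labels in groups.items():
        # Area of each matching box as a fraction of the frame area.
        areas = [
            (x_max - x_min) * (y_max - y_min) / frame_area
            for label, (x_min, y_min, x_max, y_max) in objects
            if label in labels
        ]
        features[group + "_count"] = len(areas)
        features[group + "_mean_area"] = sum(areas) / len(areas) if areas else 0.0
    return features


objects = [("car", (0, 0, 50, 50)), ("bus", (0, 0, 100, 50)), ("dog", (10, 10, 20, 20))]
features = get_object_features_from_objects(objects, 100, 100)
```

Labels outside every group (the dog here) are ignored, and an empty group reports an area of 0.0 so the feature vector keeps a fixed length.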

pakoromilas commented 3 years ago

I made a function that returns 3 object features:

  1. frequency of every label per frame
  2. the average confidence of every object detected
  3. the average area occupied by the labels per frame

@tyiannak since our code can now extract information about 80 objects (including persons), should we keep or remove the haar cascade face detection?

tyiannak commented 3 years ago

> I made a function that returns 3 object features:
>
>   1. frequency of every label per frame
>   2. the average confidence of every object detected
>   3. the average area occupied by the labels per frame
>
> @tyiannak since our code can now extract information about 80 objects (including persons), should we keep or remove the haar cascade face detection?

Does it have both persons and faces as separate types of objects? If it also has faces, then we should remove the Haar-based face features. On the other hand, whether they come from the new object detector or from the Haar face detector, we will need separate "statistics" for faces as final features in the future, since faces are probably the most important factor in differentiating types of shots.

pakoromilas commented 3 years ago

>> I made a function that returns 3 object features:
>>
>>   1. frequency of every label per frame
>>   2. the average confidence of every object detected
>>   3. the average area occupied by the labels per frame
>>
>> @tyiannak since our code can now extract information about 80 objects (including persons), should we keep or remove the haar cascade face detection?
>
> Does it have both persons and faces as separate types of objects? If it also has faces, then we should remove the Haar-based face features. On the other hand, whether they come from the new object detector or from the Haar face detector, we will need separate "statistics" for faces as final features in the future, since faces are probably the most important factor in differentiating types of shots.

Unfortunately, it only detects persons, not faces. I agree that faces are an important factor. At some point we may have to find another classifier, since the Haar cascade only recognises frontal faces, which is a problem.

pakoromilas commented 3 years ago

I grouped some of the categories, based on the coco dataset documentation. The code can now extract and save features for these categories. The categories are:

'person': person

'vehicle': bicycle, car, motorcycle, airplane, bus, train, truck, boat

'outdoor': traffic light, fire hydrant, stop sign, parking meter, bench

'animal': bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe

'accessory': backpack, umbrella, handbag, tie, suitcase

'sports': frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket

'kitchen': bottle, wine glass, cup, fork, knife, spoon, bowl

'food': banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake

'furniture': chair, couch, potted plant, bed, dining table, toilet

'electronic': tv, laptop, mouse, remote, keyboard, cell phone

'appliance': microwave, oven, toaster, sink, refrigerator

'indoor': book, clock, vase, scissors, teddy bear, hair drier, toothbrush
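
One way to hold this grouping in code is a dictionary plus an inverted index from label to group. The group names and members are copied from the list above; the names `COCO_GROUPS`, `LABEL_TO_GROUP` and `group_of` are illustrative and may not match the repository.

```python
COCO_GROUPS = {
    "person": ["person"],
    "vehicle": ["bicycle", "car", "motorcycle", "airplane", "bus", "train",
                "truck", "boat"],
    "outdoor": ["traffic light", "fire hydrant", "stop sign", "parking meter",
                "bench"],
    "animal": ["bird", "cat", "dog", "horse", "sheep", "cow", "elephant",
               "bear", "zebra", "giraffe"],
    "accessory": ["backpack", "umbrella", "handbag", "tie", "suitcase"],
    "sports": ["frisbee", "skis", "snowboard", "sports ball", "kite",
               "baseball bat", "baseball glove", "skateboard", "surfboard",
               "tennis racket"],
    "kitchen": ["bottle", "wine glass", "cup", "fork", "knife", "spoon",
                "bowl"],
    "food": ["banana", "apple", "sandwich", "orange", "broccoli", "carrot",
             "hot dog", "pizza", "donut", "cake"],
    "furniture": ["chair", "couch", "potted plant", "bed", "dining table",
                  "toilet"],
    "electronic": ["tv", "laptop", "mouse", "remote", "keyboard",
                   "cell phone"],
    "appliance": ["microwave", "oven", "toaster", "sink", "refrigerator"],
    "indoor": ["book", "clock", "vase", "scissors", "teddy bear",
               "hair drier", "toothbrush"],
}

# Inverted index: object label -> group name.
LABEL_TO_GROUP = {
    label: group for group, labels in COCO_GROUPS.items() for label in labels
}


def group_of(label):
    """Return the group of a detected label, or None if the label is unknown."""
    return LABEL_TO_GROUP.get(label)
```

Multi-word labels such as "teddy bear" and "bear" stay distinct keys, so each label maps to exactly one group.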

tyiannak commented 3 years ago

Seems OK. @lobracost, are the values of the dict above the complete list of objects detected initially?

pakoromilas commented 3 years ago

> Seems OK. @lobracost, are the values of the dict above the complete list of objects detected initially?

Yes this is the complete list. I mapped 80 categories to 12. If, at any time, you need to take a look at the categories, just open the file category_names.txt, which is under the directory analyze_visual.

tyiannak commented 3 years ago

@lobracost will you submit this task as a PR, or are there any more changes to be done here?

pakoromilas commented 3 years ago

> @lobracost will you submit this task as a PR, or are there any more changes to be done here?

I still have to fix some things in the confidence smoothing. The PR will probably be ready tomorrow.
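
For context, the smoothing discussed in the planning comment (smooth per-object confidences, then drop low confidences and short appearances) might look like the sketch below. It is not the code from the PR: the moving-average filter, the function names, and the default parameters are assumptions. A 2-second minimum would correspond to `min_frames = 2 * fps` at whatever frame rate is processed.

```python
def smooth_confidences(confidences, window=5):
    """Centered moving average; the window shrinks at the sequence edges."""
    half = window // 2
    smoothed = []
    for i in range(len(confidences)):
        lo = max(0, i - half)
        hi = min(len(confidences), i + half + 1)
        smoothed.append(sum(confidences[lo:hi]) / (hi - lo))
    return smoothed


def keep_long_runs(confidences, threshold=0.5, min_frames=3):
    """Return a per-frame presence mask, dropping runs shorter than min_frames."""
    mask = [c >= threshold for c in confidences]
    start = None
    # The extra False acts as a sentinel that closes a run ending at the last frame.
    for i, present in enumerate(mask + [False]):
        if present and start is None:
            start = i
        elif not present and start is not None:
            if i - start < min_frames:
                mask[start:i] = [False] * (i - start)
            start = None
    return mask


# Per-frame confidences of one object label (0.0 where it was not detected).
per_frame = [0.9, 0.1, 0.8, 0.8, 0.8, 0.2]
presence = keep_long_runs(per_frame, threshold=0.5, min_frames=3)
```

In a full pipeline the mask would be computed on the output of `smooth_confidences`, once per object label.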

tyiannak commented 3 years ago

Right, I had forgotten about smoothing again :-)