tryolabs / norfair

Lightweight Python library for adding real-time multi-object tracking to any detector.
https://tryolabs.github.io/norfair/
BSD 3-Clause "New" or "Revised" License

Suggestions to make the processing faster #307

Closed utility-aagrawal closed 4 months ago

utility-aagrawal commented 4 months ago

Hi,

I am using this library to track faces. So far, I have been processing every frame of my videos, and although the results look great, the process is extremely slow. I have tried skipping frames, but that degrades the tracking performance a lot. I have also tried the suggestions from issue #301, but the results don't look good so far. I was wondering if you have any other recommendations for me. Thanks!

aguscas commented 4 months ago

Hello!

Have you tried timing the different parts of your code to identify where the bottleneck might be? From what I understand from your previous issues, you are using several components, such as detectors, embedding models for ReID, motion estimators, etc. If you manage to do that, maybe we can provide better recommendations.

Also, if it is no problem for you, could you provide a (maybe brief) version of your code, so that we know which parts of norfair you are actually using?
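
The timing suggestion above can be sketched roughly as follows. This is only a minimal illustration, not code from this thread: `timed`, `report`, and `stage_totals` are hypothetical helper names, and the `time.sleep` calls are stand-ins for the real detector, embedding model, and `tracker.update()` calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage.
stage_totals = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the elapsed wall-clock time of the enclosed block to stage_totals[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start


def report(totals):
    """Return each stage's share of the total measured time, in percent."""
    total = sum(totals.values())
    if total == 0:
        return {}
    return {stage: 100.0 * seconds / total for stage, seconds in totals.items()}


# Stand-in workloads (assumption): in a real loop these blocks would wrap
# the detector call, the embedding call, and tracker.update().
for _ in range(3):
    with timed("detection"):
        time.sleep(0.002)
    with timed("embedding"):
        time.sleep(0.01)
    with timed("tracking"):
        time.sleep(0.001)

shares = report(stage_totals)
for stage, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {share:.1f}%")
```

Accumulating per-stage totals over many frames, rather than timing a single frame, tends to give steadier percentages, since the first calls to a model often include one-off loading costs.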

utility-aagrawal commented 4 months ago

Thanks for the quick turnaround, @aguscas! Let me time each of my modules and I'll get back to you shortly. And yes, I can share the code with you.

utility-aagrawal commented 4 months ago

Hi @aguscas, I timed the different parts of my code and found that detections and embeddings take the most time. I am using RetinaFace for detections and Facenet512 for embeddings. Currently, embeddings consume ~80% of the total time and detections ~15%.

Here's my code:

from deepface import DeepFace
import numpy as np
from norfair import Detection, Tracker, Video, get_cutout, draw_boxes, draw_points
from norfair.filter import OptimizedKalmanFilterFactory
from norfair.camera_motion import MotionEstimator
from scipy.spatial.distance import cosine


def minimum_embedding_distance(matched_not_init_trackers, unmatched_trackers):
    list_of_snd_embedding = []
    list_of_fst_embedding = []

    # get the embeddings of the unmatched_trackers
    if unmatched_trackers.last_detection.embedding is not None:
        list_of_snd_embedding.append(unmatched_trackers.last_detection.embedding)
    for detection in unmatched_trackers.past_detections:
        if detection.embedding is not None:
            list_of_snd_embedding.append(detection.embedding)

    if len(list_of_snd_embedding) == 0:
        return 1

    # get the embeddings of the matched_not_init_trackers
    if matched_not_init_trackers.last_detection.embedding is not None:
        list_of_fst_embedding.append(matched_not_init_trackers.last_detection.embedding)
    for detection in matched_not_init_trackers.past_detections:
        if detection.embedding is not None:
            list_of_fst_embedding.append(detection.embedding)

    if len(list_of_fst_embedding) == 0:
        return 1

    # compare all the embeddings
    # scipy's cosine() already returns a distance (1 - cosine similarity),
    # so it is used directly instead of 1 - cosine(...)
    distances = []
    for embedding1 in list_of_fst_embedding:
        for embedding2 in list_of_snd_embedding:
            distances.append(cosine(embedding1, embedding2))

    # take the minimum (you could take the average with np.mean instead)
    return np.min(np.array(distances))

def detect_faces(frame):
    """
    Detects faces using retinaface thru DeepFace library
    """
    return ""


def retinaface_detections_to_norfair_detections(retinaface_detections, track_points="bbox"):
    """
    Converts retinaface detections to norfair Detection objects
    """
    return ""

def main(
    embed_model: str = "Facenet512",
    skip_period: int = 1,
    border_size: int = 10,
    track_points="bbox",
):
    video = Video(input_path="", output_path="")

    DISTANCE_THRESHOLD_BBOX: float = 0.5
    DISTANCE_THRESHOLD_CENTROID: int = 30
    MAX_DISTANCE: int = 10000

    distance_function = "iou" if track_points == "bbox" else "euclidean"

    distance_threshold = (
        DISTANCE_THRESHOLD_BBOX
        if track_points == "bbox"
        else DISTANCE_THRESHOLD_CENTROID
    )

    tracker = Tracker(
        initialization_delay=3,
        distance_function=distance_function,
        hit_counter_max=10,
        filter_factory=OptimizedKalmanFilterFactory(),
        distance_threshold=distance_threshold,
        past_detections_length=5,
        reid_distance_function=minimum_embedding_distance,
        reid_distance_threshold=0.5,
        reid_hit_counter_max=np.inf,
    )
    motion_estimator = MotionEstimator()

    for i, cv2_frame in enumerate(video):
        if i % skip_period == 0:
            retinaface_detections = detect_faces(cv2_frame)
            detections = retinaface_detections_to_norfair_detections(
                retinaface_detections, track_points=track_points
            )

            frame = cv2_frame.copy()

            # here I am generating the mask from the detections (you can also use the tracked_object if you want)
            mask = np.ones(frame.shape[:2], frame.dtype)
            for d in detections:
                bbox = d.points.astype(int)
                mask[bbox[0, 1] : bbox[1, 1], bbox[0, 0] : bbox[1, 0]] = 0

            # here I am passing that mask to the motion estimator
            coord_transformation = motion_estimator.update(frame)  # , mask)

            for detection in detections:
                cut = get_cutout(detection.points, frame)
                if cut.shape[0] > 0 and cut.shape[1] > 0:
                    detection.embedding = DeepFace.represent(
                        img_path=cut,
                        model_name=embed_model,
                        enforce_detection=False,
                        detector_backend="retinaface",
                    )[0]["embedding"]
                else:
                    detection.embedding = None

            tracked_objects = tracker.update(
                detections=detections,
                period=skip_period,
                coord_transformations=coord_transformation,
            )

        else:
            tracked_objects = tracker.update()

        if track_points == "bbox":
            draw_boxes(cv2_frame, tracked_objects, draw_ids=True)
        else:
            draw_points(cv2_frame, tracked_objects)
        frame_with_border = np.ones(
            shape=(
                cv2_frame.shape[0] + 2 * border_size,
                cv2_frame.shape[1] + 2 * border_size,
                cv2_frame.shape[2],
            ),
            dtype=cv2_frame.dtype,
        )
        frame_with_border *= 254
        frame_with_border[
            border_size:-border_size, border_size:-border_size
        ] = cv2_frame
        video.write(frame_with_border)


if __name__ == "__main__":
    main()

Since this doesn't look like a norfair issue, I am not sure if you could help. If you still have any suggestions, let me know. Thanks!

aguscas commented 4 months ago

Sorry for my late response. Given that 95% of your time is consumed by your models (80% embeddings + 15% detection) rather than by norfair, there is not much I can suggest other than changing your models (especially the embedding model), though that can come at a cost in overall quality (accuracy, etc.). Also, are you running your models on a GPU or on a CPU?
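
To make the reasoning above concrete, here is a small arithmetic sketch using Amdahl's law with the shares reported in this thread (80% embeddings, 15% detection, and therefore roughly 5% for everything else). The function name `overall_speedup` and the speedup factors tried are illustrative assumptions, not measurements.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: overall speedup when `fraction` of the runtime becomes
    `factor` times faster and the rest is unchanged."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Shares of total runtime reported in this thread.
EMBEDDING, DETECTION = 0.80, 0.15
OTHER = 1.0 - EMBEDDING - DETECTION  # tracking, drawing, video I/O, ...

# Hypothetical speedup factors, chosen only for illustration.
print(f"everything else 1000x faster: {overall_speedup(OTHER, 1000):.2f}x overall")
print(f"embeddings 2x faster:         {overall_speedup(EMBEDDING, 2):.2f}x overall")
print(f"embeddings 4x faster:         {overall_speedup(EMBEDDING, 4):.2f}x overall")
```

Under these numbers, even making the non-model 5% almost free yields only about a 1.05x overall gain, while an embedding model that is 4x faster would give about 2.5x, which is why the embedding step is the place to look.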

utility-aagrawal commented 4 months ago

No worries, and thanks for your response, @aguscas! Yes, I am using an NVIDIA GPU with 16 GB of VRAM. I am now looking for more efficient embedding models :)