roboflow / supervision

We write your reusable computer vision tools. 💜
https://supervision.roboflow.com
MIT License

No detections when using ByteTracker #1408

Closed mariosconsta closed 1 month ago

mariosconsta commented 1 month ago

Search before asking

Bug

I am trying to use supervision with a YOLOv7 model. Because the outputs of YOLOv7 are different from what supervision expects, I created two classes to convert the outputs into the expected format. Here's a screenshot of the detections before and after going through the tracker:

[screenshot: detections before and after going through the tracker]

You can see that in most cases it never tracks the 'car' object, and sometimes it tracks none of the objects.

Here's the code:

import os

import cv2
import numpy as np
import supervision as sv
import torch


class DetectionResults:
    def __init__(self, boxes, scores, cls, names, masks=None, ids=None):
        self.boxes = DetectionBoxes(boxes, cls, scores, ids)
        self.names = names
        self.masks = masks

class DetectionBoxes:
    def __init__(self, boxes, cls, scores, ids=None):
        self.xyxy = boxes
        self.cls = cls
        self.conf = scores
        self.id = ids

class ObjectDetector:
    def __init__(self, droneId, operationId, userId, sessionId, droneName) -> None:
        # some other parameters
        self.model = os.path.join(
            "/app/waldo", "yolov7-W25_rect_1280_736_newDefaults-bs96-best-topk-200.onnx"
        )
        self.sess, self.input_name = self.init_session()
        self.tracker = sv.ByteTrack()
        self.annotator = sv.BoxAnnotator()
        self.label_annotator = sv.LabelAnnotator()

    ######## Some helper functions ######## 

    def process_frame(self, frame, sess, max_outputs):
        names = [
            "car_og",
            "van",
            "truck",
            "building",
            "person",
            "gastank",
            "digger",
            "container",
            "bus",
            "u_pole",
            "car",
            "bike",
            "smoke",
            "solarpanels",
            "arm",
            "plane",
        ]

        image = frame.copy()
        image = image.transpose((2, 0, 1))
        image = np.expand_dims(image, 0)
        image = np.ascontiguousarray(image)

        im = image.astype(np.float32)
        im /= 255

        inp = {self.input_name: im}
        outputs = sess.run(None, inp)[0]

        # Convert the model's output into the format supervision expects
        boxes = np.stack([outputs[:, 1], outputs[:, 2], outputs[:, 3], outputs[:, 4]], axis=1)
        scores = outputs[:, 6]
        cls_ids = outputs[:, 5].astype(int)

        detection_results = DetectionResults(boxes=boxes, scores=scores, cls=cls_ids, names=names)

        detection_results.boxes.xyxy = torch.tensor(detection_results.boxes.xyxy)
        detection_results.boxes.cls = torch.tensor(detection_results.boxes.cls)
        detection_results.boxes.conf = torch.tensor(detection_results.boxes.conf)

        # Update tracker with detections
        detections = sv.Detections.from_ultralytics(detection_results)
        # When I print detections["class_name"] here, I can see that the model detects every object in the frame
        detections = self.tracker.update_with_detections(detections)
        # When I print the detection classes here, most of the detections from before are gone

        thickness = 1
        category_counts = {}

        # Initialize lists for storing results
        box_center_points = []
        labels_info = []
        detection_info = []

        # Calculate center points and create labels_info and detection_info in a single loop
        for i in range(len(detections.xyxy)):
            x0, y0, x1, y1 = detections.xyxy[i]
            cls_id = detections.class_id[i]
            tracker_id = detections.tracker_id[i]
            score = detections.confidence[i]

            center_x = (x0 + x1) / 2
            center_y = (y0 + y1) / 2
            box_center_points.append([center_x, center_y])

            class_name = detection_results.names[cls_id]
            print(class_name)
            labels_info.append(f"#{tracker_id} {class_name}")

            detection_info.append({
                "tracking_id": tracker_id,
                "class_name": class_name,
                "center_point": [center_x, center_y]
            })

            name = names[int(cls_id)]
            name += " " + str(score)

            if max_outputs is not None:
                cv2.putText(
                    frame,
                    f"ONNX network max Outputs: {max_outputs}",
                    (frame.shape[1] - 250, 20),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.4,
                    (0, 0, 255),
                    1,
                )

            # Get the name of the class without the score
            class_name = name.split()[0]

            # Increment the count for this class in the dictionary
            if class_name in category_counts:
                category_counts[class_name] += 1
            else:
                category_counts[class_name] = 1

        annotated_frame = self.annotator.annotate(frame, detections=detections)
        annotated_frame = self.label_annotator.annotate(annotated_frame, detections=detections, labels=labels_info)

        # Write the category counts on the frame
        y_position = 20  # Initial y position
        for category, count in category_counts.items():
            cv2.putText(
                frame,
                f"{category}: {count}",
                (10, y_position),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.4,
                (255, 255, 255),
                1,
            )
            y_position += 20  # Increment the y position for the next text

        # Return the frame with detection boxes
        return annotated_frame, detection_info
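
As an aside, a possible simplification (assuming a recent supervision release whose Detections dataclass accepts xyxy, confidence, class_id, and a data dict): build sv.Detections directly from the ONNX output arrays instead of mimicking the Ultralytics result object. A minimal sketch, using the same [N, 7] output layout assumed above:

def onnx_outputs_to_detections(outputs, names):
    # Assumes outputs is [N, 7]: [batch_id, x0, y0, x1, y1, class_id, score],
    # matching the column slicing used in process_frame above.
    boxes = outputs[:, 1:5].astype(np.float32)
    cls_ids = outputs[:, 5].astype(int)
    scores = outputs[:, 6].astype(np.float32)
    return sv.Detections(
        xyxy=boxes,
        confidence=scores,
        class_id=cls_ids,
        data={"class_name": np.array([names[i] for i in cls_ids])},
    )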

Environment

This is the relevant part of the dockerfile:

FROM python:3.9

# Install required dependencies
RUN apt-get update && apt-get install -y \ 
    netcat-traditional \
    libgl1-mesa-glx && \
    rm -rf /var/lib/apt/lists/*

# Copy the python dependencies
COPY cvn/requirements.txt /app/requirements.txt
# Install Python dependencies
RUN if [ "$(uname)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then \
    pip install onnxruntime-silicon==1.16.0; \
    else \
    pip install onnxruntime-gpu==1.15.1; \
    fi

RUN pip install -r /app/requirements.txt

This is the requirements.txt file:

mysql-connector-python
flask
pytz
requests==2.24.0

# computer vision
opencv-python
torch==2.0.0 # ~4 mins to install
torchvision

# waldo
numpy==1.23.5

# Library with various utilities for computer vision tasks
supervision

# disaster classification
Pillow

# crowd localization
yacs
easydict

Minimal Reproducible Example

The code I shared above should be more than enough for a test drive. Basically, after the process_frame function returns the annotated frame and the detection info, they get passed to a database and displayed on a platform for the user. Anyway, that's probably irrelevant to this issue.

Additional

I tried changing the various parameters of ByteTracker such as:

1) track_activation_threshold
2) lost_track_buffer
3) minimum_matching_threshold
4) frame_rate

But it did not solve the problem. One final thing: because I am running on CPU right now, the frame rate is really low, lower than 1 FPS. Does FPS matter for the tracker?

I also tried keeping almost all detections regardless of confidence by doing detections = detections[detections.confidence > 0.05], but it yielded no improvement. The screenshot below is with the following ByteTrack settings:

track_activation_threshold = 0.05
lost_track_buffer = 60
minimum_matching_threshold = 0.1 (I tried with 0.9 and it was the same)
frame_rate = 1 (tried with the default as well)
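
Roughly how those settings look when passed to the constructor (a sketch; parameter names as in recent supervision releases):

tracker = sv.ByteTrack(
    track_activation_threshold=0.05,
    lost_track_buffer=60,
    minimum_matching_threshold=0.1,  # also tried 0.9, same result
    frame_rate=1,                    # also tried the default
)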

[screenshot: results with the ByteTrack settings above]

At this link https://drive.google.com/file/d/1YbyuUXfnr3N6Ukyc2ODk3t0FVNTk4X2w/view?usp=sharing you will find the weights we used and the video. We are trying to track the boat (I know the label says 'car'; ignore that).

For the code, we basically modified the existing code of this repo https://github.com/stephansturges/WALDO/blob/master/playground/run_local_network_on_videos_onnxruntime.py

Are you willing to submit a PR?

SkalskiP commented 1 month ago

Hi @mariosconsta 👋🏻

Could you reproduce this experiment in Google Colab? Given that you are using a model that is not one of our standard supported models, I would like to simplify the environment setup as much as possible to help you as quickly as possible.

mariosconsta commented 1 month ago

Hi @mariosconsta 👋🏻

Could you reproduce this experiment in Google Colab? Given that you are using a model that is not one of our standard supported models, I would like to simplify the environment setup as much as possible to help you as quickly as possible.

Hey @SkalskiP! Sorry for taking a bit longer to reply, I am a bit baffled by the results. First, let me share the Colab notebook. The tracker in the notebook works.

The only difference between the Colab and our codebase is the I/O. In the notebook I read the video directly using cap.read(); in our main code, we read an RTMP stream, save each frame in a database, and send the path of the frame to the detector. Here's a code snippet:

    def process_frame(self, frame, sess, max_outputs):
        names = [...]  # list with class names (same as above)

        image = frame.copy()
        image = image.transpose((2, 0, 1))
        image = np.expand_dims(image, 0)
        image = np.ascontiguousarray(image)

        im = image.astype(np.float32)
        im /= 255

        inp = {self.input_name: im}
        outputs = sess.run(None, inp)[0]

        # Convert the model's output into the format supervision expects
        boxes = np.stack([outputs[:, 1], outputs[:, 2], outputs[:, 3], outputs[:, 4]], axis=1)
        scores = outputs[:, 6]
        cls_ids = outputs[:, 5].astype(int)

        detection_results = DetectionResults(boxes=boxes, scores=scores, cls=cls_ids, names=names)

        detection_results.boxes.xyxy = torch.tensor(detection_results.boxes.xyxy)
        detection_results.boxes.cls = torch.tensor(detection_results.boxes.cls)
        detection_results.boxes.conf = torch.tensor(detection_results.boxes.conf)

        # Update tracker with detections
        detections = sv.Detections.from_ultralytics(detection_results)
        detections = detections[detections.confidence > 0.05]
        print(f'BEFORE TRACKER: {detections["class_name"]}')
        detections = self.tracker.update_with_detections(detections)
        print(f'AFTER TRACKER: {detections["class_name"]}\n\n')

        thickness = 1
        category_counts = {}

        # Initialize lists for storing results
        box_center_points = []
        labels_info = []
        detection_info = []

        # Calculate center points and create labels_info and detection_info in a single loop
        for i in range(len(detections.xyxy)):
            x0, y0, x1, y1 = detections.xyxy[i]
            cls_id = detections.class_id[i]
            tracker_id = detections.tracker_id[i]
            score = detections.confidence[i]

            center_x = (x0 + x1) / 2
            center_y = (y0 + y1) / 2
            box_center_points.append([center_x, center_y])

            class_name = detection_results.names[cls_id]
            labels_info.append(f"#{tracker_id} {class_name}")

            detection_info.append({
                "tracking_id": tracker_id,
                "class_name": class_name,
                "center_point": [center_x, center_y]
            })

            name = names[int(cls_id)]
            name += " " + str(score)

            if max_outputs is not None:
                cv2.putText(
                    frame,
                    f"ONNX network max Outputs: {max_outputs}",
                    (frame.shape[1] - 250, 20),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.4,
                    (0, 0, 255),
                    1,
                )

            # Get the name of the class without the score
            class_name = name.split()[0]

            # Increment the count for this class in the dictionary
            if class_name in category_counts:
                category_counts[class_name] += 1
            else:
                category_counts[class_name] = 1

        annotated_frame = self.annotator.annotate(frame, detections=detections)
        annotated_frame = self.label_annotator.annotate(annotated_frame, detections=detections, labels=labels_info)

        # Write the category counts on the frame
        y_position = 20  # Initial y position
        for category, count in category_counts.items():
            cv2.putText(
                frame,
                f"{category}: {count}",
                (10, y_position),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.4,
                (255, 255, 255),
                1,
            )
            y_position += 20  # Increment the y position for the next text

        # Return the frame with detection boxes
        return annotated_frame, detection_info

    def run_inference(self, img_path, output_img_path):
        input_frame_raw = cv2.imread(img_path)
        height, width, channels = input_frame_raw.shape
        expected_width, expected_height = self.get_resolution_from_model_path(
            self.model
        )  # Get expected resolution

        max_outputs = self.get_max_outputs()
        frame = self.resize_and_pad(input_frame_raw, expected_width, expected_height)
        out_frame, detection_info = self.process_frame(
            frame=frame, sess=self.sess, max_outputs=max_outputs
        )

        cv2.imwrite(output_img_path, out_frame)
        frameId = database.queries.saveFrame(self.sessionId, output_img_path)

        # loop through all the detected points and calculate their coordinates
        telemetry = database.queries.getDroneLatestTelemetry(
            self.droneId
        )  # get latest drone telemetry

        # TODO: get fov from database and adjust for zoom level(?)
        fov_horizontal = 68  # FOR MAVIC
        fov_vertical = 40  # mavic

        gimbal_angle = telemetry[4] + 90

        detectedCoords = []
        for info in detection_info:
            lat, lon = self.pixel_to_gps(
                info['center_point'],
                (width, height),
                (fov_horizontal, fov_vertical),
                (telemetry[0], telemetry[1], telemetry[2], telemetry[3], gimbal_angle),
            )
            detectedCoords.append([lat, lon])
            database.queries.saveDetectedObject(lat, lon, info['class_name'], int(info['tracking_id']), 1.2, self.operationId, self.droneId, self.sessionId, frameId)

    def start_loop(self):
        output_folder = os.path.join(
            "/media", f"detector_session{self.sessionId}_{self.droneName}"
        )
        os.makedirs(output_folder, exist_ok=True)
        startTime = time.time()
        detectionInterval = 1 / int(
            os.environ.get("COMPUTER_VISION_FPS")
        )*2  # run detection X frames per second
        nextDetectionTime = startTime + detectionInterval
        frameCounter = 0

        while not self.stop:
            currentTime = time.time()
            if currentTime >= nextDetectionTime:
                frameCounter += 1
                nextDetectionTime += detectionInterval

                latestLiveStreamFrame = database.queries.getLatestLiveStreamFrame(
                    self.droneId
                )
                image_path = latestLiveStreamFrame[0]

                # prepare frame filename
                currentDateTime = datetime.now(timezone).time()
                formattedDateTime = currentDateTime.strftime("%H-%M-%S")
                frame_name = f"detector-frame{frameCounter:05d}_{formattedDateTime}.jpg"
                self.run_inference(
                    img_path=image_path,
                    output_img_path=os.path.join(output_folder, frame_name),
                )

        print(f"{self.droneName}: Waldo detector stopped.")

    def stopDetector(self):
        self.stop = True

Sorry for the code dump, I just want to make our workflow clear. As you can see, the only difference between this and the Colab notebook is how the video/stream is being processed.

Is this behavior normal? The tracker works with YOLOv7, as you can see from the Colab, so the issue must be in how we read and process each frame.
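
For what it's worth, one quick way to confirm the effective frame rate the tracker receives is to time each process_frame call (detector and frame below are placeholders for the ObjectDetector instance and a preprocessed frame):

import time

t0 = time.time()
out_frame, detection_info = detector.process_frame(
    frame=frame, sess=detector.sess, max_outputs=None
)
dt = time.time() - t0
print(f"process_frame took {dt:.2f}s (~{1 / dt:.2f} FPS reaching the tracker)")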

rolson24 commented 1 month ago

Hi @mariosconsta,

Thanks for sharing the Colab notebook! I ran it, and like you said, it seems to be doing the correct thing. The tracker drops detections that it can't track, which is why a few detections get removed after the tracker. But it sounds like the tracker was performing much worse on your original stream. It looks like you have your inference function inside your capture loop. This is most likely the problem, because if your system can only process at 1 FPS, then you can only capture video at 1 FPS. The tracker is not designed to handle such low frame rates because it relies on the assumption that objects don't move very far between two frames. If the source frame rate (in this case the 1 FPS of capturing the stream) is too low, the tracker won't be able to track very well.

Does the processing need to happen in real-time? If not, you could simply capture in real-time and run detection and tracking offline. The other option is just making the processing faster. The tracker needs at least 15 FPS to perform correctly.
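
A rough sketch of the offline option (RTMP_URL, expected_w, expected_h, and detector below are placeholders for your stream URL, model input size, and ObjectDetector instance; sv.get_video_frames_generator is supervision's helper for iterating over a saved video):

import cv2
import supervision as sv

# 1) Capture in real time at the source frame rate; no inference in this loop.
cap = cv2.VideoCapture(RTMP_URL)
fps = cap.get(cv2.CAP_PROP_FPS) or 30
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("capture.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    writer.write(frame)
writer.release()

# 2) Later, run detection + tracking offline on every frame of the recording,
#    so ByteTrack sees the full source frame rate instead of ~1 FPS.
detector.tracker = sv.ByteTrack(frame_rate=int(fps))
for frame in sv.get_video_frames_generator("capture.mp4"):
    frame = detector.resize_and_pad(frame, expected_w, expected_h)  # same preprocessing as run_inference
    annotated, info = detector.process_frame(frame=frame, sess=detector.sess, max_outputs=None)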

mariosconsta commented 1 month ago

@rolson24 This was my concern as well. On my current setup the FPS is low, around 1 FPS like I said. When we tested it on a different system with a GPU, it was better, but still not good enough (presumably because of the low FPS).

The processing needs to happen in real-time, yes; the plan is to detect and track objects in real-time using a drone. One easy solution I can think of is to try a smaller model, which will hopefully run faster.

rolson24 commented 1 month ago

@mariosconsta Yeah, I think a smaller model would help. You could also reduce the model's input frame size, because the current input of 720x1280 has more than 2x the pixels of 640x640, which significantly impacts processing speed. The last thing you can do, if you are using an NVIDIA GPU, is to export the ONNX model to TensorRT, which further optimizes the model for processing speed. If you search for YOLOv7 TensorRT export, I'm sure there will be some examples of how to do it. TensorRT install instructions
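
If you go the onnxruntime route on an NVIDIA GPU, requesting the TensorRT execution provider looks roughly like this (whether TensorrtExecutionProvider is actually available depends on how your onnxruntime-gpu build was compiled, so check the printed list; model_path is a placeholder for the .onnx file from the issue):

import onnxruntime as ort

providers = [
    "TensorrtExecutionProvider",  # only available in TensorRT-enabled onnxruntime-gpu builds
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
sess = ort.InferenceSession(model_path, providers=providers)
print(sess.get_providers())  # shows which providers were actually loaded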

mariosconsta commented 1 month ago

@rolson24 Gotcha! Thank you for your time mate, I will go ahead and close the issue, since the problem lies with our implementation and not the library itself.

Have a lovely weekend!