Closed mariosconsta closed 1 month ago
Hi @mariosconsta 👋🏻
Could you reproduce this experiment in Google Colab? Given that you are using a model that is not one of our standard supported models, I would like to simplify the environment setup as much as possible to help you as quickly as possible.
Hey @SkalskiP! Sorry for taking a bit longer to reply; I am a bit baffled by the results. First, let me share the Colab notebook. The tracker in the notebook works.
The only difference between the Colab notebook and our codebase is the I/O. In the notebook I read the video directly using cap.read(frame), whereas our main code reads an RTMP stream, saves each frame in a database, and sends the frame's path to the detector. Here's a code snippet:
def process_frame(self, frame, sess, max_outputs):
    names = [ #### list with class names]
    image = frame.copy()
    image = image.transpose((2, 0, 1))
    image = np.expand_dims(image, 0)
    image = np.ascontiguousarray(image)
    im = image.astype(np.float32)
    im /= 255
    inp = {self.input_name: im}
    outputs = sess.run(None, inp)[0]

    # Convert model's output to be compatible with supervisions required input
    boxes = np.stack([outputs[:, 1], outputs[:, 2], outputs[:, 3], outputs[:, 4]], axis=1)
    scores = outputs[:, 6]
    cls_ids = outputs[:, 5].astype(int)
    detection_results = DetectionResults(boxes=boxes, scores=scores, cls=cls_ids, names=names)
    detection_results.boxes.xyxy = torch.tensor(detection_results.boxes.xyxy)
    detection_results.boxes.cls = torch.tensor(detection_results.boxes.cls)
    detection_results.boxes.conf = torch.tensor(detection_results.boxes.conf)

    # Update tracker with detections
    detections = sv.Detections.from_ultralytics(detection_results)
    detections = detections[detections.confidence > 0.05]
    print(f'BEFORE TRACKER: {detections["class_name"]}')
    detections = self.tracker.update_with_detections(detections)
    print(f'AFTER TRACKER: {detections["class_name"]}\n\n')

    thickness = 1
    category_counts = {}

    # Initialize lists for storing results
    box_center_points = []
    labels_info = []
    detection_info = []

    # Calculate center points and create labels_info and detection_info in a single loop
    for i in range(len(detections.xyxy)):
        x0, y0, x1, y1 = detections.xyxy[i]
        cls_id = detections.class_id[i]
        tracker_id = detections.tracker_id[i]
        score = detections.confidence[i]
        center_x = (x0 + x1) / 2
        center_y = (y0 + y1) / 2
        box_center_points.append([center_x, center_y])
        class_name = detection_results.names[cls_id]
        labels_info.append(f"#{tracker_id} {class_name}")
        detection_info.append({
            "tracking_id": tracker_id,
            "class_name": class_name,
            "center_point": [center_x, center_y]
        })
        name = names[int(cls_id)]
        name += " " + str(score)
        if max_outputs is not None:
            cv2.putText(
                frame,
                f"ONNX network max Outputs: {max_outputs}",
                (frame.shape[1] - 250, 20),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.4,
                (0, 0, 255),
                1,
            )
        # Get the name of the class without the score
        class_name = name.split()[0]
        # Increment the count for this class in the dictionary
        if class_name in category_counts:
            category_counts[class_name] += 1
        else:
            category_counts[class_name] = 1

    annotated_frame = self.annotator.annotate(frame, detections=detections)
    annotated_frame = self.label_annotator.annotate(annotated_frame, detections=detections, labels=labels_info)

    # Write the category counts on the frame
    y_position = 20  # Initial y position
    for category, count in category_counts.items():
        cv2.putText(
            frame,
            f"{category}: {count}",
            (10, y_position),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.4,
            (255, 255, 255),
            1,
        )
        y_position += 20  # Increment the y position for the next text

    # Return the frame with detection boxes
    return annotated_frame, detection_info

def run_inference(self, img_path, output_img_path):
    input_frame_raw = cv2.imread(img_path)
    height, width, channels = input_frame_raw.shape
    expected_width, expected_height = self.get_resolution_from_model_path(
        self.model
    )  # Get expected resolution
    max_outputs = self.get_max_outputs()
    frame = self.resize_and_pad(input_frame_raw, expected_width, expected_height)
    out_frame, detection_info = self.process_frame(
        frame=frame, sess=self.sess, max_outputs=max_outputs
    )
    cv2.imwrite(output_img_path, out_frame)
    frameId = database.queries.saveFrame(self.sessionId, output_img_path)

    # loop through all the detected points and calculate their coordinates
    telemetry = database.queries.getDroneLatestTelemetry(
        self.droneId
    )  # get latest drone telemetry
    # TODO: get fov from database and adjust for zoom level(?)
    fov_horizontal = 68  # FOR MAVIC
    fov_vertical = 40  # mavic
    gimbal_angle = telemetry[4] + 90
    detectedCoords = []
    for info in detection_info:
        lat, lon = self.pixel_to_gps(
            info['center_point'],
            (width, height),
            (fov_horizontal, fov_vertical),
            (telemetry[0], telemetry[1], telemetry[2], telemetry[3], gimbal_angle),
        )
        detectedCoords.append([lat, lon])
        database.queries.saveDetectedObject(lat, lon, info['class_name'], int(info['tracking_id']), 1.2, self.operationId, self.droneId, self.sessionId, frameId)

def start_loop(self):
    output_folder = os.path.join(
        "/media", f"detector_session{self.sessionId}_{self.droneName}"
    )
    os.makedirs(output_folder, exist_ok=True)
    startTime = time.time()
    detectionInterval = 1 / int(
        os.environ.get("COMPUTER_VISION_FPS")
    )*2  # run detection X frames per second
    nextDetectionTime = startTime + detectionInterval
    frameCounter = 0
    while not self.stop:
        currentTime = time.time()
        if currentTime >= nextDetectionTime:
            frameCounter += 1
            nextDetectionTime += detectionInterval
            latestLiveStreamFrame = database.queries.getLatestLiveStreamFrame(
                self.droneId
            )
            image_path = latestLiveStreamFrame[0]
            # prepare frame filename
            currentDateTime = datetime.now(timezone).time()
            formattedDateTime = currentDateTime.strftime("%H-%M-%S")
            frame_name = f"detector-frame{frameCounter:05d}_{formattedDateTime}.jpg"
            self.run_inference(
                img_path=image_path,
                output_img_path=os.path.join(output_folder, frame_name),
            )
    print(f"{self.droneName}: Waldo detector stopped.")

def stopDetector(self):
    self.stop = True

Sorry for the code dump, I just want to make our workflow clear. As you can see, the only difference between this and the Colab notebook is how the video/stream is being processed.
Is this behavior normal? The tracker works with YOLO V7, as you can see from the Colab notebook, so the issue must be in how we read and process each frame.
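A side note on the scheduling in start_loop above: because of operator precedence, `1 / int(fps) * 2` evaluates to `(1 / fps) * 2`, so the interval is twice `1 / fps`. The sketch below only mirrors that expression to show the resulting detection rate; the function names are made up for illustration, and whether the doubling is intentional is not stated in the post.

```python
def detection_interval(fps_setting: str) -> float:
    """Mirror the expression from start_loop: 1 / int(fps) * 2.

    Division and multiplication have equal precedence and associate
    left to right, so this is (1 / fps) * 2, not 1 / (fps * 2).
    """
    return 1 / int(fps_setting) * 2


def detections_per_second(fps_setting: str) -> float:
    """Effective detection rate implied by the interval above."""
    return 1 / detection_interval(fps_setting)


if __name__ == "__main__":
    # Hypothetical values for the COMPUTER_VISION_FPS environment variable.
    for setting in ("2", "10", "30"):
        print(
            f"COMPUTER_VISION_FPS={setting}: "
            f"interval={detection_interval(setting):.3f}s, "
            f"rate={detections_per_second(setting):.1f} detections/s"
        )
```

With this expression, a setting of 10 schedules a detection every 0.2 s, which is about half the configured rate. This is an upper bound on the scheduler only; if inference itself takes longer than the interval, the achieved rate is lower still.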
Hi @mariosconsta,
Thanks for sharing the Colab notebook! I ran it and, like you said, it seems to be doing the correct thing. The tracker drops detections that it can't track, which is why a few detections get removed after the tracker. But it sounds like the tracker was performing much worse with your original stream. It looks like you have your inference function inside your capture loop. This is most likely the problem: if your system can only process at 1 FPS, then you can only capture video at 1 FPS. The tracker is not designed to handle such low frame rates because it relies on the assumption that objects don't move very far between two frames. If the source frame rate (in this case the 1 FPS at which the stream is captured) is too low, the tracker won't be able to track very well.
Does the processing need to happen in real time? If not, you could simply capture in real time and run detection and tracking offline. The other option is to make the processing faster. The tracker needs at least 15 FPS to perform correctly.
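To make the frame-rate argument concrete, here is a minimal sketch of how the overlap (IoU) between an object's box in two consecutive frames shrinks as the frame rate drops. The box size and speed are hypothetical numbers chosen for illustration, and the sketch ignores the motion prediction a real tracker applies before matching, so treat it as an intuition aid rather than a model of ByteTrack.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def consecutive_frame_iou(box, speed_px_per_s, fps):
    """IoU between a box and the same box one frame later.

    Assumes purely horizontal motion at constant speed.
    """
    shift = speed_px_per_s / fps
    moved = (box[0] + shift, box[1], box[2] + shift, box[3])
    return iou(box, moved)


# Hypothetical object: a 100x60 px box moving at 120 px/s.
BOX = (100.0, 100.0, 200.0, 160.0)
SPEED = 120.0

if __name__ == "__main__":
    for fps in (30, 15, 5, 1):
        print(f"{fps:>2} FPS -> IoU {consecutive_frame_iou(BOX, SPEED, fps):.3f}")
```

In this toy setup the boxes still overlap heavily at 15 to 30 FPS, while at 1 FPS the object has moved by more than its own width and the overlap is zero, so an IoU-based association has nothing to match on.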
@rolson24 This was my concern as well. On my current setup the frame rate is low, around 1 FPS as I said. When we tested it on a different system with a GPU, it was better, but still not good enough (presumably because of the low FPS).
The processing needs to happen in real-time yes, the plan is to detect and track objects in real-time using a drone. One easy solution that I can think of, is to try and use a smaller model, which will hopefully run faster.
@mariosconsta Yeah, I think a smaller model would help. You could also reduce the model's input frame size: the current 720x1280 input has more than 2x as many pixels as 640x640, which significantly impacts processing speed. Finally, if you are using an NVIDIA GPU, you can export the ONNX model to TensorRT, which further optimizes the model for processing speed. If you search for YOLOv7 TensorRT export, I'm sure there will be some examples of how to do it. TensorRT install instructions
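The pixel-count comparison can be checked with a few lines of arithmetic. This is only a sketch of the ratio; how closely inference time follows pixel count depends on the model and hardware, so the ratio is a rough guide rather than a predicted speedup.

```python
def pixel_count(width: int, height: int) -> int:
    """Number of pixels in a frame of the given size."""
    return width * height


current_input = pixel_count(1280, 720)  # input size mentioned in the thread
square_input = pixel_count(640, 640)    # suggested smaller input size
ratio = current_input / square_input

if __name__ == "__main__":
    print(f"1280x720: {current_input} px")
    print(f"640x640:  {square_input} px")
    print(f"ratio:    {ratio:.2f}x")
```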
@rolson24 Gotcha! Thank you for your time mate, I will go ahead and close the issue, since the problem lies with our implementation and not the library itself.
Have a lovely weekend!
Search before asking
Bug
I am trying to use supervision with a YOLO V7 model. Because the outputs of YOLO V7 differ from what supervision expects, I created 2 classes to convert the outputs to the expected format. Here's a screenshot of the detections before and after going through the tracker:
You can see in most cases it will never track the 'car' object and sometimes none of the objects.
Here's the code:
Environment
This is the relevant part of the dockerfile:
This is the requirements.txt file:
Minimal Reproducible Example
The code I shared above should be more than enough for a test drive. Basically after the process frame function returns the annotated frame and the detection info, they get passed into a database and displayed on a platform for the user. Anyway, that's irrelevant to this issue I suppose.
Additional
I tried changing the various parameters of ByteTracker such as:
1) track_activation_threshold
2) lost_track_buffer
3) minimum_matching_threshold
4) frame_rate
But it did not solve the problem. One final thing is, because I am running on CPU right now, the frame rate is really low. Lower than 1 FPS. Does FPS matter for the tracker?
Also I tried grabbing almost all the detections regardless of confidence by doing detections = detections[detections.confidence > 0.05] but it also yielded no improvements. The screenshot below is with the following bytetrack settings:
track_activation_threshold = 0.05
lost_track_buffer = 60
minimum_matching_threshold = 0.1 (I tried with 0.9 and it was the same)
frame_rate = 1 (tried with the default as well)
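One interaction between these settings is worth spelling out. In the original ByteTrack reference implementation, the number of frames a lost track is kept alive is derived from both the frame rate and the buffer: `int(frame_rate / 30.0 * track_buffer)`. The sketch below assumes supervision's ByteTrack follows the same formula for `frame_rate` and `lost_track_buffer`; that is an assumption that should be checked against the installed version, and `max_time_lost` here is a standalone helper, not a library call.

```python
def max_time_lost(frame_rate: float, lost_track_buffer: int) -> int:
    """Frames a lost track survives before being removed.

    Formula from the original ByteTrack reference implementation;
    assumed (not verified here) to match supervision's ByteTrack.
    """
    return int(frame_rate / 30.0 * lost_track_buffer)


if __name__ == "__main__":
    # (frame_rate, lost_track_buffer) pairs; the last one matches the
    # settings listed above.
    for frame_rate, buffer in ((30, 30), (30, 60), (15, 60), (1, 60)):
        frames = max_time_lost(frame_rate, buffer)
        print(f"frame_rate={frame_rate}, lost_track_buffer={buffer} -> {frames} frames")
```

Under that assumption, setting frame_rate to 1 shrinks the effective buffer to only a couple of frames even with lost_track_buffer at 60, so lowering frame_rate to match a slow pipeline does not by itself give tracks more time to be re-matched.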
On this link https://drive.google.com/file/d/1YbyuUXfnr3N6Ukyc2ODk3t0FVNTk4X2w/view?usp=sharing you will find the weights we used and the video. We are trying to track the boat (I know the label says 'car', ignore that).
For the code, we basically modified the existing code of this repo https://github.com/stephansturges/WALDO/blob/master/playground/run_local_network_on_videos_onnxruntime.py
Are you willing to submit a PR?