Multiple bounding boxes to detect the same object when using webcam or RTSP stream

Chris7911 commented 4 years ago

🐛 Bug

In the view-img window, each new bbox is drawn inside the previous one, so multiple bboxes pile up on the same object when using a webcam or RTSP stream. I'm fairly sure the cause is not iou-thres. As far as I can tell, the issue started after commit 6daebd3 on Sep 25, 2019; it looks like something changed in /utils/datasets.py.

To Reproduce

python detect.py --source 0

Expected behavior

Each object is detected by exactly one bounding box.

Environment

Test1:

OS: [Ubuntu 18.04]
GPU: [2080 Ti]

Test2:

OS: [Windows 10]
GPU: [1080 Ti]

glenn-jocher commented 4 years ago

@Chris7911 do these accumulate over multiple frames or does this happen in each frame?

glenn-jocher commented 4 years ago

If you run this command, do you also see multiple boxes? This works correctly on macOS; unfortunately, we do not have any local Linux machines to test on.

python3 detect.py --source http://112.50.243.8/PLTV/88888888/224/3221225900/1.m3u8

Chris7911 commented 4 years ago

Running the above command on my MacBook Pro works correctly most of the time, but if you watch the output closely, the problem I described does occasionally appear. After several tests, I found that it runs much better on macOS than on Ubuntu 18.04 and Windows 10, where boxes accumulate on the same object almost every second.

Chris7911 commented 4 years ago

It looks like boxes accumulate over multiple frames as shown below:

frame0: 1589859678 6442125

frame1: 1589859678 7199967

frame2: 1589859678 7997818

Thank you for your help!!

glenn-jocher commented 4 years ago

@Chris7911 hmm, this seems to be OS-specific then, as it is not reproducible on macOS.

We are extremely limited on resources, so we cannot look into this, but if you find the cause of the problem and implement a fix, please submit a PR! I will leave this open.

Chris7911 commented 4 years ago

In the update function of the LoadStreams class, we found that only every 4th frame is read, by setting the variable n to 4, apparently to keep the stream threads in step with the detection thread. Since you said the problem seems to be OS-specific, we changed n from 4 to 2, which fixed the problem on our Windows 10 machine.
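
As a rough illustration of the frame skipping described in this comment, the sketch below models a reader that keeps only every n-th frame it grabs. This is a simplified model and not the repository's code: `frames_kept` is a hypothetical helper, and the frame count of 12 is an arbitrary assumption.

```python
def frames_kept(total_frames, n):
    """Return the indices of frames that are kept when only every
    n-th grabbed frame is stored (hypothetical helper, for illustration)."""
    kept = []
    counter = 0
    for index in range(total_frames):
        counter += 1       # one more frame grabbed from the stream
        if counter == n:   # keep only every n-th frame
            kept.append(index)
            counter = 0
    return kept


kept_n4 = frames_kept(12, 4)  # n = 4: the setting described above
kept_n2 = frames_kept(12, 2)  # n = 2: the value that helped on Windows 10
print(kept_n4, kept_n2)
```

With a smaller n the buffer is refreshed more often, so the detection loop is less likely to see the same frame twice. That is consistent with the change helping, though it would reduce the symptom rather than remove its cause.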

droogg commented 4 years ago

@glenn-jocher Thank you for your incredible work! I want to confirm the behavior that @Chris7911 described. When displaying a video stream from a webcam, I get a huge number of bounding boxes for one object; they accumulate over several frames, are reset, and then accumulate again. OS: Ubuntu 18.04.4 LTS

I see this behavior when using a webcam, and also when running the command you proposed for verification:

python3 detect.py --source http://112.50.243.8/PLTV/88888888/224/3221225900/1.m3u8

Screenshot_20200602_122850

droogg commented 4 years ago

Examining your project and the detect.py file, I found that frames are read from the video stream with the LoadStreams class from datasets.py. A LoadWebcam class is also implemented in datasets.py, but I could not find any place in the code where it is used. If I change:

if webcam:
  view_img = True
  torch.backends.cudnn.benchmark = True  # set True to speed up constant image size inference
  dataset = LoadStreams(source, img_size=imgsz)

to:

if webcam:
  view_img = True
  torch.backends.cudnn.benchmark = True  # set True to speed up constant image size inference
  dataset = LoadWebcam(source, img_size=imgsz)

and

        for i, det in enumerate(pred):  # detections for image i
            if webcam:  # batch_size >= 1
                p, s, im0 = path[i], '%g: ' % i, im0s[0]

to:

        for i, det in enumerate(pred):  # detections for image i
            if webcam:  # batch_size >= 1
                p, s, im0 = path[i], '%g: ' % i, im0s

I get the correct output as expected: Screenshot_20200602_162611

For my webcam, this fixes the problem, although the FPS does not seem to be at its maximum. For the HTTP stream, however, this change is a serious regression: it turns the video into a slide show.

Looking at the LoadStreams class, I suspect this might be related to threading.Thread. Is it possible that the streams layer information on top of each other when the output is drawn with cv2?

I ask because debugging the detector shows that all the data and images change correctly.

Unfortunately, I don't know threading well, so my questions may be naive.

glenn-jocher commented 4 years ago

@droogg yes, these are simply two different dataloaders. LoadWebcam() is single-thread, while LoadStreams() is multithreaded.

I ran the test code again: python3 detect.py --source http://112.50.243.8/PLTV/88888888/224/3221225900/1.m3u8

It normally works well, but when the stream freezes the boxes do begin to pile up; I'm not sure exactly why. It could be that inference is being run in a loop on older frames in the absence of updates.

LoadStreams() runs inference 100% of the time, irrespective of whether the batch contains new frames or not.
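
A toy model of that behavior, under stated assumptions: `LatestFrameLoader` is a hypothetical class, not the repository's LoadStreams, and a `bytearray` stands in for an image. The loader hands back whatever is currently in its buffer, so while the stream is stalled, consecutive reads return the very same object.

```python
class LatestFrameLoader:
    """Toy loader that always returns the current buffer contents
    (hypothetical, for illustration only)."""

    def __init__(self, first_frame):
        self.img = first_frame

    def update(self, frame):
        # Would be called by the stream thread when a new frame arrives.
        self.img = frame

    def __iter__(self):
        return self

    def __next__(self):
        # No check for whether this frame was already handed out.
        return self.img


loader = LatestFrameLoader(bytearray(4))
a = next(loader)
b = next(loader)                  # stream stalled: no update() in between
same_while_stalled = a is b       # both reads return the same object

loader.update(bytearray(4))       # a new frame finally arrives
c = next(loader)
fresh_after_update = c is not a   # now a different object is returned
```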

droogg commented 4 years ago

@glenn-jocher Thank you very much for your quick and clear answer! Yes, I understand what you mean, and your point about inference running on old frames makes sense to me. Moreover, each new bounding box appears inside the old one, which indicates that the detector receives not the previous frame but a modified version of it, on which the bounding box has already been drawn with cv2. The function plot_one_box modifies the image itself, not a copy. Therefore, if we pass this function a copy of the image and draw on that copy, the original is left unchanged. Indeed, if I change this part of the code:

            if webcam:  # batch_size >= 1
                p, s, im0 = path[i], '%g: ' % i, im0s[0]

to:

            if webcam:  # batch_size >= 1
                p, s, im0 = path[i], '%g: ' % i, im0s[0].copy()

I get the expected output using the LoadStreams class and threads. This solves the output problem for me. Perhaps it will be useful for the project or for people facing the same problem.
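
The effect of drawing in place versus drawing on a copy can be sketched without cv2. In this minimal model, which is an assumption-laden stand-in rather than the project's code, a `bytearray` plays the role of the image and the hypothetical `draw_box` mutates its argument the way an in-place drawing call would.

```python
WIDTH, HEIGHT = 8, 6


def new_frame():
    """Return an all-zero grayscale 'image' stored as a flat bytearray."""
    return bytearray(WIDTH * HEIGHT)


def draw_box(img, x0, y0, x1, y1, value=255):
    """Draw a rectangle outline directly into img (mutates its argument)."""
    for x in range(x0, x1 + 1):
        img[y0 * WIDTH + x] = value
        img[y1 * WIDTH + x] = value
    for y in range(y0, y1 + 1):
        img[y * WIDTH + x0] = value
        img[y * WIDTH + x1] = value


def count_marked(img):
    """Count pixels that have been drawn on."""
    return sum(1 for px in img if px)


# Drawing in place: the buffer that inference will read again is modified.
stale = new_frame()
draw_box(stale, 1, 1, 6, 4)
in_place_marked = count_marked(stale)

# Drawing on a copy: the buffer stays clean for the next inference pass.
clean = new_frame()
display = clean.copy()
draw_box(display, 1, 1, 6, 4)
copy_marked = count_marked(clean)
```

If the stream does not deliver a new frame, the first variant feeds an image with a box already painted on it back into the detector, which matches the nested boxes seen in the screenshots.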

Threads are not the best solution in every case. I also notice that the LoadWebcam class works much better with my webcam, while the LoadStreams class works better with the HTTP stream. What could explain this? And wouldn't the best option be to let the user choose the method of loading the video stream?

glenn-jocher commented 4 years ago

@droogg ah of course, that makes perfect sense! We are modifying the original, and then if it is not replaced, inference runs on the modified image, which also explains the reason the box shifts slightly each new frame. It keeps happening until the stream finally downloads a new frame.

About your last comment: in theory the multithreaded solution should have a higher throughput, as it loads and preprocesses images in background threads rather than in the main thread. But yes, as you can see, it adds complication as well, so in some situations the simple LoadWebcam() method is better.
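
For readers unfamiliar with the pattern, here is a minimal sketch of a background-thread reader, assuming a plain Python iterable as the frame source. `BackgroundReader` is a hypothetical name and this is not the LoadStreams implementation; it only shows the idea of a worker thread filling a shared slot that the main thread reads without blocking.

```python
import threading


class BackgroundReader:
    """Read frames from an iterable in a worker thread and expose the
    most recent one (hypothetical sketch, for illustration only)."""

    def __init__(self, source):
        self.latest = None
        self.count = 0
        self._lock = threading.Lock()
        self._thread = threading.Thread(
            target=self._run, args=(source,), daemon=True
        )
        self._thread.start()

    def _run(self, source):
        # Worker thread: keep overwriting the shared slot with new frames.
        for frame in source:
            with self._lock:
                self.latest = frame
                self.count += 1

    def read(self):
        # Main thread: returns the newest frame without waiting for one,
        # so the same frame can be returned more than once.
        with self._lock:
            return self.latest

    def join(self):
        self._thread.join()


reader = BackgroundReader(range(5))  # integers stand in for frames
reader.join()                        # wait until the source is exhausted
latest = reader.read()
```

The main loop never waits on I/O, which is where the throughput gain comes from. The trade-off is that `read()` may return a frame that has already been processed, so any in-place modification of that frame is visible on the next pass.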

ultralytics / yolov3