cosama opened this issue 5 years ago
I have the same issue. I have entire rosbags with the same BoundingBoxes message published 6 times for every single frame. The only difference is the header; the image_header and bounding_boxes are identical.
I noticed the same behavior. Did you guys find where to "fix" it?
No. I studied the code. Even if no image is being published, bounding box messages still get published. From what I could gather the node waits for images to start processing, but once it receives an image the detection thread is always working and using the freshest image in memory. And the publish thread is always publishing bounding box messages based on the freshest detection available. From what I could find, after the first image, there is no checking if the freshest image and detection have already been processed.
On Wed, 15 May 2019 at 10:16, martinploom notifications@github.com wrote:
I noticed the same behavior. Did you guys find where to "fix" it?
There are a few posts concerning this behaviour. I have looked at the code but I do not understand it well. Nevertheless I have managed to implement a 'fix' that appears to work.
I have a 1080 Ti GPU and I found that identical bounding boxes were being published around 5 times for each video frame (camera running at about 10 fps). GPU usage was about 87%. After implementing the fix, bounding boxes are published at the frame rate, from distinct frames. GPU usage dropped to 26% or less; GPU memory usage was unchanged.
I am sure that a much more elegant solution could be produced by somebody who actually understood the code. I have never published code on Github and am not sure what I need to do to release the fix for inspection by a wider audience. If anybody is interested I can send them the code.
@elpidiovaldez. Great to hear you managed to find a way of fixing this. Could you submit a pull request with your changes?
@cosama the problem is I don't understand how to do a pull request. I have never used github for anything except downloads. I just read the docs again and it seems like I have to create a branch first. I could not see how to do that. Frankly it is all double dutch unless you have some experience of collaborative software. In time I will work it out, but I don't have much spare time at the moment.
@elpidiovaldez Thanks for the reply. You would have to create a fork, clone the repo that is now on your own github page, add the fix and then commit/push it back to your own fork. Then you could create a pull request. I know that can be confusing at first.
If it is easier you can also just upload/describe the important lines you changed and we can see how to move it into this repo.
@cosama right now it is much easier for me just to give changed lines. All changes are in YoloObjectDetector.cpp.
I use a #define to control whether my code is included, so it is easy to compile with and without modifications and to compare the GPU usage and results. Add the #define near the start of YoloObjectDetector.cpp
// Check for xServer
#include <X11/Xlib.h>
#define UNIQUE_FRAMES
Now insert a piece of code at the start of 3 functions. In all cases the code simply causes the thread to terminate without doing anything if the video frame has already been processed:
void *YoloObjectDetector::detectInThread()
{
#ifdef UNIQUE_FRAMES
  static int prevSeq;
  if (prevSeq == headerBuff_[(buffIndex_ + 2) % 3].seq) {
    return 0;
  }
  prevSeq = headerBuff_[(buffIndex_ + 2) % 3].seq;
#endif
  // ... rest of function unchanged
void *YoloObjectDetector::displayInThread(void *ptr)
{
#ifdef UNIQUE_FRAMES
  static int prevSeq;
  if (prevSeq == headerBuff_[(buffIndex_ + 1) % 3].seq) {
    return 0;
  }
  prevSeq = headerBuff_[(buffIndex_ + 1) % 3].seq;
#endif
  // ... rest of function unchanged
void *YoloObjectDetector::publishInThread()
{
#ifdef UNIQUE_FRAMES
  static int prevSeq;
  if (prevSeq == headerBuff_[(buffIndex_ + 1) % 3].seq) {
    return 0;
  }
  prevSeq = headerBuff_[(buffIndex_ + 1) % 3].seq;
#endif
  // ... rest of function unchanged
That is enough to stop the same video frame being processed multiple times, but the frame rate estimate is now far too high. The following fixes that. In the function body of YoloObjectDetector::yolo() you need to insert the conditionally compiled code in the final while loop as follows (I am showing the whole loop to give context):
while (!demoDone_) {
  buffIndex_ = (buffIndex_ + 1) % 3;
  fetch_thread = std::thread(&YoloObjectDetector::fetchInThread, this);
  detect_thread = std::thread(&YoloObjectDetector::detectInThread, this);
  if (!demoPrefix_) {
#ifdef UNIQUE_FRAMES
    static int prevSeq;
    if (prevSeq != headerBuff_[buffIndex_].seq) {
      fps_ = 1. / (what_time_is_it_now() - demoTime_);
      demoTime_ = what_time_is_it_now();
      prevSeq = headerBuff_[buffIndex_].seq;
    }
#else
    fps_ = 1. / (what_time_is_it_now() - demoTime_);
    demoTime_ = what_time_is_it_now();
#endif
    if (viewImage_) {
      displayInThread(0);
    }
    publishInThread();
  } else {
    char name[256];
    sprintf(name, "%s_%08d", demoPrefix_, count);
    save_image(buff_[(buffIndex_ + 1) % 3], name);
  }
  fetch_thread.join();
  detect_thread.join();
  ++count;
  if (!isNodeRunning()) {
    demoDone_ = true;
  }
}
I worked out this fix to enable my own project to succeed. It attacks the symptom which is the same video data being processed repeatedly. I must emphasize that I have not fully understood the way that threading is being used and that it is likely that the author of the original code (or someone that understands exactly what is going on) could do something far more elegant which attacks the root cause of the problem.
Hey,
I was able to figure it out. You can find it at this branch of my fork:
https://github.com/kjbilton/darknet_ros/tree/fix/pubrate
In the original code, in the while loop of the yolo method, detectInThread is called on its own thread, and detect_thread.join isn't called until after publishInThread is called. This means that publishInThread can be called before new detections are made in detectInThread, so detections made in a previous iteration are published, which results in the behavior seen above.
One way to fix this (as done in the branch I linked above) is to call fetch_thread.join and detect_thread.join before calling publishInThread. Doing this seems to fix things for me.
Note that the branch I linked above is a branch off of another fix I made, which I'm waiting to get pulled in, so you may also want to take a look at that.
@kjbilton Hi, I like your insight into what is going on in the threading, and your code modification is much more elegant than the hack I came up with. Nonetheless I just did a performance comparison between the two implementations using my live camera feed (at about 10 fps) and found this:
Yours: GPU Usage=73%, Displayed Yolo FPS=37, Camera FPS=9.89, rostopic hz boundingboxes=37.5
Mine: GPU Usage=27%, Displayed Yolo FPS=10, Camera FPS=10.0, rostopic hz boundingboxes=9.5
I ran the system with your code and looked at the sequence numbers in the bounding boxes:
rostopic echo /darknet_ros/bounding_boxes | grep seq
I got:
seq: 3053
seq: 858
seq: 3054
seq: 858
seq: 3055
seq: 858
seq: 3056
seq: 858
seq: 3057
seq: 859
seq: 3058
seq: 859
seq: 3059
seq: 859
seq: 3060
seq: 859
seq: 3061
seq: 860
seq: 3062
seq: 860
seq: 3063
seq: 860
seq: 3064
seq: 860
seq: 3065
seq: 861
seq: 3066
seq: 861
seq: 3067
seq: 861
seq: 3068
seq: 861
The sequence numbers in the 3000s are from the message header and all are unique. The sequence numbers in the 800s are from the image_header and show 4 repetitions from the same frame.
My conclusion is that your fix is not preventing video frames from being processed multiple times. I may be missing something. Please could you try the rostopic line to see if you get the same result as me.
@fiorano10 many thanks for doing the pull request. Did you try either of the suggested fixes? How did they work out for you?
Hey @elpidiovaldez,
Thanks for trying that out. While I think what I said above still holds, I was mistaken in thinking it addressed the issue. But, I've had some more insight into the problem which agrees with your solution. Here's what's going on:
1. Images arrive and are stored in camImageCopy_ in the callback.
2. In yolo(), fetchInThread fetches camImageCopy_ and puts it into the circular buffer buff_.
3. detectInThread performs detection on the image one index ahead of the new image put into the buffer.
As you point out, the image is being processed multiple times. This means that camImageCopy_ is being fetched multiple times and stored in multiple indices of buff_. This can happen when the loop in yolo() containing fetchInThread and detectInThread iterates faster than images are coming in. In my case, I have multiple cameras with an effective input rate of >100 Hz, so my previous suggestion had me thinking I fixed it. :)
I think one elegant way around this is to push newly-acquired images in the image callback onto a queue data structure and dequeue them after an image has been processed. If the queue is empty, you can skip the relevant parts of the code (similar to what you've done). I suggest a queue since it would require replacing only a single variable (camImageCopy_) without introducing additional flags to keep track of. But note that if the input rate is much greater than the throughput of the detector (as in my current case), the queue can keep growing, so a size limit would be needed (e.g., a max size of 1).
@elpidiovaldez @kjbilton Did you guys notice high CPU usage as well? After applying the fix the FPS is down to the camera frame rate but my CPU usage for each node is > 100% while the GPU usage is ~50%. Any ideas on how to fix this?
@kjbilton @fiorano Hi. I thought that @kjbilton might not have seen multiple frames due to having a GPU that is slow relative to the camera frame rate. I have the reverse case: a slow camera and a storming GPU. He seems to have a real insight into the code now.
I like the proposed scheme except for one thing. It appears to ensure that all frames are processed by queuing them. I believe some real-time systems might prefer to skip old frames, and always use the most recent image. I am guessing that a small circular buffer is used for just this reason, so that old frames are overwritten.
I am really sorry but I cannot really contribute at the moment. In 5 hours I am off to the back of beyond for the Summer holidays, where I hardly even have internet access.
@vivkr I finally got round to trying your code. Sorry to take so long - I was on holiday all summer without ANY internet connection.
I downloaded your fork and can confirm that it seems to be about as efficient in GPU usage as my fix (around 27% according to nvidia_smi). Frame rate is published correctly as 10 fps (which is the true rate that frames arrive from my camera).
Since your fix seems to be much simpler than mine (just 1 line as far as I can see) we should definitely go with that.
However I did experience one problem with your fork. The video displayed by darknet did not show the marked-up bounding boxes. This seems to be because an extra parameter, 'ms' has recently been added to the function show_image in the darknet code.
I have a simple fix here. Please let me know if this fixes your issue @elpidiovaldez and a pull request can be made.
If you're still having problem with CPU usage, you could try and use my fork which uses the latest darknet version to see if that helps.
Thank you for your work on this. Your changes are very good, but darknet_ros was updated about two weeks ago, adding the counting function, so the yolo code now differs from yours; the fix may no longer be a one-line change, or perhaps I have misunderstood. Could you update your code to the latest version?
Any update on this issue? I've tried making the change to insert an if statement checking the header sequence, before publishing, on the latest version but to no effect.
I think a better approach would be to use a thread-signaling mechanism (a semaphore or something similar) rather than returning early in the loop as proposed in https://github.com/leggedrobotics/darknet_ros/pull/169, since that only resolves the wasted GPU usage and the multiple processing of the same image. It would still leave the CPU thread running without yielding, causing high CPU usage.
@vivkr's solution applied to the latest code. As pointed out, bounding boxes no longer appear on the image visually, but I am seeing them in the bounding_boxes ROS topic.
line 488 in YoloObjectDetector.cpp
// Fix0: keep track of the last std_msgs::Header seq processed (consecutively increasing)
int prevSeq_ = 0;
while (!demoDone_) {
  buffIndex_ = (buffIndex_ + 1) % 3;
  fetch_thread = std::thread(&YoloObjectDetector::fetchInThread, this);
  // Fix1: check this isn't an image already seen
  if (prevSeq_ != headerBuff_[buffIndex_].seq) {
    // Fix2: only detect if this is an image we haven't seen before
    detect_thread = std::thread(&YoloObjectDetector::detectInThread, this);
    if (!demoPrefix_) {
      fps_ = 1. / (what_time_is_it_now() - demoTime_);
      demoTime_ = what_time_is_it_now();
      if (viewImage_) {
        displayInThread(0);
      } else {
        generate_image(buff_[(buffIndex_ + 1) % 3], ipl_);
      }
      publishInThread();
    } else {
      char name[256];
      sprintf(name, "%s_%08d", demoPrefix_, count);
      save_image(buff_[(buffIndex_ + 1) % 3], name);
    }
    // Fix3: record the new sequence number to avoid detecting the same image more than once
    prevSeq_ = headerBuff_[buffIndex_].seq;
    fetch_thread.join();
    detect_thread.join();
    ++count;
  } else {
    // Fix4: no detection made, so let the fetch thread complete so that it can be destroyed safely
    fetch_thread.join();
  }
  if (!isNodeRunning()) {
    demoDone_ = true;
  }
}
@mbjelonic do maintainers see a better way to avoid running detection more than once on a single image? By blocking repetitive detections on the same image, it helps reduce GPU utilization significantly.
Example: with the above fix, my NVIDIA Quadro RTX 3000 went from running this at 100+ FPS down to the frame rate of my camera (configured to just 1 FPS!). I can now use my GPU for other things!
I have observed that relatively often the same BoundingBoxes message is published multiple times. I input images at roughly 15 Hz and get bounding boxes out at about 20 Hz, so it should be easy to check this behavior just with rostopic hz /camera/rgb/image_raw and rostopic hz /darknet_ros/bounding_boxes. I don't really understand why, but I think it has to do with the threading in YoloObjectDetector.cpp, in void YoloObjectDetector::yolo(). void *YoloObjectDetector::publishInThread() seems fine. The information in the bounding boxes is identical, except for the Header: