ros-navigation / navigation2_dynamic

Navigation2's dynamic obstacle detection, tracking, and processing pipelines.

[Improvement] Re-write KF Hungarian Node on C++ to avoid performance pitfalls #30

Open AlexeyMerzlyakov opened 2 years ago

AlexeyMerzlyakov commented 2 years ago

While writing a simple laser-scanner obstacle detector to feed the KF Hungarian node, I experienced very slow performance of the main callback in kf_hungarian_node.py: around ~1.3 seconds per callback call. A laser scanner typically produces 360 points (treated here as obstacles), which makes the following part of the code, with its O(360*360) complexity, the performance bottleneck:

        # Build the O(num_of_obstacle * num_of_detect) pairwise cost matrix
        # between tracked obstacles and new detections
        cost = np.zeros((num_of_obstacle, num_of_detect))
        for i in range(num_of_obstacle):
            for j in range(num_of_detect):
                cost[i, j] = self.obstacle_list[i].distance(detections[j])

This part executes in ~1 second on average on my PC. Simplifying the np.linalg.norm() heuristic to abs(delta_x) + abs(delta_y) + abs(delta_z) reduces the execution time of the code above to ~0.3 seconds, which is better but still not enough for real-time processing of laser scans arriving at e.g. 5-50 Hz.

A simple experiment that repeats the hot part of the code above through both the Python and C++ interfaces is attached to this ticket as the hotcode_check.zip archive. The measurements show a ~25x speed increase when this hot part of the code is switched to C++ (a sketch of the benchmarked loop follows the output below):

    [leha@leha-PC hotcode_check]$ ./a.out
    dt = 0.0115969
    [leha@leha-PC hotcode_check]$ ./a.out
    dt = 0.0117751
    [leha@leha-PC hotcode_check]$ ./a.out
    dt = 0.0108645
    [leha@leha-PC hotcode_check]$ python3 ./check.py
    dt =  0.28881096839904785
    [leha@leha-PC hotcode_check]$ python3 ./check.py
    dt =  0.29091525077819824
    [leha@leha-PC hotcode_check]$ python3 ./check.py
    dt =  0.28969860076904297
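
Since the contents of hotcode_check.zip are not inlined above, here is a minimal sketch of what the benchmarked C++ hot loop could look like. The `Point` struct, the random test coordinates, and the fixed N = 360 are illustrative assumptions, not the exact code from the archive:

```cpp
#include <chrono>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Hypothetical stand-in for a tracked obstacle / detection position
struct Point
{
  double x, y, z;
};

int main()
{
  constexpr int N = 360;  // one "obstacle" per laser scan beam, as in the test above

  std::mt19937 gen(42);
  std::uniform_real_distribution<double> coord(-10.0, 10.0);
  std::vector<Point> obstacles(N), detections(N);
  for (auto & p : obstacles) {p = {coord(gen), coord(gen), coord(gen)};}
  for (auto & p : detections) {p = {coord(gen), coord(gen), coord(gen)};}

  // The hot part: O(N*N) pairwise Euclidean cost matrix, as in the Python node
  std::vector<double> cost(static_cast<size_t>(N) * N);
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      const double dx = obstacles[i].x - detections[j].x;
      const double dy = obstacles[i].y - detections[j].y;
      const double dz = obstacles[i].z - detections[j].z;
      // The cheaper Manhattan heuristic mentioned above would be
      // std::abs(dx) + std::abs(dy) + std::abs(dz) instead of the sqrt
      cost[i * N + j] = std::sqrt(dx * dx + dy * dy + dz * dz);
    }
  }
  const auto t1 = std::chrono::steady_clock::now();

  std::cout << "dt = " << std::chrono::duration<double>(t1 - t0).count() << std::endl;
  return 0;
}
```

Compiled with optimizations (e.g. g++ -O2), this is roughly the kind of loop behind the ~0.01 s timings above.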

Therefore, I am proposing (and am ready to try) to re-write the KF Hungarian node in C++, which is expected to make the callbacks operate much faster.

The following table shows the mapping of the base methods for the Python -> C++ migration of the KF node code (a rough sketch of how the C++ side could be wired together follows the table):

| Python | C++ | Comment |
| --- | --- | --- |
| `do_transform_point()` | `doTransform()` | TF2 standard function |
| `do_transform_vector3()` | `doTransform()` | TF2 standard function |
| `scipy.optimize.linear_sum_assignment()` | ??? | Could be re-written from scratch, use an already known Hungarian algorithm implementation, or OR-Tools from Google |
| `cv2.KalmanFilter.predict()` | `const Mat & cv::KalmanFilter::predict(const Mat & control = Mat())` | OpenCV Kalman predict |
| `cv2.KalmanFilter.correct()` | `const Mat & cv::KalmanFilter::correct(const Mat & measurement)` | OpenCV Kalman update |
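
To make the mapping above more concrete, here is a rough sketch (not the final node) of how the C++ counterparts could be wired together for a single matched track. The function names, the constant-velocity state layout, the noise covariances, and the `dt` value are assumptions for illustration; the detection-to-track assignment step is left out, since its C++ replacement is the open question in the table. The tf2_geometry_msgs header name assumes a recent ROS 2 distribution:

```cpp
#include <geometry_msgs/msg/point_stamped.hpp>
#include <geometry_msgs/msg/transform_stamped.hpp>
#include <opencv2/video/tracking.hpp>
#include <tf2_geometry_msgs/tf2_geometry_msgs.hpp>

// Constant-velocity Kalman filter for one tracked obstacle:
// state = (x, y, z, vx, vy, vz), measurement = (x, y, z)
cv::KalmanFilter makeObstacleFilter(float dt)
{
  cv::KalmanFilter kf(6, 3, 0, CV_32F);
  cv::setIdentity(kf.transitionMatrix);
  kf.transitionMatrix.at<float>(0, 3) = dt;
  kf.transitionMatrix.at<float>(1, 4) = dt;
  kf.transitionMatrix.at<float>(2, 5) = dt;
  kf.measurementMatrix = cv::Mat::zeros(3, 6, CV_32F);
  kf.measurementMatrix.at<float>(0, 0) = 1.0f;
  kf.measurementMatrix.at<float>(1, 1) = 1.0f;
  kf.measurementMatrix.at<float>(2, 2) = 1.0f;
  cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-2));   // assumed tuning
  cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));
  return kf;
}

// One predict/correct cycle for a matched (track, detection) pair
void updateMatchedTrack(
  cv::KalmanFilter & kf,
  const geometry_msgs::msg::PointStamped & detection_in_sensor_frame,
  const geometry_msgs::msg::TransformStamped & sensor_to_tracking_frame)
{
  // do_transform_point() -> tf2::doTransform(): move the detection into the tracking frame
  geometry_msgs::msg::PointStamped detection;
  tf2::doTransform(detection_in_sensor_frame, detection, sensor_to_tracking_frame);

  // cv2.KalmanFilter.predict() -> cv::KalmanFilter::predict()
  kf.predict();

  // cv2.KalmanFilter.correct() -> cv::KalmanFilter::correct()
  cv::Mat measurement = (cv::Mat_<float>(3, 1) <<
    static_cast<float>(detection.point.x),
    static_cast<float>(detection.point.y),
    static_cast<float>(detection.point.z));
  kf.correct(measurement);
}
```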

What do you think about such an effort?

SteveMacenski commented 2 years ago

These are discrete objects / detections (e.g. lines, human models, etc) not laser scans, so there shouldn't be that many, but you're entirely right that this should be moved to C++ to get performance improvements since this algorithm is naturally going to be pretty slow as it is. There could still be dozens of obstacles being tracked in a given scene.

I think it's worth us discussing internally whether this is still a sensible thing to work on this year, given we're heading into November 1 with about ~7 weeks of useful time remaining in 2022 before the holidays (during which I assume we're getting very little done). I would say, though, that converting this node to C++ would be a nice, discrete, and manageable project that could set us up well for when we revisit this to only need to worry about the detectors. Though it could be that we have other things we can work on instead that are more in line with what we've been doing so far in 2022, so we can start fresh in 2023.

vinnnyr commented 1 year ago

To be honest, I just tried this with some data we had, and it wasn't horribly non-performant, at least on my developer laptop. It took about ~10% of a core. I was expecting something drastically worse.

Especially since:

@AlexeyMerzlyakov how many "obstacles" did you have to make it take ~1 second per callback, if I might ask?

AlexeyMerzlyakov commented 1 year ago

Sure. For simple testing I did not use any object detection / ML front-end. Instead, I just treated every laser scan point as an obstacle, so the number of obstacles was around 340-360 per frame. Yes, this is far from the situation where real obstacles are detected by some detection front-end (connected to neural networks or something else), but isn't it a reasonable case to start from anyway (even in rooms with lots of objects the number might be similar, I believe)? In this case, the KF Hungarian algorithm will match the nearest points as the same obstacle and then do the dynamic detection through the Kalman filter.

> It took about ~10% of a core.

This is not about CPU usage. Python, as an interpreted language, runs slowly even without consuming 100% of a CPU core. So, in this ticket we are talking about the processing time of each frame rather than CPU/RAM load. For this number of obstacles we have a synthetic case taking more than 1 second of computation time per frame. However, it is expected that even with 30-50 moving obstacles (in some offices/rooms with many objects) the computation time may still be large, slowing real-time processing and becoming a bottleneck for this setup.

@vinnnyr, may I ask you to share which obstacle detectors you are using, how many obstacles (on average) you are operating with, and what the processing time per frame was on your system, to understand the extent of the problem in more realistic applications (compared to what I've been using)?

SteveMacenski commented 1 year ago

> even in rooms with lots of objects the number might be similar, I believe

Certainly, but that is also why we set which classes of interest we track, so that we only track moving things.

So I think this makes a lot of sense to have in C++. I've seen some forks where people are doing development based on this repo recently, and I think it would be a nice capability to offer even if we haven't formally finished the detectors yet. So I think it's worth spending some time this year on a C++ version of this node in particular - for all the reasons @AlexeyMerzlyakov mentions, and because it is getting some traction. This will be the foundation we build our dynamic pipeline empire from!

vinnnyr commented 1 year ago

> @vinnnyr, may I ask you to share which obstacle detectors you are using, how many obstacles (on average) you are operating with, and what the processing time per frame was on your system, to understand the extent of the problem in more realistic applications (compared to what I've been using)?

So the obstacle detector I am using is a Euclidean Cluster off of a lidar point cloud (with a bunch of preprocessing).

I think in a given frame we only expect to have ~10 objects max. I haven't measured latency yet (this was just an experiment, not something we are actually running yet); the latency seemed acceptable, but obviously more scrutiny is needed on my end.

SteveMacenski commented 4 months ago

Eval: https://github.com/autowarefoundation/autoware.universe/tree/b038006967e336f7415bc3f9bafc0439140fb971/perception/multi_object_tracker