open-rmf / rmf

Root repository for the RMF software
Apache License 2.0

When conducting stability testing on "rmf", "fleet_adapter" crashed abnormally #359

Closed liugehaizaixue closed 11 months ago

liugehaizaixue commented 1 year ago

As shown in the figures, I modified the airport_terminal map so that it retains only one lane. On the waypoints of this lane I set up 5 charging points, one for each of 5 robots. I wrote a Python script that sends 3-4 'gotoplace' tasks every 600-800 s. At this rate the robots can almost always complete their tasks before the next batch is sent. But after running for a while, 'fleet_adapter' always crashes and exits abnormally. (My robots' speed is 0.5 m/s. My "tinyrobot_config.yaml" and "airport_terminal.building.yaml" can be seen here.)

[Images: airport_1, airport_2, airport_3]

After running the program for a period of time (approximately six or seven hours), the error message is as follows:

[Image: error_info]

May I ask what are the possible reasons for the abnormal crash of 'fleet_adapter'?

My guess is as follows:

  1. There is a memory leak. However, according to the "top" output I recorded, although the memory usage of 'fleet_adapter' increased before the crash, it stayed within a manageable range. I also noticed that code was added to reclaim memory every 5 minutes. CPU usage only spiked during traffic conflicts and was normal before the 'fleet_adapter' crash. [Images: start_time (at the beginning), conflict_time (during a traffic conflict), end_time (just before the crash)]

  2. The map setting is unreasonable. As shown in the figure, I am not sure if the airport map can accommodate 5 robots under the same fleet. (According to previous test results, the more robots there are, the faster the "fleet_adapter" crashes.)

  3. Task frequency. I believe that sending 3-4 tasks to 5 robots in 600-800s, with an average of less than one task per robot, should be an acceptable frequency range.

  4. Traffic planning. I don't know whether traffic planning can cause the crash. Possibly the difference between the robot's actual speed and the speed in the config file makes the traffic plan drift away from the actual situation after the robot has been running for a while, and this eventually crashes 'fleet_adapter'. However, there seems to be an update mechanism that resynchronizes the two. (In an ideal situation, should the yellow dot coincide exactly with the purple dot? During traffic conflicts the yellow dot sometimes differs significantly from the purple dot, and then the yellow dot synchronizes with the purple dot.) I am not familiar with this part, so this is only a guess.

The above are all my conjectures about the cause of the abnormal crash of 'fleet_adapter'. I am not sure whether others have encountered similar problems.

mxgrey commented 1 year ago

The map setting is unreasonable. As shown in the figure, I am not sure if the airport map can accommodate 5 robots under the same fleet. (According to previous test results, the more robots there are, the faster the "fleet_adapter" crashes.)

This is likely to be the problem if the crash was caused by an out-of-memory error. How often are you sampling the memory usage of the fleet adapter? I've seen cases where the memory can spike to an overwhelming size within one second.

The memory spike issue is a known problem that can occur when several robots need to get through too narrow of a corridor in different directions at the same time. The combinatorial planning effort to solve that situation can cause the memory consumption of the fleet adapter to rapidly grow to an unreasonable size.

I suspect what's happening in your endurance test is that the robots are eventually given tasks that lump them all around the same corridor and force them to negotiate their ways through it. We've never put much effort into ensuring that the airport demo graph can handle a lot of activity. There are some obvious fixes that can be made, like adding parallel lanes everywhere.

As for preventing the crashes, in our ongoing work I'm making sure there are ways to control how much memory the planning and negotiation is allowed to occupy. We're combining that with a more efficient planning algorithm called Safe Interval Path Planning that can explore all of a mobile robot's affordances with an exponentially smaller bound on the planner's memory footprint.

In the meantime when this kind of problem happens I would take it as an indication that the nav graph needs to be tweaked, especially around choke points.
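
To illustrate the idea behind the Safe Interval Path Planning mentioned above: rather than reasoning about every timestep at a waypoint, that style of planner groups time into contiguous intervals during which the waypoint is free. The following is only a minimal sketch of computing such intervals for a single waypoint. It is not RMF's planner, and the names `Interval` and `safe_intervals` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// A half-open time interval [first, second) in seconds.
using Interval = std::pair<double, double>;

// Hypothetical helper, not part of RMF: return the gaps inside
// [horizon_start, horizon_end) that are not covered by any interval in
// `occupied`. Each gap is a "safe interval" for one waypoint.
std::vector<Interval> safe_intervals(
  std::vector<Interval> occupied,
  double horizon_start,
  double horizon_end)
{
  assert(horizon_start <= horizon_end);

  // Sort by start time so that overlapping intervals can be merged in one pass.
  std::sort(occupied.begin(), occupied.end());

  std::vector<Interval> safe;
  double cursor = horizon_start;  // earliest time not yet classified
  for (const Interval& busy : occupied)
  {
    if (busy.second <= cursor)
      continue;  // already covered by an earlier occupied interval
    if (busy.first >= horizon_end)
      break;     // this and all later intervals start after the horizon
    if (busy.first > cursor)
      safe.emplace_back(cursor, busy.first);
    cursor = std::max(cursor, busy.second);
  }

  if (cursor < horizon_end)
    safe.emplace_back(cursor, horizon_end);

  return safe;
}
```

For example, with occupied intervals [10, 20), [15, 30) and [50, 60) over a horizon of [0, 100), the waypoint has three safe intervals: [0, 10), [30, 50) and [60, 100). A planner can then search over (waypoint, safe interval) pairs instead of (waypoint, timestep) pairs.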

liugehaizaixue commented 1 year ago

Thank you for your reply.

I currently sample memory usage once every 30 seconds. Based on your analysis, I will increase the sampling frequency and modify the nav graph appropriately to verify whether this is the cause of the abnormal crash of 'fleet_adapter'.
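
Since a spike can come and go within one second, periodic sampling may miss it. On Linux, /proc/<pid>/status reports VmHWM, the peak resident set size of the process so far, so reading it catches a spike that happened between two samples. The following is a minimal sketch, assuming a Linux system. The helper name `read_status_kb` is hypothetical and is not part of RMF.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: return the value in kB of a field such as "VmHWM"
// (peak resident set size) or "VmRSS" (current resident set size) from text
// in the format of /proc/<pid>/status. Returns -1 if the field is missing
// or its value cannot be parsed.
long read_status_kb(std::istream& status, const std::string& field)
{
  assert(!field.empty());

  const std::string prefix = field + ":";
  std::string line;
  while (std::getline(status, line))
  {
    // Skip every line that does not start with "<field>:".
    if (line.compare(0, prefix.size(), prefix) != 0)
      continue;

    // The rest of the line looks like "\t  204800 kB".
    std::istringstream rest(line.substr(prefix.size()));
    long kb = -1;
    rest >> kb;
    return rest.fail() ? -1 : kb;
  }

  return -1;
}
```

To use it on a live process, open a `std::ifstream` on "/proc/<pid>/status" for the process ID of the fleet adapter and pass the stream to `read_status_kb` with "VmHWM". Because the kernel tracks the peak itself, reading it every few seconds is enough.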

mxgrey commented 1 year ago

If you do fix the airport demo's nav graph so that the traffic flows better, it would be an awesome contribution to open a PR for the changes in rmf_demos.

liugehaizaixue commented 1 year ago

I have changed my memory sampling interval to 0.1 s, but I still haven't seen any memory anomalies before 'fleet_adapter' crashes.

In recent testing, I found a crash of 'fleet_adapter' that seems to occur during the task allocation phase and to be related to battery level (I am not sure; it may also occur during the initial execution phase of the task).

I set up four simulated robots on a small map, with sufficient battery charge and with battery drain disabled. When I then issued multiple "gotoplace" tasks at once, they behaved normally (path planning -> conflict -> avoidance).

However, when I set the robots' battery level too low and enabled battery drain, and then submitted four tasks, the tasks began to be auctioned and assigned one by one, and during this process 'fleet_adapter' quickly crashed.

mxgrey commented 1 year ago

I see, thanks for testing this further. I wonder if there's a bug related to situations where tasks are impossible for the robots due to battery limitations. Such tasks should be rejected during the bidding process, but maybe there's an assumption failing somewhere in the pipeline.

I'd suggest using GDB as described here to get a backtrace of the crash and see what specific line of code is segfaulting.

Alternatively if you can provide instructions on how to recreate the crash using one of the demo worlds, I can get the backtrace with gdb myself.

liugehaizaixue commented 1 year ago

Thank you for the suggestion. After debugging, I found that my robot's battery level often drops below 0.0 here, causing the program to crash. But I still don't know why the calculated value here is less than 0.0.

mxgrey commented 1 year ago

Thanks for tracking that down.

That's very perplexing since this line should catch when your battery has dropped below 0.0.

Are you able to check the value that's being used for battery_threshold? I suppose if it's a negative number or some uninitialized memory then that would explain this leakage. Maybe this line should be

const auto battery_threshold = std::max(constraints.threshold_soc(), 0.0);

liugehaizaixue commented 1 year ago

Thank you for your analysis, but it doesn't seem to be caused by battery_threshold. The error seems to occur in this line, because an exception is thrown here.

mxgrey commented 1 year ago

Ah, got it, we just need to test the threshold before we ever attempt to set the battery level in the state object. That will be an easy fix.
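
The ordering described above can be sketched as follows. This is not the rmf_task code or the change made in the PR. `ExampleState` and `try_apply_drain` are hypothetical stand-ins, assuming a state object whose setter throws when given an out-of-range battery level. The point is only that the remaining charge is compared against the threshold before the setter is ever called.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a robot state object. This is NOT the rmf_task
// class; it only mimics a setter that rejects out-of-range battery levels.
class ExampleState
{
public:
  void set_battery_soc(double soc)
  {
    if (soc < 0.0 || soc > 1.0)
      throw std::invalid_argument("battery soc must be within [0.0, 1.0]");
    _soc = soc;
  }

  double battery_soc() const
  {
    return _soc;
  }

private:
  double _soc = 1.0;
};

// Hypothetical helper: apply a battery drain only if the remaining charge
// stays above the threshold. Returns false and leaves `state` untouched
// otherwise, so the throwing setter never sees a negative value.
bool try_apply_drain(
  ExampleState& state,
  double drain_soc,
  double threshold_soc)
{
  assert(drain_soc >= 0.0);

  // Clamp the threshold as suggested earlier in the thread, so that a
  // negative threshold cannot let a negative charge through.
  const double threshold = std::max(threshold_soc, 0.0);
  const double remaining = state.battery_soc() - drain_soc;

  // Test the threshold BEFORE attempting to set the battery level.
  if (remaining <= threshold)
    return false;

  state.set_battery_soc(remaining);
  return true;
}
```

A caller would treat a `false` result as "this task is infeasible for this robot" and reject it during bidding instead of letting an exception propagate. Whether the comparison should be `<=` or `<` is a design choice in this sketch, not something taken from the library.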

Yadunund commented 12 months ago

@liugehaizaixue could you check out the branch from this PR in rmf_task and test whether the problem is fixed? https://github.com/open-rmf/rmf_task/pull/94

liugehaizaixue commented 11 months ago

I have tested the changes made and can confirm that they have successfully resolved the problem I encountered. Thank you for solving this problem.

mxgrey commented 11 months ago

I'll close this issue now since #94 is merged. Thanks for the report and helping us to debug the problem!