Closed liugehaizaixue closed 11 months ago
The map setting is unreasonable. As shown in the figure, I am not sure if the airport map can accommodate 5 robots under the same fleet. (According to previous test results, the more robots there are, the faster the "fleet_adapter" crashes.)
This is likely to be the problem if the crash was caused by an out-of-memory error. How often are you sampling the memory usage of the fleet adapter? I've seen cases where the memory can spike to an overwhelming size within one second.
The memory spike issue is a known problem that can occur when several robots need to get through too narrow of a corridor in different directions at the same time. The combinatorial planning effort to solve that situation can cause the memory consumption of the fleet adapter to rapidly grow to an unreasonable size.
I suspect what's happening in your endurance test is that the robots are eventually given tasks that lump them all around the same corridor and force them to negotiate their ways through it. We've never put much effort into ensuring that the airport demo graph can handle a lot of activity. There are some obvious fixes that can be made, like adding parallel lanes everywhere.
As for preventing the crashes, in our ongoing work I'm making sure there are ways to control how much memory the planning and negotiation is allowed to occupy. We're combining that with a more efficient planning algorithm called Safe Interval Path Planning that can explore all of a mobile robot's affordances with an exponentially smaller bound on the planner's memory footprint.
In the meantime when this kind of problem happens I would take it as an indication that the nav graph needs to be tweaked, especially around choke points.
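Since a spike can build up within one second, the memory has to be sampled at sub-second intervals to have a chance of catching it. Below is a minimal sketch of such a sampler, assuming a Linux system where `/proc/<pid>/status` exposes a `VmRSS:` line in kB. The function names `parse_vm_rss_kb` and `peak_rss_kb` are made up for illustration and are not part of RMF.

```cpp
#include <algorithm>
#include <chrono>
#include <fstream>
#include <istream>
#include <optional>
#include <sstream>
#include <string>
#include <thread>

// Parse the "VmRSS:" line from the contents of /proc/<pid>/status.
// Returns the resident set size in kB, or nullopt if the line is missing
// or malformed (e.g. the process has already exited).
std::optional<long> parse_vm_rss_kb(std::istream& status)
{
  const std::string key = "VmRSS:";
  std::string line;
  while (std::getline(status, line))
  {
    if (line.rfind(key, 0) == 0)  // line starts with the key
    {
      std::istringstream fields(line.substr(key.size()));
      long kb = 0;
      if (fields >> kb)
        return kb;
      return std::nullopt;
    }
  }
  return std::nullopt;
}

// Hypothetical helper: sample the RSS of `pid` once per `period`,
// `samples` times, and return the largest value seen in kB
// (0 if nothing could be read).
long peak_rss_kb(int pid, std::chrono::milliseconds period, int samples)
{
  long peak = 0;
  for (int i = 0; i < samples; ++i)
  {
    std::ifstream status("/proc/" + std::to_string(pid) + "/status");
    if (const auto kb = parse_vm_rss_kb(status))
      peak = std::max(peak, *kb);
    std::this_thread::sleep_for(period);
  }
  return peak;
}
```

For example, `peak_rss_kb(pid, std::chrono::milliseconds(100), 600)` would watch a process for roughly one minute at 0.1 s intervals. The same status file also has a `VmHWM:` line, which the kernel maintains as the peak resident set size, so reading it shortly before a crash may reveal a spike that fell between samples.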
Thank you for your reply.
I was sampling the memory usage once every 30 seconds. Based on your analysis, I will increase the sampling frequency and modify the nav graph appropriately to verify whether this is the cause of the abnormal crash of "fleet_adapter".
If you do fix the airport demo's nav graph so that the traffic flows better, it would be an awesome contribution to open a PR for the changes in rmf_demos.
I have changed my memory sampling interval to 0.1s, but I still haven't seen any memory anomalies before 'fleet_adapter' crashes.
In recent testing, I found an abnormal crash of 'fleet_adapter' that seems to occur during the task allocation phase and to be related to battery power (I am not sure; it may also happen during the initial execution phase of the task).
I set up four simulated robots on a small map, with sufficient battery charge and with battery consumption disabled. When I then immediately issued multiple "gotoplace" tasks, they behaved normally (path planning -> conflict generation -> avoidance).
However, when I set the robots' battery level to be insufficient and enabled battery consumption, and then submitted four tasks, the tasks began to be auctioned and assigned one by one, and during this process 'fleet_adapter' quickly crashed.
I see, thanks for testing this further. I wonder if there's a bug related to situations where tasks are impossible for the robots due to battery limitations. Such tasks should be rejected during the bidding process, but maybe there's an assumption failing somewhere in the pipeline.
I'd suggest using GDB as described here to get a backtrace of the crash and see what specific line of code is segfaulting.
Alternatively if you can provide instructions on how to recreate the crash using one of the demo worlds, I can get the backtrace with gdb myself.
Thank you for the suggestion. After debugging, I found that my robot's battery level often drops below 0.0 here, causing the program to crash. But I still don't know why the calculated value here is less than 0.0.
Thanks for tracking that down.
That's very perplexing since this line should catch when your battery has dropped below 0.0.
Are you able to check the value that's being used for `battery_threshold`? I suppose if it's a negative number or some uninitialized memory then that would explain this leakage. Maybe this line should be:
const auto battery_threshold = std::max(constraints.threshold_soc(), 0.0);
Thank you for your analysis, but it doesn't seem to be caused by `battery_threshold`. The error seems to have occurred in this line, because an exception is thrown here.
Ah, got it, we just need to test the threshold before we ever attempt to set the battery level in the state object. That will be an easy fix.
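To illustrate that ordering, here is a minimal sketch, assuming a simplified `State` class whose setter throws on out-of-range values. `State`, `estimate_finish`, and their parameters are hypothetical stand-ins for illustration, not the actual rmf_task API or the change made in the PR.

```cpp
#include <algorithm>
#include <optional>
#include <stdexcept>

// Hypothetical stand-in for a robot state whose setter rejects
// battery levels outside [0.0, 1.0] by throwing.
class State
{
public:
  void battery_soc(double soc)
  {
    if (soc < 0.0 || soc > 1.0)
      throw std::invalid_argument("battery soc must be within [0.0, 1.0]");
    _soc = soc;
  }

  double battery_soc() const { return _soc; }

private:
  double _soc = 1.0;
};

// Estimate the state after a task that drains `drain` of the battery
// (assumed non-negative). The remaining charge is compared against the
// threshold BEFORE it is written into the state, so an infeasible task
// yields nullopt instead of triggering the setter's exception.
std::optional<State> estimate_finish(
  State start, double drain, double threshold_soc)
{
  // Guard against a negative threshold as well.
  const double threshold = std::max(threshold_soc, 0.0);
  const double remaining = start.battery_soc() - drain;
  if (remaining <= threshold)
    return std::nullopt;

  start.battery_soc(remaining);
  return start;
}
```

With this ordering, a task that would drain the battery below the threshold is simply reported as infeasible, which lets the bidding step reject it rather than crash.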
@liugehaizaixue could you check out `rmf_task` to the branch in this PR and test if the problem is fixed?
https://github.com/open-rmf/rmf_task/pull/94
I have tested the changes made and can confirm that they have successfully resolved the problem I encountered. Thank you for solving this problem.
I'll close this issue now since #94 is merged. Thanks for the report and helping us to debug the problem!
As shown in the figure, I have modified the airport_terminal map so that it retains only one lane. On the waypoints of this lane, I have set up 5 charging points, corresponding to 5 robots. I wrote a Python script that sends 3-4 'gotoplace' tasks every 600-800s. This frequency almost guarantees that the robots will complete their tasks within the time frame. But after running for a period of time, 'fleet_adapter' always crashes and exits abnormally. (My robot's speed is 0.5m/s. My "tinyrobot_config.yaml" and "airport_terminal.building.yaml" can be seen here.)
![airport_3](https://github.com/open-rmf/rmf/assets/103879998/7ba5da46-d650-4962-8684-40a04f4b2bbc)
After running the program for a period of time (approximately six or seven hours), the error message is as follows:
May I ask what are the possible reasons for the abnormal crash of 'fleet_adapter'?
My guess is as follows:
There is a memory leak. However, according to the `top` output I recorded, although the memory usage increased before the crash of 'fleet_adapter', it was still within a controllable range. I also noticed that code was added to reclaim memory every 5 minutes. The CPU usage only skyrocketed during traffic conflicts, and it was normal before the "fleet_adapter" crash.
At the beginning:
When there is a traffic conflict:
Before the abnormal crash:
![end_time](https://github.com/open-rmf/rmf/assets/103879998/74ce21c8-5014-4aa8-9b4d-8d666f15f7db)
The map setting is unreasonable. As shown in the figure, I am not sure if the airport map can accommodate 5 robots under the same fleet. (According to previous test results, the more robots there are, the faster the "fleet_adapter" crashes.)
Task frequency. I believe that sending 3-4 tasks to 5 robots every 600-800s, which averages less than one task per robot, should be an acceptable frequency.
Traffic planning. I don't know whether traffic planning can cause abnormalities. It may be that the difference between the actual speed of the robot and the speed in the config file makes the traffic plan diverge from the actual situation after the robot has been running for a period of time, which then causes 'fleet_adapter' to crash abnormally. However, it seems that there is an update mechanism that synchronizes them. (In an ideal situation, should the yellow dot completely coincide with the purple dot? In traffic conflicts, sometimes there is a significant difference between the yellow dot and the purple dot, and then the yellow dot synchronizes with the purple dot.) I am not familiar with this part, so I am not sure if what I said is correct. This is just my guess.
The above are all my conjectures about the cause of the abnormal crash of "fleet_adapter". I am not sure whether others have encountered similar problems.