Closed: mjeronimo closed this issue 2 years ago
@orduno See #808 for some background about this issue and a quick code sample.
@mjeronimo is this still an issue?
It is still an issue. Having too many nodes impacts startup times and seems to overwhelm the current DDS implementations.
Possible fixes to this issue are:
@crdelsey can you provide a few examples in the stack where we use "client nodes", other than costmap spinning? I had a hard time finding any.
I think the big one is here in ServiceClient. https://github.com/ros-planning/navigation2/blob/7652518af52f257d097269309986943771d98683/nav2_util/include/nav2_util/service_client.hpp#L38
Then, there's another use in LifecycleManagerClient (https://github.com/ros-planning/navigation2/blob/a095038da9f4e3729679e5d20b39e1d95582fb16/nav2_lifecycle_manager/src/lifecycle_manager_client.cpp#L30).
There may be other places.
Got it, I didn't see that hidden in there. "Fix DDS" feels like a never-ending battle; I'd say options 2 or 3 seem like good general long-term fixes, but option 4 seems more practically in our control. I'll need to do more reading on these topics.
Any objection to using multithreaded executors? That way we could have all of these running at once with one node. Do you think there are applications here sensitive enough (or otherwise not well structured) for multithreaded processing? Most places where we might want to use a multithreaded spinner already have mutexes. I wouldn't think it would be too much of a hassle to update the few places required.
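For reference, a minimal sketch of what that would look like (the node name here is a placeholder, not an actual Nav2 server): instead of spinning the node directly, we hand it to a MultiThreadedExecutor so callbacks from different callback groups can run concurrently.

```cpp
#include <memory>

#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("some_nav2_server");  // placeholder name

  // One node, several threads: callbacks may now run concurrently,
  // which is why the mutexes mentioned above matter.
  rclcpp::executors::MultiThreadedExecutor executor;
  executor.add_node(node);
  executor.spin();

  rclcpp::shutdown();
  return 0;
}
```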
@mjeronimo Do you have any updates on this? Let me know even if you have any unfinished work on this.
@mjeronimo Shivang is taking a look at this topic. Carl mentioned that you had a branch and had started working on it, which may either be done or have made good progress. Can you give us a rundown on where that branch is and what still needs to be completed?
@shivaang12 @SteveMacenski Sorry guys, I haven't been on GitHub recently (we're using GitLab internally for my current project). Anyways, @orduno did the most exploration on this topic. He prototyped at least one alternative approach that, I believe, used multi-threaded executors. I've also brought up the topic directly with folks at Open Robotics and they agree that this is an issue (but I don't think there are any plans to allow re-entrant spinning on the current node anytime soon). In general, I think it's a good idea not to proliferate nodes like this because of the discovery traffic introduced and the possibility of dropping messages.
this is an issue (but I don't think there are any plans to allow re-entrant spinning on the current node anytime soon)
Can confirm on the TSC side, doesn't seem likely for Foxy (or G-turtle... bigger fish to fry). On my wish list of things for OR people to work on that affect navigation users, this isn't high at the moment.
Awesome, hopefully Carlos can chime in. Ignoring Foxy, I'd like a 1 server to 1 node mapping in the next LTS distribution. Multithreaded executors seem to be the lowest-effort, clearest way to handle this, though that does create issues with the components API.
The question then becomes: do we value using components or minimizing nodes more? Currently, I think the answer is minimizing nodes, given how inefficient it is to have multiple. However, that's improving by the week, so I think the long-term answer is components.
I am currently making a PR to use callback groups for service clients instead of nodes #2216.
On the Rolling distribution, there are callback groups that can be attached to a node and added to executors. This allows spin_until_future_complete() to be used while the node is still spinning, effectively making the node multithreaded.
It can be used to execute synchronous service/action calls in timer/subscriber callbacks.
https://github.com/ros-planning/navigation2/pull/2216#issuecomment-793069766
I think this comment gives a good summary of what should be done for each case.
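Roughly, the pattern looks like the sketch below. This is a simplified approximation of the approach in #2216, not the actual patch; the class name and service name are placeholders, and it assumes the Rolling/Galactic-era callback group APIs mentioned above. The client gets its own callback group that is not picked up by the node's main executor, and a small dedicated executor spins only that group until the future is ready.

```cpp
#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_srvs/srv/empty.hpp"

// Simplified sketch: the service client lives in its own callback group that
// is NOT added to the node's default executor, so a dedicated executor can
// spin just that group while the rest of the node keeps spinning elsewhere.
class CallbackGroupClient
{
public:
  explicit CallbackGroupClient(const rclcpp::Node::SharedPtr & node)
  : node_(node)
  {
    callback_group_ = node_->create_callback_group(
      rclcpp::CallbackGroupType::MutuallyExclusive,
      false /* do not add to the node's default executor */);
    callback_group_executor_.add_callback_group(
      callback_group_, node_->get_node_base_interface());
    client_ = node_->create_client<std_srvs::srv::Empty>(
      "demo_service", rmw_qos_profile_services_default, callback_group_);
  }

  bool call(std::chrono::seconds timeout = std::chrono::seconds(5))
  {
    auto request = std::make_shared<std_srvs::srv::Empty::Request>();
    auto future = client_->async_send_request(request);
    // Spin only this client's callback group; no throwaway client node needed.
    return callback_group_executor_.spin_until_future_complete(future, timeout) ==
           rclcpp::FutureReturnCode::SUCCESS;
  }

private:
  rclcpp::Node::SharedPtr node_;
  rclcpp::CallbackGroup::SharedPtr callback_group_;
  rclcpp::executors::SingleThreadedExecutor callback_group_executor_;
  rclcpp::Client<std_srvs::srv::Empty>::SharedPtr client_;
};
```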
Unfortunately, on deeper inspection, each new executor creates a new context (unless one is given on construction). Therefore, with the new design https://design.ros2.org/articles/Node_to_Participant_mapping.html, each context now maps to a DDS participant, so this is really no different from just having multiple client nodes.
Open question then: if we gave the node's context to a new separate executor_ on construction, is that the same DDS participant, and can it also be processed in parallel?
I think this probably deserves a dedicated redesign session to think about the executor and threading options we have available and to go through our server list to see what makes the most sense in order to reduce the number of DDS participants and nodes. We should also analyze what actions, services, and subscribers exist in each server and add them to separate callback groups if necessary, so spinning has a deterministic output and processes the "right" information in each context (not a ROS context, a software one). Right now, everything is just automatically added to one. Especially for the planner / controller / waypoint servers, it may make sense to separate these callback groups and either process them all in parallel or spin a particular one.
The right answer is probably a mix of multi-threaded executors and multiple single-threaded executors, depending on the server's needs. We also need to settle the context question above: whether we can share a context across multiple executor objects to run in parallel without adding additional participants on the network.
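As a starting point for that design session, here is a rough sketch of the direction. The server class, topics, and callback bodies are made up (not an actual Nav2 server): latency-sensitive work gets its own callback group inside a single node, and a multi-threaded executor runs the groups in parallel without adding nodes or DDS participants.

```cpp
#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

// Hypothetical server: one node, two callback groups, no extra client nodes.
class HypotheticalServer : public rclcpp::Node
{
public:
  HypotheticalServer()
  : rclcpp::Node("hypothetical_server")
  {
    control_group_ = create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive);
    sensor_group_ = create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive);

    // Subscriptions are assigned to a group through SubscriptionOptions;
    // otherwise everything lands in the node's default group.
    rclcpp::SubscriptionOptions sensor_options;
    sensor_options.callback_group = sensor_group_;
    sensor_sub_ = create_subscription<std_msgs::msg::String>(
      "sensor_topic", rclcpp::SensorDataQoS(),
      [](std_msgs::msg::String::SharedPtr /*msg*/) {/* process sensor data */},
      sensor_options);

    control_timer_ = create_wall_timer(
      std::chrono::milliseconds(100),
      []() {/* run the control loop in its own group */},
      control_group_);
  }

private:
  rclcpp::CallbackGroup::SharedPtr control_group_;
  rclcpp::CallbackGroup::SharedPtr sensor_group_;
  rclcpp::Subscription<std_msgs::msg::String>::SharedPtr sensor_sub_;
  rclcpp::TimerBase::SharedPtr control_timer_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::executors::MultiThreadedExecutor executor;
  auto node = std::make_shared<HypotheticalServer>();
  executor.add_node(node);
  executor.spin();
  rclcpp::shutdown();
  return 0;
}
```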
I spoke with Chris from OR @clalancette and he gave me the key to understanding this: while a new context appears to be constructed for a new executor, when none is given as an input it actually uses the default context, which is stored as a static variable (https://github.com/ros2/rclcpp/blob/master/rclcpp/src/rclcpp/contexts/default_context.cpp#L23). Therefore, any new executor, node, etc. will use that same context unless you give a different context to the object on construction.
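In other words (a small illustrative sketch, not code from the tree, with a placeholder node name): a default-constructed executor and a default-constructed node both end up on the same global default context, and explicitly handing the node's context to the executor gives the same result.

```cpp
#include <memory>

#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("demo_node");  // placeholder name

  // A default-constructed executor falls back to the shared global default
  // context, so it does not create an extra DDS participant.
  rclcpp::executors::SingleThreadedExecutor default_executor;

  // Explicitly handing the executor the node's own context gives the same
  // result, because the node was also created on the global default context.
  rclcpp::ExecutorOptions options;
  options.context = node->get_node_options().context();
  rclcpp::executors::SingleThreadedExecutor explicit_executor(options);

  rclcpp::shutdown();
  return 0;
}
```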
Thus, we can continue this work, but the question remains how necessary it is. Essentially, this is a refactoring effort now: we're working on turning N nodes into 1 node with N threads using a multi-threaded executor. "What's better: N single-threaded executors or 1 multi-threaded executor with N threads?" There's not a clear answer to that, but there shouldn't be insane performance differentials now.
It would be good to do this refactoring to make the best use of ROS 2, but it is no longer a critical requirement. I believe we now have sufficient information to make decisions and execute on this task when someone has time to work on it.
I see several nodes that can still be removed.
- client_node: easy, need to use a callback group instead
- bond_client_node_: easy if we don't do waitUntilFormed
- client_node_: needed but not used by layers and filters
- rclcpp_node_: difficult, needed to pub/sub TF

I have searched for all internal nodes that need to be removed, and I will do this work this summer. Any suggestions are welcome.

Remove rclcpp_node_ in LifecycleNode:
- rclcpp_node_ in class AmclNode: https://github.com/ros-planning/navigation2/pull/2483
- rclcpp_node_ in class Costmap2DROS: https://github.com/ros-planning/navigation2/pull/2489
- rclcpp_node_ in class MapSaver: https://github.com/ros-planning/navigation2/pull/2454
- rclcpp_node_ in class PlannerServer: https://github.com/ros-planning/navigation2/pull/2459, https://github.com/ros-planning/navigation2/pull/2480
- rclcpp_node_ in class ControllerServer: https://github.com/ros-planning/navigation2/pull/2459, https://github.com/ros-planning/navigation2/pull/2479
- use_rclcpp_node in the constructor of LifecycleNode, update all derived classes: https://github.com/ros-planning/navigation2/pull/2993

Other independent cases:
- client_node_ in class WaypointFollower: https://github.com/ros-planning/navigation2/pull/2441
- client_node_ in class Costmap2DROS: https://github.com/ros-planning/navigation2/pull/2489
- node_ in class LifecycleManagerClient: https://github.com/ros-planning/navigation2/pull/2469
- bond_client_node_ in class LifecycleManager: https://github.com/ros-planning/navigation2/pull/2456

Internal nodes that need to remain:
- client_node_ in class BtActionServer
- client_node_ in class Nav2Panel
@gezp I saw another of your PRs was merged, is this now good to move forward on the AMCL / costmap PRs?
@SteveMacenski the upstream PR https://github.com/ros2/geometry2/pull/447 is still blocked.
Final PR imminent to merge
There are several nodes in the system that are created and used solely for interfacing with services. These "client nodes" are used because spin_until_future_complete cannot accept a node that is already on an executor. Instead of spin_until_future_complete, we can use future.wait_until. This will allow us to remove several of these client nodes in the system, which will hopefully ease discovery congestion.
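For illustration, a minimal sketch of that pattern (the service type, function name, and names are placeholders, not from the Nav2 tree; it uses wait_for with a duration, the sibling of the wait_until deadline form mentioned above): because the owning node is already spinning on its executor, the response callback completes the future there, and the caller only blocks on the future.

```cpp
#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_srvs/srv/empty.hpp"

// Sketch: call a service from a node that is already being spun by an
// executor elsewhere. That executor services the response callback, so the
// caller just waits on the future; no spin_until_future_complete and no
// throwaway client node.
bool call_service(
  const rclcpp::Client<std_srvs::srv::Empty>::SharedPtr & client,
  std::chrono::seconds timeout)
{
  auto request = std::make_shared<std_srvs::srv::Empty::Request>();
  auto future = client->async_send_request(request);
  return future.wait_for(timeout) == std::future_status::ready;
}
```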