Added design document for remote and multi-machine launching

mlanting commented 3 years ago

The document has changed so much, and it has been so long since I last submitted a PR for the multi-machine launch changes, that I figured it would be more appropriate to just create a new one.

ros-discourse commented 3 years ago

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros2-tooling-wg-next-meeting/12545/36

dirk-thomas commented 3 years ago

Some high-level comments: while some aspects of the document are specific to remote execution:

ExecuteRemote
RemoteSubstitution

several other parts are applicable to a local launch invocation too:

detach the main process and keep the launched processes running as well as continue handling events
Heartbeat
ProcessStatus
all the services like QueryStatus, Start/StopProcess, Shutdown.

It might be good to separate these to make each individual part easier to design, implement and review.

emersonknapp commented 3 years ago

Noting that we should call out explicitly the expectation that the LaunchServiceNode must be accessible within the current discoverable DDS network. The machine may be WAN-accessible via SSH but that case won't work because we use topics/services for all controls.

emersonknapp commented 3 years ago

When talking about multi-machine message passing, how does time sync come into play?

"Out of Scope" section in design is useful.

emersonknapp commented 3 years ago

How do we support Mac + Windows? Do we think the SSH process will work out of the box for those platforms?

ros-discourse commented 3 years ago

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-agenda-2020-09-17/16464/1

AlexKaravaev commented 3 years ago

Hi everyone! As a ROS 2 developer, I really miss this feature. Is there a timeline for it? I see that it's only a design doc, so it's hard to say, but maybe it has been included in a roadmap?

emersonknapp commented 3 years ago

So far there is not a roadmap that I know of. We discussed the design in one of the Tooling Working Group meetings, which is where my above comments came from, but I think the next step is just an updated split design - we identified that this is really 2 separate features that can be independently designed and implemented.

@mlanting should be able to clarify if any proof-of-concept code exists already, I can't remember.

mlanting commented 3 years ago

@AlexKaravaev, @pjreed made a simpler, temporary implementation of multi-machine launching that you can find here: https://github.com/pjreed/launch

I will try to get the next version of the design document out in the next week or so. @roger-strain is currently working on a refactor of the launch system. Once that is complete and this design is approved, we can begin implementation. I expect it will be a few months before we have anything implemented, but pj's implementation should cover basic use cases.

pjreed commented 3 years ago

Just FYI, my changes in the launch repository (on the multi-machine-launch branch) basically just add in some changes necessary to abstract out the launching mechanism; I have another repository at https://github.com/pjreed/ssh_machine which adds an SSH-based launcher that can be used to launch nodes on remote machines via SSH. The functionality is similar to how ROS1's remote launching mechanism works, but there are a few limitations to it; there's some more documentation and an example in that ssh_machine repo.
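To illustrate the general idea of launching a process over SSH (this is a minimal sketch, not code taken from the ssh_machine repository), the snippet below builds the local command line that would run a node on a remote host. The function name `build_ssh_command`, the host `robot2.local`, and the user `ros` are hypothetical and only serve the example.

```python
import shlex
from typing import List, Optional


def build_ssh_command(host: str, remote_argv: List[str],
                      user: Optional[str] = None) -> List[str]:
    """Return a local argv that runs remote_argv on host via the ssh client."""
    target = f"{user}@{host}" if user else host
    # ssh hands the remote command to the remote shell as a single string,
    # so each argument is quoted to survive that extra round of parsing.
    remote_line = " ".join(shlex.quote(arg) for arg in remote_argv)
    return ["ssh", target, remote_line]


if __name__ == "__main__":
    # Hypothetical host and user; the node is resolved on the remote side.
    print(build_ssh_command("robot2.local",
                            ["ros2", "run", "demo_nodes_cpp", "talker"],
                            user="ros"))
```

The resulting list could be passed to something like `subprocess.Popen`. Note that this sketch sends the command by name and leaves executable lookup to the remote machine's environment.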

AlexKaravaev commented 3 years ago

@mlanting @pjreed wow, thanks! I will definitely check it out, because I was thinking of implementing a similar SSH-based launcher. I understand the limitations, and there is certainly a need for native multi-machine launch support in ROS 2, but this repo can still solve some current problems.

gavanderhoorn commented 3 years ago

@mlanting et al.: this could have been discussed in the previous PR, so if it has, please just RTFM me, but has there been any thought about using existing orchestration frameworks for deployment, configuration and orchestration of multi-machine ROS applications?

I'm not immediately thinking of something like k8s, but perhaps something similar might exist, which is a little less 'heavy' but supports like operations ("infrastructure as code" and all that).

Ian Sherman in his keynote at ROSCon18 posited that "backend distributed systems and IoT" (as he phrased it) might have solutions to challenges we still see as problems. Deploying, starting and coordinating a multi-machine distributed application seems like it could be something they might have solutions or best practices for. It does seem like they might have similar requirements (ie: liveness tracking, monitoring, pushing updates, distributed configuration, failover, etc).

pjreed commented 3 years ago

@AlexKaravaev: I haven't polished up my SSH-based launcher mostly because it was intended to be a stopgap solution that just implemented what we needed until a more robust solution is available, and I feel like some of the limitations (most significantly, the executable paths to nodes being resolved on the local computer before the command is sent to the remote computer) make it inappropriate for merging into the official ROS2 ecosystem at this time. Still, hopefully it works for you; let me know if you have any issues, and if it doesn't take too much work to make it more robust, it might make sense to get it into ROS2 proper until we've got a better system.

@gavanderhoorn: When I made my ssh_machine launcher, actually, one of the considerations I had in mind is that it would be good if the remote launching mechanism wasn't tied to any particular implementation; in theory you could also write, say, a K8sMachine class that extends the launch.machine.Machine class to use k8s for launching instead of SSH.
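As a rough illustration of that kind of abstraction (a sketch under stated assumptions, not the actual `launch.machine.Machine` interface from the fork), a machine can be modelled as an object that wraps a command so it runs on a given target. The class and method names below, including `wrap_command`, are hypothetical; only the `kubectl exec <pod> -- <command>` form is real CLI syntax.

```python
from abc import ABC, abstractmethod
from typing import List


class Machine(ABC):
    """Hypothetical launch target; not the real launch.machine.Machine API."""

    @abstractmethod
    def wrap_command(self, argv: List[str]) -> List[str]:
        """Return the local argv that causes argv to run on this machine."""


class LocalMachine(Machine):
    """Runs the command unchanged on the machine doing the launching."""

    def wrap_command(self, argv: List[str]) -> List[str]:
        return list(argv)


class K8sMachine(Machine):
    """Runs the command inside an existing pod through kubectl exec."""

    def __init__(self, pod: str) -> None:
        self.pod = pod

    def wrap_command(self, argv: List[str]) -> List[str]:
        # kubectl exec <pod> -- <command...>
        return ["kubectl", "exec", self.pod, "--", *argv]


if __name__ == "__main__":
    node = ["ros2", "run", "demo_nodes_cpp", "talker"]
    for machine in (LocalMachine(), K8sMachine("robot-pod")):  # hypothetical pod name
        print(machine.wrap_command(node))
```

With this shape, the code that starts processes only depends on `Machine`, so adding another transport means adding another subclass.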

gavanderhoorn commented 3 years ago

@pjreed wrote:

"When I made my ssh_machine launcher, actually, one of the considerations I had in mind is that it would be good if the remote launching mechanism wasn't tied to any particular implementation; in theory you could also write, say, a K8sMachine class that extends the launch.machine.Machine class to use k8s for launching instead of SSH."

The abstraction is nice, but wouldn't that still mean that launch is doing the orchestration? It would be delegating some parts of it to k8s, but that would be it.

My assumption is that it's going to take "the ROS community" quite some time to get to the level of functionality which these existing solutions have already reached. Besides that, it would also seem to be duplication of effort, which is never very nice.

emersonknapp commented 3 years ago

I think overall this looks straightforward, but it still has remnants of the process management node, which shouldn't be necessary for the simple version of this design, right?

mlanting commented 3 years ago

Yeah, I think you're right. I've removed those references and fixed the errors you caught.

ros2 / design

Added design document for remote and multi-machine launching #297